Thread: Incremental backups, and backup history

Incremental backups, and backup history

From
Antonios Christofides
Date:
Hi,

I have two backup questions, not much related to each other; here they
are.

First: With PostgreSQL, I can't do incremental backups. pg_dump will
dump the entire database. Thus, if I want to keep daily backups on tape,
I'm in trouble because I'll have to do a full backup every day, which
may need several hours and several tapes.

One workaround I'm thinking of is to not store BLOBs in the database, but
store them in the filesystem and store the filenames in the database
instead.  This needs some additional work (for example, a garbage
collector to periodically delete unreferenced files), but will move a
large amount of space from the database into the filesystem, which is
capable of incremental backups. Only BLOBs, that is; for some tables
that will have several tens of millions of small rows, I can't think of
any workaround.
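To make the idea concrete, here is a sketch of such a schema (table and column names are only illustrative):

```sql
-- Documents carry only a path; the actual bytes live on the filesystem,
-- which can be backed up incrementally.
CREATE TABLE document (
    id       serial PRIMARY KEY,
    title    text NOT NULL,
    filepath text NOT NULL UNIQUE
);

-- The garbage collector would periodically list every file under the
-- storage directory and delete those absent from this query's result:
SELECT filepath FROM document;
```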

Is this filenames-instead-of-BLOBs for easier backup common practice?
Any other ideas or comments?

My second question is a general relational database backup question, not
specifically related to pgsql. Sometimes a user accidentally
deletes/corrupts a file, and discovers it three days later. After they
come panicking to me, I can give them their file as it was three days
ago, because of the backup tape rotation. Now suppose a user deletes
ten employees from the database, and three days later they understand
that this was a bad thing. Now what? I can restore the entire database
and make it as it was three days ago, but I can't restore the particular
deleted records in the current database, as the relations make the
database behave as a single unit.

A colleague suggested, instead of updating or deleting rows, to only
insert rows with a timestamp; for example, instead of updating the
employee row with id=55, you insert a new row with id=55 with the
updated data, and change the primary key to (id, dateinserted). You then
always select the row with max dateinserted. A garbage collector is also
needed to periodically delete obsolete rows older than, say, six months.
Improvements can be made (such as using dateobsoleted instead of
dateinserted or moving the old rows to another table), but even in the
simplest cases I think it will be extremely hard to implement such a
system, again because of the relations.
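As a sketch, the colleague's scheme might look like this (columns invented for illustration):

```sql
-- Versioned employee table: rows are only ever INSERTed, never
-- UPDATEd or DELETEd in place.
CREATE TABLE employee (
    id           integer   NOT NULL,
    name         text      NOT NULL,
    dateinserted timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY (id, dateinserted)
);

-- "Updating" employee 55 means inserting a new version:
INSERT INTO employee (id, name) VALUES (55, 'New Name');

-- The current state is always the row with the max dateinserted:
SELECT * FROM employee
WHERE id = 55
ORDER BY dateinserted DESC
LIMIT 1;
```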

So, is it a matter of database design? Do I have to design the database
so that it keeps the history of what happened?

Thanks everyone for the answers.

Re: Incremental backups, and backup history

From
"Shridhar Daithankar"
Date:
On 19 Jun 2003 at 11:42, Antonios Christofides wrote:

> Hi,
>
> I have two backup questions, not much related to each other; here they
> are.
>
> First: With PostgreSQL, I can't do incremental backups. pg_dump will
> dump the entire database. Thus, if I want to keep daily backups on tape,
> I'm in trouble because I'll have to do a full backup every day, which
> may need several hours and several tapes.
>
> One workaround I'm thinking is to not store BLOBs in the database, but
> store them in the filesystem and store the filenames in the database
> instead.  This needs some additional work (for example, a garbage
> collector to periodically delete unreferenced files), but will move a
> large amount of space from the database into the filesystem, which is
> capable of incremental backups. Only BLOBs, that is; for some tables
> that will have several tens of millions of small rows, I can't think of
> any workaround.

A cheaper solution in terms of effort is to add a latest-update timestamp to
each blob and dump with a custom SQL query.

Incremental backups would be real nice. If you have really high turnover, I
would suggest you look at an asynchronous replication solution which works
based on WAL segmenting. That way you would back up only the things that have
changed.
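For example (column name and cutoff date invented for illustration):

```sql
-- Record when each blob row was last touched.
ALTER TABLE blobs ADD COLUMN last_updated timestamp;
UPDATE blobs SET last_updated = now();

-- Keep it current from the application (or a trigger); then an
-- "incremental dump" is just a custom query over the changed rows:
SELECT * FROM blobs
WHERE last_updated > '2003-06-18';  -- everything changed since the last dump
```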

HTH

Bye
 Shridhar

--
Blore's Razor:    Given a choice between two theories, take the one which is
funnier.


Re: Incremental backups, and backup history

From
Ron Johnson
Date:
On Thu, 2003-06-19 at 03:42, Antonios Christofides wrote:
> Hi,
>
[snip]
> My second question is a general relational database backup question, not
> specifically related to pgsql. Sometimes a user accidentally
> deletes/corrupts a file, and discovers it three days later. After they
> come panicking to me, I can give them their file as it was three days
> ago, because of the backup tape rotation. Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.

One possibility that negates the need for "garbage collection" is
to create a history table that mirrors the primary table, and
also has an "activity_code char(1)", with values either "U" or "D".

Then an "before update" or "before delete" trigger would automatically
insert the pre-update-or-delete record into the history table, along
with U or D in activity_code.

Thus, you can look through the history table and find the accidentally
deleted records.
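A sketch of that trigger in PL/pgSQL (table and column names invented, loosely following the audit-trigger example in the PL/pgSQL documentation):

```sql
-- History table mirroring the primary table, plus the activity code.
CREATE TABLE employee_history (
    LIKE employee,
    activity_code char(1)   NOT NULL,  -- 'U' = updated, 'D' = deleted
    changed_at    timestamp NOT NULL DEFAULT now()
);

CREATE FUNCTION employee_audit() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO employee_history SELECT OLD.*, 'D', now();
        RETURN OLD;   -- let the DELETE proceed
    ELSE
        INSERT INTO employee_history SELECT OLD.*, 'U', now();
        RETURN NEW;   -- let the UPDATE proceed with its new values
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER employee_audit_trg
    BEFORE UPDATE OR DELETE ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_audit();
```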

--
+-----------------------------------------------------------+
| Ron Johnson, Jr.     Home: ron.l.johnson@cox.net          |
| Jefferson, LA  USA   http://members.cox.net/ron.l.johnson |
|                                                           |
| "Oh, great altar of passive entertainment, bestow upon me |
|  thy discordant images at such speed as to render linear  |
|  thought impossible" (Calvin, regarding TV)               |
+-----------------------------------------------------------+


Re: Incremental backups, and backup history

From
"Shridhar Daithankar"
Date:
On 19 Jun 2003 at 4:15, Ron Johnson wrote:
> One possibility that negates the need for "garbage collection" is
> to create a history table that mirrors the primary table, and
> also has an "activity_code char(1)", with values either "U" or "D".
>
> Then an "before update" or "before delete" trigger would automatically
> insert the pre-update-or-delete record into the history table, along
> with U or D in activity_code.

That makes me think. You can write before insert/before update/before delete
triggers and use dblink in conjunction with them to mimic some kind of
mirroring.

Of course that is bad for keeping transactions in sync, but it remains an
option. If you put the remote database connection in async mode in the dblink
module, you can reduce the impact of the remote operation as well.

HTH

Bye
 Shridhar

--
Virtue is a relative term.        -- Spock, "Friday's Child", stardate 3499.1


Re: Incremental backups, and backup history

From
Bruno Wolff III
Date:
On Thu, Jun 19, 2003 at 11:42:28 +0300,
  Antonios Christofides <A.Christofides@itia.ntua.gr> wrote:
>
> My second question is a general relational database backup question, not
> specifically related to pgsql. Sometimes a user accidentally
> deletes/corrupts a file, and discovers it three days later. After they
> come panicking to me, I can give them their file as it was three days
> ago, because of the backup tape rotation. Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.

This isn't a good situation to be in. If you log all of the transactions,
one option may be to restore the database as of the last time it was
valid, remove the bogus transactions from the transaction log, and then replay
the transactions from the time of the backup to the present.
This will mean downtime for the database, and it is possible that removing
the bogus transactions will affect things in ways you don't want as you
reapply the other transactions.

Another option is to figure out what the bogus transactions did (which keeping
a history will help out with) and try to undo them as best as possible with
new transactions.

Either of the above is going to need manual oversight.

Re: Incremental backups, and backup history

From
Greg Stark
Date:
Antonios Christofides <A.Christofides@itia.ntua.gr> writes:

> Is this filenames-instead-of-BLOBs for easier backup common practice?
> Any other ideas or comments?

This is a major point of contention. Some people think keeping all data in the
database is a better approach, others think data that isn't inherently
relational and doesn't need the protection of transactions doesn't really
belong in the database.

I happen to be firmly in the camp against putting such files in the database.
But it depends a lot on what your needs are. In every case I've faced this it
was simply useful to have the files accessible without piping them through the
database protocol every time. I could point the web server at the directory
and serve them up directly, or rsync them to the machines that could do that.
They could even be served up from a separate lightweight web server without
any database access at all, which would have been impossible if the only
way to access them was via the database.

If you need to be able to update portions of your blobs, or if you need
transactional safety then you may need the database.

> My second question is a general relational database backup question, not
> specifically related to pgsql. Sometimes a user accidentally
> deletes/corrupts a file, and discovers it three days later. After they
> come panicking to me, I can give them their file as it was three days
> ago, because of the backup tape rotation. Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.

What's worse is you may have several related changes to multiple tables. And
then you won't know if any other changes to other tables were later done that
depended on that data. There will be no way to do this perfectly in general.

If your data is particularly amenable to this form of normalization then it
can be useful though. For example, instead of storing counters that are
incremented, even when the total is the only interesting statistic, I normally
insert new records for every event. If ever it turns out something was wrong
and the events should be disregarded for a particular time period, or with
particular other properties, I have the complete history and can do that.
Inserts are also more efficient for the database to do than updates.
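A sketch of that pattern (schema invented for illustration):

```sql
-- Instead of: UPDATE page_stats SET hits = hits + 1 WHERE page = '/x';
-- record one row per event:
CREATE TABLE page_hit (
    page   text      NOT NULL,
    hit_at timestamp NOT NULL DEFAULT now()
);

INSERT INTO page_hit (page) VALUES ('/x');

-- The total is just an aggregate...
SELECT count(*) FROM page_hit WHERE page = '/x';

-- ...and suspect periods can be excluded after the fact:
SELECT count(*) FROM page_hit
WHERE page = '/x'
  AND hit_at NOT BETWEEN '2003-06-16' AND '2003-06-19';
```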

But for something like an employee table you're probably going to be stuck
with restoring the table to a new name, and having a human look over the old
data and the current data and updating the current data appropriately.

You may want to consider having a "deleted" flag column and not ever deleting
records. So at least undelete can be an easy operation that doesn't even need
restoring backups.

--
greg

Re: Incremental backups, and backup history

From
Dennis Gearon
Date:
On your second question:

Keeping old data helps with data analysis, i.e., data mining. I would record events
such as the hired and fired dates as transactions. To see if an employee is still an
employee, look for the latest transaction: hired, rehired, contracted with as a
temp/consultant, fired, laid off, etc.

Antonios Christofides wrote:

> Hi,
>
> I have two backup questions, not much related to each other; here they
> are.
>
> First: With PostgreSQL, I can't do incremental backups. pg_dump will
> dump the entire database. Thus, if I want to keep daily backups on tape,
> I'm in trouble because I'll have to do a full backup every day, which
> may need several hours and several tapes.
>
> One workaround I'm thinking is to not store BLOBs in the database, but
> store them in the filesystem and store the filenames in the database
> instead.  This needs some additional work (for example, a garbage
> collector to periodically delete unreferenced files), but will move a
> large amount of space from the database into the filesystem, which is
> capable of incremental backups. Only BLOBs, that is; for some tables
> that will have several tens of millions of small rows, I can't think of
> any workaround.
>
> Is this filenames-instead-of-BLOBs for easier backup common practice?
> Any other ideas or comments?
>
> My second question is a general relational database backup question, not
> specifically related to pgsql. Sometimes a user accidentally
> deletes/corrupts a file, and discovers it three days later. After they
> come panicking to me, I can give them their file as it was three days
> ago, because of the backup tape rotation. Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.
>
> A colleague suggested, instead of updating or deleting rows, to only
> insert rows with a timestamp; for example, instead of updating the
> employee row with id=55, you insert a new row with id=55 with the
> updated data, and change the primary key to (id, dateinserted). You then
> always select the row with max dateinserted. A garbage collector is also
> needed to periodically delete obsolete rows older than, say, six months.
> Improvements can be made (such as using dateobsoleted instead of
> dateinserted or moving the old rows to another table), but even in the
> simplest cases I think it will be extremely hard to implement such a
> system, again because of the relations.
>
> So, it is a matter of database design? Do I have to design the database
> so that it keeps the history of what happened?
>
> Thanks everyone for the answers.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
>


Re: Incremental backups, and backup history

From
"Matthew Nuzum"
Date:
Regarding backup history:

I have an application designed for novices.  Apparently it's easy to hit the
"Delete" button, and then say yes to the "Are you sure you want to delete
this?" question even when they don't want to. Therefore I simply mark a
record as deleted.  For example,
UPDATE table SET deleted='t' WHERE something=true;

Then my application logic pretends it doesn't really exist until two days
later the user decides they want it back.

It works very well for me.

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org


> -----Original Message-----
> From: Bruno Wolff III [mailto:bruno@wolff.to]
> Sent: Thursday, June 19, 2003 6:59 AM
> To: Antonios Christofides
> Cc: pgsql-general@postgresql.org
> Subject: Re: Incremental backups, and backup history
>
> On Thu, Jun 19, 2003 at 11:42:28 +0300,
>   Antonios Christofides <A.Christofides@itia.ntua.gr> wrote:
> >
> > My second question is a general relational database backup question, not
> > specifically related to pgsql. Sometimes a user accidentally
> > deletes/corrupts a file, and discovers it three days later. After they
> > come panicking to me, I can give them their file as it was three days
> > ago, because of the backup tape rotation. Now suppose a user deletes
> > ten employees from the database, and three days later they understand
> > that this was a bad thing. Now what? I can restore the entire database
> > and make it as it was three days ago, but I can't restore the particular
> > deleted records in the current database, as the relations make the
> > database behave as a single unit.
>
> This isn't a good situation to be in. If you log all of the transactions,
> one option may be to restore the database as of the last time it was
> valid, remove the bogus transactions from the transaction log and then
> replay
> the transactions from the time of the backup to the present.
> This will mean downtime for the database and it is possible that removing
> the bogus transactions will affect things as you reapply other
> transactions
> in ways you don't want.
>
> Another option is to figure out what the bogus transactions did (which
> keeping
> a history will help out with) and try to undo them as best as possible
> with
> new transactions.
>
> Either of the above are going to need manual oversight.


Re: Incremental backups, and backup history

From
"Nigel J. Andrews"
Date:
On Thu, 19 Jun 2003, Matthew Nuzum wrote:

> Regarding backup history:
>
> I have an application designed for novices.  Apparently it's easy to hit the
> "Delete" button, and then say yes to the "Are you sure you want to delete
> this?" question even when they don't want to. Therefore I simply mark a
> record as deleted.  For example,
> UPDATE table SET deleted='t' WHERE something=true;
>
> Then my application logic pretends it doesn't really exist until two days
> later the user decides they want it back.
>
> It works very well for me.
>

But are you also taking care of the referential integrity issues, i.e.
disallowing tuples with deleted = true from being referenced, and ensuring
nothing references them at the time they are marked as deleted?
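One sketch of such a check (table and column names invented) is a trigger on the referencing table:

```sql
-- Refuse to let a child row reference a parent marked deleted.
CREATE FUNCTION check_parent_not_deleted() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM parent
               WHERE id = NEW.parent_id AND deleted) THEN
        RAISE EXCEPTION 'parent % is marked deleted', NEW.parent_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER child_parent_check
    BEFORE INSERT OR UPDATE ON child
    FOR EACH ROW EXECUTE PROCEDURE check_parent_not_deleted();
```

Note this covers only one direction (a child referencing a deleted parent); the other direction, refusing to mark a parent deleted while children still reference it, needs a matching trigger on the parent table.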

It is a useful idea, but as I know from a current project, it requires
reimplementing foreign key functionality. In this case the middleware only uses
functions, one per statement, and nothing else, so I have been able to do much
of this in those functions, but it's still a pain. I even wrote a utility to
take some of the leg work out of generating and maintaining quite a few
functions, but if I'd had time (and thought about these basically being foreign
key constraints) I'd have looked at the existing foreign key code and seen if I
could copy and amend it, or just amend it in place.


--
Nigel Andrews



Re: Incremental backups, and backup history

From
Dennis Gearon
Date:
That's a good point, ref integrity and 'deleted' items. I'll have to take a look at
that as I make my next design. I'm surprised that I didn't think of it. But I
probably would have experienced it soon, as I am getting ready to put data in the
design I'm on now.

One way I know that makes it all easier is to use surrogate integer keys on all
tables, i.e. sequences, as the primary key.

Nigel J. Andrews wrote:

> On Thu, 19 Jun 2003, Matthew Nuzum wrote:
>
>
>>Regarding backup history:
>>
>>I have an application designed for novices.  Apparently it's easy to hit the
>>"Delete" button, and then say yes to the "Are you sure you want to delete
>>this?" question even when they don't want to. Therefore I simply mark a
>>record as deleted.  For example,
>>UPDATE table SET deleted='t' WHERE something=true;
>>
>>Then my application logic pretends it doesn't really exist until two days
>>later the user decides they want it back.
>>
>>It works very well for me.
>>
>
>
> But are you also taking care of the referential integrity issues, i.e. only
> disallowing tuples with a deleted = true from being referenced to and ensuring
> nothing references them at the time they are marked as deleted.
>
> It is a useful idea but as I know from a current project it requires
> reimplementing foreign key functionality. In this case the middleware only uses
> functions, one per statement, and nothing else, so I have been able to do much
> of this in those functions but it's still a pain. I even wrote a utility to
> take some of the leg work out of generating and maintaining quite a few
> functions but if I'd had time [and thought about these basically being foreign
> key constraints] I'd have looked at the existing foreign key code and seen if I
> could copy and amend it or just amend it in place.
>
>
> --
> Nigel Andrews
>
>
>


Re: Incremental backups, and backup history

From
Csaba Nagy
Date:
I think somebody already suggested the following solution, which would
solve the referential integrity problems:
 - create a "history" table with the exact layout of the working table,
but without its FK constraints, and optionally with an insertion date;
 - create a trigger on the working table so that it inserts into the
history table the current state on update/delete/(maybe insert).
Now when you delete a row from the working table, you have all the
benefits of FKs, and the history table provides all the history info you
need.
Recovery would be possible only by manually copying back from the history
table, with possible FK hassles, but the data would be there.
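A manual recovery from such a history table might then look like this (names invented for illustration):

```sql
-- Copy the newest historical version of the accidentally deleted
-- employees back into the working table.
INSERT INTO employee (id, name, salary)
SELECT DISTINCT ON (id) id, name, salary
FROM employee_history
WHERE id IN (55, 56, 57)          -- the rows the user deleted
ORDER BY id, inserted_at DESC;    -- newest version of each id first
```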

Cheers,
Csaba.

On Fri, 2003-06-20 at 18:28, Dennis Gearon wrote:
> that's a good point, ref integrity and 'deleted' items. I'll have to take a look at that as I make my next design.
> I'm surprised that I didn't think of it. But I probably would have experienced it soon, as I am getting ready to put
> data in the design I'm on now.
>
> One way I know that makes it all easier, is to use surrogate integer keys on all tables, i.e. sequences, as the
> primary key.
>
> Nigel J. Andrews wrote:
>
> > On Thu, 19 Jun 2003, Matthew Nuzum wrote:
> >
> >
> >>Regarding backup history:
> >>
> >>I have an application designed for novices.  Apparently it's easy to hit the
> >>"Delete" button, and then say yes to the "Are you sure you want to delete
> >>this?" question even when they don't want to. Therefore I simply mark a
> >>record as deleted.  For example,
> >>UPDATE table SET deleted='t' WHERE something=true;
> >>
> >>Then my application logic pretends it doesn't really exist until two days
> >>later the user decides they want it back.
> >>
> >>It works very well for me.
> >>
> >
> >
> > But are you also taking care of the referential integrity issues, i.e. only
> > disallowing tuples with a deleted = true from being referenced to and ensuring
> > nothing references them at the time they are marked as deleted.
> >
> > It is a useful idea but as I know from a current project it requires
> > reimplementing foreign key functionality. In this case the middleware only uses
> > functions, one per statement, and nothing else, so I have been able to do much
> > of this in those functions but it's still a pain. I even wrote a utility to
> > take some of the leg work out of generating and maintaining quite a few
> > functions but if I'd had time [and thought about these basically being foreign
> > key constraints] I'd have looked at the existing foreign key code and seen if I
> > could copy and amend it or just amend it in place.
> >
> >
> > --
> > Nigel Andrews
> >
> >
> >