Thread: Incremental backups, and backup history
Hi,

I have two backup questions, not much related to each other; here they are.

First: with PostgreSQL, I can't do incremental backups. pg_dump dumps the entire database. Thus, if I want to keep daily backups on tape, I'm in trouble: I'll have to do a full backup every day, which may need several hours and several tapes.

One workaround I'm considering is to not store BLOBs in the database, but to store them in the filesystem and keep only the filenames in the database. This needs some additional work (for example, a garbage collector to periodically delete unreferenced files), but it moves a large amount of space from the database into the filesystem, which is capable of incremental backups. That only covers BLOBs, though; for some tables that will have several tens of millions of small rows, I can't think of any workaround.

Is this filenames-instead-of-BLOBs approach for easier backup common practice? Any other ideas or comments?

My second question is a general relational database backup question, not specifically related to pgsql. Sometimes a user accidentally deletes or corrupts a file, and discovers it three days later. When they come panicking to me, I can give them their file as it was three days ago, thanks to the backup tape rotation. Now suppose a user deletes ten employees from the database, and three days later they realize that this was a bad thing. Now what? I can restore the entire database to its state of three days ago, but I can't restore just the deleted records into the current database, as the relations make the database behave as a single unit.

A colleague suggested, instead of updating or deleting rows, to only insert rows with a timestamp; for example, instead of updating the employee row with id=55, you insert a new row with id=55 containing the updated data, and change the primary key to (id, dateinserted). You then always select the row with max dateinserted.
A garbage collector is also needed to periodically delete obsolete rows older than, say, six months. Improvements are possible (such as using dateobsoleted instead of dateinserted, or moving the old rows to another table), but even in the simplest cases I think such a system would be extremely hard to implement, again because of the relations.

So, is it a matter of database design? Do I have to design the database so that it keeps a history of what happened?

Thanks everyone for the answers.
On 19 Jun 2003 at 11:42, Antonios Christofides wrote:
> First: With PostgreSQL, I can't do incremental backups. pg_dump will
> dump the entire database.
> [snip]
> One workaround I'm thinking is to not store BLOBs in the database, but
> store them in the filesystem and store the filenames in the database
> instead. [snip] Only BLOBs, that is; for some tables
> that will have several tens of millions of small rows, I can't think of
> any workaround.

A cheaper solution in terms of effort is to add a latest-update timestamp to each BLOB and dump with custom SQL. Incremental backups would be real nice.

If you have really high turnover, I would suggest you look at an asynchronous replication solution that works by segmenting the WAL. That way you would back up only the things that have changed.

HTH

Bye
 Shridhar

--
Blore's Razor: Given a choice between two theories, take the one which is funnier.
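The timestamp-column idea can be sketched like this (a rough SQLite-via-Python illustration; the `docs` table, its columns, and the cutoff date are all invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (
        id      INTEGER PRIMARY KEY,
        blob    BLOB,
        updated TEXT NOT NULL   -- maintained by the application on every write
    );
""")
conn.execute("INSERT INTO docs VALUES (1, x'00', '2003-06-10')")
conn.execute("INSERT INTO docs VALUES (2, x'01', '2003-06-19')")

# An "incremental dump" selects only the rows touched since the last backup,
# instead of pg_dump-ing the whole table every night.
last_backup = '2003-06-15'
changed = conn.execute(
    "SELECT id FROM docs WHERE updated > ?", (last_backup,)
).fetchall()
print(changed)
```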
On Thu, 2003-06-19 at 03:42, Antonios Christofides wrote:
> Hi,
> [snip]
> Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.

One possibility that negates the need for "garbage collection" is to create a history table that mirrors the primary table, and also has an "activity_code char(1)" column, with values of either "U" or "D".

Then a "before update" or "before delete" trigger would automatically insert the pre-update-or-delete record into the history table, along with U or D in activity_code.

Thus, you can look through the history table and find the accidentally deleted records.

--
+-----------------------------------------------------------+
| Ron Johnson, Jr.     Home: ron.l.johnson@cox.net          |
| Jefferson, LA USA    http://members.cox.net/ron.l.johnson |
|                                                           |
| "Oh, great altar of passive entertainment, bestow upon me |
|  thy discordant images at such speed as to render linear  |
|  thought impossible" (Calvin, regarding TV)               |
+-----------------------------------------------------------+
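The history-table-plus-trigger idea can be sketched as follows. This is a minimal illustration using SQLite triggers through Python; PostgreSQL would use a plpgsql trigger function instead, and all table and column contents here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Mirrors employee, plus an activity code: 'U' = updated, 'D' = deleted.
    CREATE TABLE employee_history (
        id            INTEGER,
        name          TEXT,
        activity_code CHAR(1)
    );
    -- Save the old row image before it is changed or removed.
    CREATE TRIGGER employee_upd BEFORE UPDATE ON employee
    BEGIN
        INSERT INTO employee_history VALUES (OLD.id, OLD.name, 'U');
    END;
    CREATE TRIGGER employee_del BEFORE DELETE ON employee
    BEGIN
        INSERT INTO employee_history VALUES (OLD.id, OLD.name, 'D');
    END;
""")

conn.execute("INSERT INTO employee VALUES (55, 'Alice')")
conn.execute("UPDATE employee SET name = 'Alice Smith' WHERE id = 55")
conn.execute("DELETE FROM employee WHERE id = 55")

# The accidentally deleted record is still visible in the history table.
history = conn.execute("SELECT * FROM employee_history").fetchall()
print(history)
```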
On 19 Jun 2003 at 4:15, Ron Johnson wrote:
> One possibility that negates the need for "garbage collection" is
> to create a history table that mirrors the primary table, and
> also has an "activity_code char(1)", with values either "U" or "D".
>
> Then a "before update" or "before delete" trigger would automatically
> insert the pre-update-or-delete record into the history table, along
> with U or D in activity_code.

That makes me think. You can write before insert/before update/before delete triggers and use dblink in conjunction with them to mimic some kind of mirroring. Of course that is bad for keeping transactions in sync, but it remains an option. If you put the remote database connection in async mode in the dblink module, you can reduce the impact of the remote operation as well.

HTH

Bye
 Shridhar

--
Virtue is a relative term.
		-- Spock, "Friday's Child", stardate 3499.1
On Thu, Jun 19, 2003 at 11:42:28 +0300, Antonios Christofides <A.Christofides@itia.ntua.gr> wrote:
> Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what? I can restore the entire database
> and make it as it was three days ago, but I can't restore the particular
> deleted records in the current database, as the relations make the
> database behave as a single unit.

This isn't a good situation to be in. If you log all of the transactions, one option may be to restore the database as of the last time it was valid, remove the bogus transactions from the transaction log, and then replay the transactions from the time of the backup to the present. This will mean downtime for the database, and it is possible that removing the bogus transactions will affect things in ways you don't want as you reapply the other transactions.

Another option is to figure out what the bogus transactions did (which keeping a history will help with) and try to undo them as best as possible with new transactions.

Either of the above is going to need manual oversight.
Antonios Christofides <A.Christofides@itia.ntua.gr> writes:
> Is this filenames-instead-of-BLOBs for easier backup common practice?
> Any other ideas or comments?

This is a major point of contention. Some people think keeping all data in the database is the better approach; others think data that isn't inherently relational and doesn't need the protection of transactions doesn't really belong in the database.

I happen to be firmly in the camp against putting such files in the database, but it depends a lot on what your needs are. In every case where I've faced this, it was simply useful to have the files accessible without piping them through the database protocol every time. I could point the web server at the directory and serve them up directly, or rsync them to the machines that could do that. They could even be served up from a separate lightweight web server without any database access at all, which would have been impossible if the only way to access them was via the database.

If you need to be able to update portions of your BLOBs, or if you need transactional safety, then you may need the database.

> Now suppose a user deletes
> ten employees from the database, and three days later they understand
> that this was a bad thing. Now what?
> [snip]

What's worse is that you may have several related changes across multiple tables. And then you won't know whether any later changes to other tables depended on that data.
There will be no way to do this perfectly in general. If your data is particularly amenable to this form of normalization, though, it can be useful.

For example, instead of storing counters that are incremented, even when the total is the only interesting statistic, I normally insert a new record for every event. If it ever turns out that something was wrong and the events should be disregarded for a particular time period, or for events with particular other properties, I have the complete history and can do that. Inserts are also more efficient for the database than updates.

But for something like an employee table, you're probably going to be stuck with restoring the table under a new name and having a human look over the old data and the current data and update the current data appropriately.

You may want to consider having a "deleted" flag column and never actually deleting records. That way, at least, undelete is an easy operation that doesn't even need backups restored.

--
greg
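The insert-events-instead-of-counters pattern can be sketched like this (an illustrative SQLite-via-Python sketch; the `hits` table and its contents are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Instead of UPDATE-ing a single counter row, record one row per event.
    CREATE TABLE hits (
        page     TEXT NOT NULL,
        occurred TEXT NOT NULL
    );
""")
for ts in ('2003-06-17', '2003-06-18', '2003-06-19'):
    conn.execute("INSERT INTO hits VALUES ('home', ?)", (ts,))

# The running total is just an aggregate...
total = conn.execute(
    "SELECT count(*) FROM hits WHERE page = 'home'"
).fetchone()[0]

# ...and bad events can later be disregarded for a particular time period,
# which a pre-incremented counter could never do.
good = conn.execute(
    "SELECT count(*) FROM hits WHERE page = 'home' AND occurred <> '2003-06-18'"
).fetchone()[0]
print(total, good)
```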
On your second question: keeping old data helps with data analysis, i.e., data mining.

I would record the fired date as transactions. To see if an employee is still an employee, look for the latest transaction: hired, rehired, contracted with as a temp/consultant, fired, laid off, etc.

Antonios Christofides wrote:
> [snip]
> So, it is a matter of database design? Do I have to design the database
> so that it keeps the history of what happened?
Regarding backup history:

I have an application designed for novices. Apparently it's easy to hit the "Delete" button, and then say yes to the "Are you sure you want to delete this?" question even when they don't want to. Therefore I simply mark a record as deleted. For example:

UPDATE table SET deleted='t' WHERE something=true;

Then my application logic pretends the record doesn't really exist, until two days later the user decides they want it back.

It works very well for me.

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org

> -----Original Message-----
> From: Bruno Wolff III [mailto:bruno@wolff.to]
> Sent: Thursday, June 19, 2003 6:59 AM
> To: Antonios Christofides
> Cc: pgsql-general@postgresql.org
> Subject: Re: Incremental backups, and backup history
>
> This isn't a good situation to be in. If you log all of the transactions,
> one option may be to restore the database as of the last time it was
> valid, remove the bogus transactions from the transaction log and then
> replay the transactions from the time of the backup to the present.
> [snip]
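A minimal sketch of this soft-delete pattern (SQLite via Python; the `notes` table, its columns, and the view name are invented for the example). Here the application reads through a view, so "deleted" rows simply stop showing up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notes (
        id      INTEGER PRIMARY KEY,
        body    TEXT NOT NULL,
        deleted BOOLEAN NOT NULL DEFAULT 0
    );
    -- The application only ever reads live_notes, so a "delete" is reversible.
    CREATE VIEW live_notes AS SELECT id, body FROM notes WHERE NOT deleted;
""")
conn.execute("INSERT INTO notes (id, body) VALUES (1, 'keep me')")
conn.execute("INSERT INTO notes (id, body) VALUES (2, 'oops')")

# The user's "Delete" button marks the row instead of removing it.
conn.execute("UPDATE notes SET deleted = 1 WHERE id = 2")
visible = conn.execute("SELECT id FROM live_notes ORDER BY id").fetchall()

# Two days later: undelete is a one-row UPDATE, no backup tapes involved.
conn.execute("UPDATE notes SET deleted = 0 WHERE id = 2")
restored = conn.execute("SELECT id FROM live_notes ORDER BY id").fetchall()
print(visible, restored)
```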
On Thu, 19 Jun 2003, Matthew Nuzum wrote:
> Regarding backup history:
>
> I have an application designed for novices. Apparently it's easy to hit the
> "Delete" button, and then say yes to the "Are you sure you want to delete
> this?" question even when they don't want to. Therefore I simply mark a
> record as deleted. For example,
> UPDATE table SET deleted='t' WHERE something=true;
>
> It works very well for me.

But are you also taking care of the referential integrity issues? That is, disallowing tuples with deleted = true from being referenced, and ensuring that nothing references them at the time they are marked as deleted.

It is a useful idea, but as I know from a current project, it requires reimplementing foreign key functionality. In this case the middleware only uses functions, one per statement, and nothing else, so I have been able to do much of this in those functions, but it's still a pain. I even wrote a utility to take some of the legwork out of generating and maintaining quite a few functions. But if I'd had time (and thought of these as basically being foreign key constraints) I'd have looked at the existing foreign key code and seen if I could copy and amend it, or just amend it in place.

--
Nigel Andrews
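One way to approximate the missing referential-integrity check is a trigger that rejects new references to soft-deleted rows. This is an illustrative SQLite-via-Python sketch with invented table names; PostgreSQL would express the same check as a plpgsql trigger, and it covers only the insert side, not cascaded marking:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (
        id      INTEGER PRIMARY KEY,
        deleted BOOLEAN NOT NULL DEFAULT 0
    );
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL
    );
    -- A plain FOREIGN KEY cannot see the deleted flag, so check it by hand.
    CREATE TRIGGER child_fk BEFORE INSERT ON child
    WHEN (SELECT deleted FROM parent WHERE id = NEW.parent_id) = 1
    BEGIN
        SELECT RAISE(ABORT, 'row references a deleted parent');
    END;
""")
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("UPDATE parent SET deleted = 1 WHERE id = 1")

# Referencing the soft-deleted parent is now rejected by the trigger.
try:
    conn.execute("INSERT INTO child VALUES (1, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```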
That's a good point, referential integrity and 'deleted' items. I'll have to take a look at that as I make my next design. I'm surprised that I didn't think of it, but I probably would have run into it soon, as I am getting ready to put data into the design I'm working on now.

One thing I know that makes it all easier is to use surrogate integer keys on all tables, i.e. sequences, as the primary key.

Nigel J. Andrews wrote:
> But are you also taking care of the referential integrity issues, i.e. only
> disallowing tuples with a deleted = true from being referenced to and ensuring
> nothing references them at the time they are marked as deleted.
> [snip]
I think somebody already suggested the following solution, which would solve the referential integrity problems:

- create a "history" table with the exact layout of the working table, but without its FK constraints, and optionally with an insertion date;
- create a trigger on the working table so that it inserts the current state into the history table on update/delete/(maybe insert).

Now when you delete a row from the working table, you have all the benefits of FKs, and the history table provides all the history info you need. Recovery would only be possible by manually copying back from the history table, with possible FK hassles, but the data would be there.

Cheers,
Csaba.

On Fri, 2003-06-20 at 18:28, Dennis Gearon wrote:
> That's a good point, ref integrity and 'deleted' items. I'll have to take a
> look at that as I make my next design.
>
> One way I know that makes it all easier, is to use surrogate integer keys on
> all tables, i.e. sequences, as the primary key.
> [snip]