Thread: planned recovery from a certain transaction
I know questions like this have been asked before, but I haven't seen one quite from this perspective (although I'm sure it's out there somewhere)...
We have a database with one long, involved procedure early in the morning that updates all sorts of things, moves data around, deletes some stuff, alters some DDL - you name it, it does it. The rest of the day, the database is read-only. Autovacuum is not on; it was killing performance when it kicked in in the middle of the proc. Instead, a manual vacuum is run right before the long procedure. We then typically wait until the guy onsite verifies the procedure went smoothly before vacuuming again. (Why? Because we've read enough of "well, you would have been able to restore that if the autovacuum wasn't running". We have a backup, so this check-before-vacuuming is technically unneeded.) Don't get me wrong, I'm a big autovacuum fan, just not for this specific case.
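Roughly, the morning looks like this (just a sketch - run_morning_proc() is a stand-in name for our procedure):

  -- postgresql.conf has autovacuum = off for this cluster
  VACUUM ANALYZE;              -- manual vacuum right before the proc
  SELECT run_morning_proc();   -- the long transaction
  -- ... onsite verification happens here ...
  VACUUM ANALYZE;              -- again, only after the all-clear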
The transaction itself works flawlessly, but every once in a while the data it uploads from comes in flawed and we have to find a way to reset it. This reset involves restoring a backup that was taken right before the proc started. If we had the xid of the long-running transaction, is there a better way to reset the database to just before that transaction happened? Restoring the backup is a lengthy process because several of the affected tables are rather large.
Chris Spotts
Chris Spotts wrote:
> The transaction itself works flawlessly, but every once in a while the
> data it uploads from comes in flawed and we have to find a way to reset
> it. This reset involves restoring a backup that was taken right before
> the proc started. If we had the xid of the long-running transaction, is
> there a better way to reset the database to just before that transaction
> happened? Restoring the backup is a lengthy process because several of
> the affected tables are rather large.

There's no way really to "rewind" to a previous transaction (although I believe the original academic code PostgreSQL is based on could do this sort of thing).

I'd look at doing a PITR backup (full + WAL) just before the long transaction is started. Alternatively, if you can afford the downtime, you could just stop the database server and take a snapshot of all the DB files (as for PITR). Using rsync or a filesystem that handles snapshots would make both of these reasonably fast. Restores are then just a matter of moving the files / pointing PG at the backup set and starting it - effectively instant.
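For the PITR route, something like this just before the procedure (only a sketch - assumes an 8.x-era setup with archive_command already configured, and the copy step is whatever rsync or snapshot mechanism you prefer):

  SELECT pg_start_backup('pre_proc');
  -- ... copy $PGDATA at the filesystem level (rsync, LVM snapshot, etc.) ...
  SELECT pg_stop_backup();

--
  Richard Huxton
  Archonet Ltd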
On Thursday 25 June 2009, "Chris Spotts" <rfusca@gmail.com> wrote:
> The transaction itself works flawlessly, but every once in a while the
> data it uploads from comes in flawed and we have to find a way to reset
> it. This reset involves restoring a backup that was taken right before
> the proc started. If we had the xid of the long-running transaction, is
> there a better way to reset the database to just before that transaction
> happened? Restoring the backup is a lengthy process because several of
> the affected tables are rather large.

PITR would permit you to restore the database to a point in time or to a transaction ID. You could probably do something with filesystem snapshots to minimize the backup/restore time for it.
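An untested sketch of the recovery side (8.x-style recovery.conf; the archive path and xid are placeholders - substitute your own archive location and the xid you noted before the procedure ran):

  restore_command = 'cp /mnt/wal_archive/%f %p'
  recovery_target_xid = '123456'
  recovery_target_inclusive = 'false'   # stop just before that transaction commits

--
WARNING: Do not look into laser with remaining eye.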
Chris Spotts wrote:
> The transaction itself works flawlessly, but every once in a while the
> data it uploads from comes in flawed and we have to find a way to reset
> it. This reset involves restoring a backup that was taken right before
> the proc started. If we had the xid of the long-running transaction, is
> there a better way to reset the database to just before that transaction
> happened? Restoring the backup is a lengthy process because several of
> the affected tables are rather large.

You could mark it aborted in pg_clog, assuming none of the tuples it touched have been examined by anyone else after it finished. Since you likely want to crosscheck the data (thus examining it, which sets its hint bits), that's going to be very hard to do.

Another idea would be to use PITR to restore to the time just before the transaction, but that's going to be painful too, because restoring from a base backup is going to take a long time for your big tables.

Lastly, you could use a filesystem snapshot taken just before the long procedure, to which you revert if you don't like how it went.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Jun 25, 2009 at 9:40 AM, Chris Spotts <rfusca@gmail.com> wrote:
> We have a database with one long, involved procedure early in the morning
> that updates all sorts of things, moves data around, deletes some stuff,
> alters some DDL - you name it, it does it. The rest of the day, the
> database is read-only. Autovacuum is not on; it was killing performance
> when it kicked in in the middle of the proc. Instead, a manual vacuum is
> run right before the long procedure. We then typically wait until the guy
> onsite verifies the procedure went smoothly before vacuuming again. (Why?
> Because we've read enough of "well, you would have been able to restore
> that if the autovacuum wasn't running". We have a backup, so this
> check-before-vacuuming is technically unneeded.) Don't get me wrong, I'm
> a big autovacuum fan, just not for this specific case.
>
> The transaction itself works flawlessly, but every once in a while the
> data it uploads from comes in flawed and we have to find a way to reset
> it. This reset involves restoring a backup that was taken right before
> the proc started. If we had the xid of the long-running transaction, is
> there a better way to reset the database to just before that transaction
> happened? Restoring the backup is a lengthy process because several of
> the affected tables are rather large.

Assuming that the data is mostly created from whole cloth each morning, it might do to have two dbs and rename one to replace the other when you're ready. That gives you 20 or so hours to discover a screw-up and still have the backup db before you toss it away to build the next day's db.
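The rename itself is cheap (a sketch with hypothetical database names - you have to be connected to a third database such as postgres, with no sessions on either db being renamed):

  ALTER DATABASE warehouse RENAME TO warehouse_old;
  ALTER DATABASE warehouse_staging RENAME TO warehouse;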
>> The transaction itself works flawlessly, but every once in a while the
>> data it uploads from comes in flawed and we have to find a way to reset
>> it.

If you can automate the tests for the flaws, you can run the whole procedure as one big transaction in Postgres - even DDL can be done in transactions in Postgres. Then run the tests, still in the same transaction, and if you detect anything wrong, roll the whole shebang back.
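Something like this (a sketch - check_import_sanity() stands in for a hypothetical plpgsql function that RAISEs an EXCEPTION when a test fails, which aborts everything):

  BEGIN;
  -- ... the entire morning procedure: updates, deletes, DDL ...
  SELECT check_import_sanity();  -- any raised error rolls the whole thing back
  COMMIT;

--
greg
http://mit.edu/~gsstark/resume.pdf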
> Assuming that the data is mostly created from whole cloth each morning,
> it might do to have two dbs and rename one to replace the other when
> you're ready. That gives you 20 or so hours to discover a screw-up and
> still have the backup db before you toss it away to build the next day's
> db.

For this database, we're dumping in the neighborhood of 75GB of data each morning. This is mostly data warehousing, and I'm not quite sure how effective a filesystem snapshot is going to be since the whole db is so large. The aborted-transaction-in-pg_clog idea sounds similar to what I was dreaming of, but "very hard to do" doesn't sound real promising.

Here's the approach we've come up with today to help combat it; let me know your thoughts on it. We're mostly warehousing logistics data. We have an import table that the files are uploaded to, a set of tables for today's data processing, a set of tables for the results of yesterday (we do some cross-day comparisons that make them necessary), and a set of partitioned archive tables. The data is uploaded into the import table, the previous-day tables are moved to history, and then the current-day tables are moved to previous-day, leaving the current-day tables ready to be populated. The import table is then normalized into several current-day tables, mostly split into dynamic and static data. Some fields are normalized to reduce size, i.e. we store a city_id int instead of a 60-character field. Reports and stats are generated off the current day's records and the comparison of the two days.

Because we calculate certain things like durations of certain statuses, it's not as simple as "delete everything with a certain date and then copy the appropriate dates from history into current and previous respectively." We effectively alter some fields that cannot be "unaltered" short of a restore of some kind. Due to speed concerns, we've not found a way around this. We'd like to store the fields needed to calculate the durations after the end of the duration, but that can be quite lengthy; instead, triggers update a table that tracks key events. This is all what we currently do. It means we'd have to have a HUGE filesystem snapshot to do it that way, and to do a pg_dump restore we have to restore one of the partitions that is 250GB (2 weeks of data, partitioned) - it's a lengthy process.

We're looking at adding a set of daily queue tables that sit right before archive: write the data to these tables and then, if everything checks out, dump to history. This way, we should only ever need a dump of the current, previous, and queue tables to be able to "undo". Suggestions?
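In outline, the day-end shuffle would become something like this (a rough sketch with made-up table names; our real history tables are partitioned, so the final insert would actually target the right partition):

  BEGIN;
  -- stage yesterday's data in a queue instead of archiving it immediately
  ALTER TABLE previous_day RENAME TO queue_day;
  ALTER TABLE current_day RENAME TO previous_day;
  CREATE TABLE current_day (LIKE previous_day INCLUDING DEFAULTS);
  COMMIT;

  -- later, once the day's output has been verified:
  INSERT INTO history SELECT * FROM queue_day;
  DROP TABLE queue_day;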
On Thu, 2009-06-25 at 21:59 +0100, Greg Stark wrote:
> >> The transaction itself works flawlessly, but every once in a while
> >> the data it uploads from comes in flawed and we have to find a way
> >> to reset it.
>
> If you can automate the tests for the flaws, you can run the whole
> procedure as one big transaction in Postgres - even DDL can be done in
> transactions in Postgres. Then run the tests, still in the same
> transaction, and if you detect anything wrong, roll the whole shebang
> back.

We automate dozens of tests for the data, all in one large transaction, as the very first thing. We love the fact that it undoes DDL and have taken full advantage of it. But we get "these stats don't look right", then go look at the import and find out that the context of the data was incorrect. I can't imagine how to write tests for the stuff we find wrong. If anybody has an extra AI system lying around, let me know ;).