Thread: Patch for disaster recovery
I have added this define to help people recover deleted tuples. Of course it is only to be used for disaster recovery. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: src/backend/utils/time/tqual.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/time/tqual.c,v retrieving revision 1.81 diff -c -c -r1.81 tqual.c *** src/backend/utils/time/tqual.c 31 Dec 2004 22:02:56 -0000 1.81 --- src/backend/utils/time/tqual.c 20 Feb 2005 04:52:13 -0000 *************** *** 776,781 **** --- 776,786 ---- HeapTupleSatisfiesSnapshot(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer) { + /* This is to be used only for disaster recovery and requires serious analysis. */ + #ifdef MAKE_ALL_TUPLES_VISIBLE + return true; + #endif + if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)) { if (tuple->t_infomask & HEAP_XMIN_INVALID)
On Sat, Feb 19, 2005 at 11:56:28PM -0500, Bruce Momjian wrote: > I have added this define to help people recover deleted tuples. > Of course it is only to be used for disaster recovery. Wouldn't the patch as provided require a special build of the postmaster, useful only for disaster recovery? What about making the behavior configurable at run time, say via a command-line option? The code could remain in an #ifdef block and be included or not included at build time depending on a configure script option. People who don't want the functionality at all could omit it from their build (or not enable it, depending on the default), while people who do want it could use the same postmaster for normal operation and for recovery, the latter function requiring starting the postmaster with a "disaster recovery mode" option. Just brainstorming. Thoughts? -- Michael Fuhr http://www.fuhr.org/~mfuhr/
Bruce Momjian <pgman@candle.pha.pa.us> writes: > + /* This is to be used only for disaster recovery and requires serious analysis. */ > + #ifdef MAKE_ALL_TUPLES_VISIBLE > + return true; > + #endif If this requires serious analysis, why did you install the patch without any public discussion? I cannot imagine any situation where this wouldn't break things beyond belief, and so I request that you revert the patch until you've provided a real example in which it would help. regards, tom lane
On Sat, Feb 19, 2005 at 10:56:46PM -0700, Michael Fuhr wrote: > > What about making the behavior configurable at run time, say via a > command-line option? Such functionality should be easy to backpatch to earlier versions, shouldn't it? Pleas for help could then be answered with, "upgrade to the latest version in the branch you're running, build with this configure option, start the postmaster with this run-time option." -- Michael Fuhr http://www.fuhr.org/~mfuhr/
On Sat, Feb 19, 2005 at 11:07:55PM -0700, Michael Fuhr wrote: > On Sat, Feb 19, 2005 at 10:56:46PM -0700, Michael Fuhr wrote: > > > > What about making the behavior configurable at run time, say via a > > command-line option? > > Such functionality should be easy to backpatch to earlier versions, > shouldn't it? Hmmm...after seeing Tom's reply, I suppose I should have first asked, "Gee, looks simple, but does it work?" Should I even bother experimenting with it in a test environment? -- Michael Fuhr http://www.fuhr.org/~mfuhr/
Michael Fuhr <mike@fuhr.org> writes: > Hmmm...after seeing Tom's reply, I suppose I should have first > asked, "Gee, looks simple, but does it work?" Should I even bother > experimenting with it in a test environment? Experiment away, but I have a hard time visualizing how you'll find it useful. Just brainstorming here, but it seems to me that what might make some kind of sense is a command to "undelete all tuples in this table". You do that, you look through them, you delete the versions you don't want, you're happy. The problem with the patch as-is is that (a) "deleting the versions you don't want" is a no-op, so you cannot keep straight what you've done in terms of filtering out garbage; and (b) when you revert to a non-broken postmaster, the stuff you couldn't see before goes back to being unseeable, because after all you didn't change its state. With either the snapshot kluge or the undelete-all kluge, you have an issue in that constraints are broken wholesale --- you can see lots of duplicate row versions that violate unique constraints, deleted versions that violate FK constraints because they reference also-deleted master rows, deleted versions that violate later-added CHECK constraints, etc. I'd sort of like to see the system flip into some mode that says "we're not promising constraints are honored", and then you can't go back to normal operation without going through some pushup that checks all the remaining rows satisfy the declared constraints. In any case I suspect it's a bad idea to treat tuples as good if their originating transaction did not commit. For starters, such a tuple might not possess all the index entries it should (if the originating transaction failed before inserting said entries). I think what we want to think about is overriding delete commands, not overriding insert failures. Not sure where this leads to, but it's not leading to an undocumented one-line hack in tqual.c, and definitely not *that* one-line hack. regards, tom lane
On Sun, Feb 20, 2005 at 02:21:00AM -0500, Tom Lane wrote: > Just brainstorming here, but it seems to me that what might make some > kind of sense is a command to "undelete all tuples in this table". Would it be possible to start a transaction with specific snapshot parameters? Could you "time warp" that way, assuming that what you're looking for hasn't already been vacuumed away? Rather than modifying anything, you'd SELECT or COPY the data somewhere and then restore it when you came "back to the future." Just another brainstorm...certainly not thought very far through yet. -- Michael Fuhr http://www.fuhr.org/~mfuhr/
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > + /* This is to be used only for disaster recovery and requires serious analysis. */ > > + #ifdef MAKE_ALL_TUPLES_VISIBLE > > + return true; > > + #endif > > If this requires serious analysis, why did you install the patch without > any public discussion? > > I cannot imagine any situation where this wouldn't break things beyond > belief, and so I request that you revert the patch until you've provided > a real example in which it would help. We regularly have people on IRC who delete data and then want to recover it. By having the define it makes it easier for us to help them without them having to add actual C code. Does that make sense? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > Hmmm...after seeing Tom's reply, I suppose I should have first > > asked, "Gee, looks simple, but does it work?" Should I even bother > > experimenting with it in a test environment? > > Experiment away, but I have a hard time visualizing how you'll find it > useful. > > Just brainstorming here, but it seems to me that what might make some > kind of sense is a command to "undelete all tuples in this table". > You do that, you look through them, you delete the versions you don't > want, you're happy. The problem with the patch as-is is that (a) > "deleting the versions you don't want" is a no-op, so you cannot keep > straight what you've done in terms of filtering out garbage; and (b) > when you revert to a non-broken postmaster, the stuff you couldn't see > before goes back to being unseeable, because after all you didn't change > its state. It would be used only for dumping the table out and reverting to the regular postmaster. > With either the snapshot kluge or the undelete-all kluge, you have an > issue in that constraints are broken wholesale --- you can see lots of > duplicate row versions that violate unique constraints, deleted versions > that violate FK constraints because they reference also-deleted master > rows, deleted versions that violate later-added CHECK constraints, etc. > I'd sort of like to see the system flip into some mode that says "we're > not promising constraints are honored", and then you can't go back to > normal operation without going through some pushup that checks all the > remaining rows satisfy the declared constraints. > > In any case I suspect it's a bad idea to treat tuples as good if their > originating transaction did not commit. For starters, such a tuple > might not possess all the index entries it should (if the originating > transaction failed before inserting said entries). I think what we > want to think about is overriding delete commands, not overriding > insert failures. > > Not sure where this leads to, but it's not leading to an undocumented > one-line hack in tqual.c, and definitely not *that* one-line hack. Yea, it should be more precise about showing only delete stuff and not uncommitted creates. Let me work on that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Not sure where this leads to, but it's not leading to an undocumented > one-line hack in tqual.c, and definitely not *that* one-line hack. OK, it turns out the bottom of the function is the right place to fix this. Patch attached and applied. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: src/backend/utils/time/tqual.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/time/tqual.c,v retrieving revision 1.82 diff -c -c -r1.82 tqual.c *** src/backend/utils/time/tqual.c 20 Feb 2005 04:56:00 -0000 1.82 --- src/backend/utils/time/tqual.c 20 Feb 2005 14:56:33 -0000 *************** *** 776,786 **** HeapTupleSatisfiesSnapshot(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer) { - /* This is to be used only for disaster recovery and requires serious analysis. */ - #ifdef MAKE_ALL_TUPLES_VISIBLE - return true; - #endif - if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)) { if (tuple->t_infomask & HEAP_XMIN_INVALID) --- 776,781 ---- *************** *** 965,971 **** --- 960,971 ---- } } + /* This is to be used only for disaster recovery and requires serious analysis. */ + #ifdef MAKE_ALL_TUPLES_VISIBLE return false; + #else + return true; + #endif }
Tom Lane wrote: > Not sure where this leads to, but it's not leading to an undocumented > one-line hack in tqual.c, and definitely not *that* one-line hack. Sorry, here is the proper change I just applied: /* This is to be used only for disaster recovery and requires serious analysis. */ #ifndef MAKE_ALL_TUPLES_VISIBLE return false; #else return true; #endif -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Sun, Feb 20, 2005 at 09:43:11 -0500, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > We regularly have people on IRC who delete data and then want to recover > it. By having the define it makes it easier for us to help them without > them having to add actual C code. > > Does that make sense? You aren't going to get a consistant snapshot if you get back all of the deleted rows. With autovacuum it is going to get harder to do this, because accidentally making large changes in a table is going to trigger a vacuum. It seems like the right way to do this is a recovery using the PITR system and putting effort into making that easier is the way to go.
Bruno Wolff III wrote: > On Sun, Feb 20, 2005 at 09:43:11 -0500, > Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > > > We regularly have people on IRC who delete data and then want to recover > > it. By having the define it makes it easier for us to help them without > > them having to add actual C code. > > > > Does that make sense? > > You aren't going to get a consistant snapshot if you get back all of the > deleted rows. With autovacuum it is going to get harder to do this, because > accidentally making large changes in a table is going to trigger a vacuum. Right, but as far as I remember the vacuum isn't going to reuse the rows. Rather, it is just going to mark them for later reuse. > It seems like the right way to do this is a recovery using the PITR > system and putting effort into making that easier is the way to go. You are assuming someone is running PITR already, which they might not. This define is just so we can help folks recover deleted stuff under our guidance. Also, I think a general GUC mechanism is too dangerous. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Not sure where this leads to, but it's not leading to an undocumented >> one-line hack in tqual.c, and definitely not *that* one-line hack. > Sorry, here is the proper change I just applied: > /* This is to be used only for disaster recovery and requires serious analysis. */ > #ifndef MAKE_ALL_TUPLES_VISIBLE > return false; > #else > return true; > #endif AFAICS this has no value whatsoever. Assuming that someone has a disaster recovery problem on their hands, how likely is it that they will know that that code is there, or be able to turn it on (most users don't compile from source anymore), or be able to use it effectively, given the complete lack of documentation? As is, this is of value only to someone familiar with the code, and such a someone could go in and modify the logic for themselves just as easily as turn on a #define. I think the only real effect of this patch will be to confuse people who are reading the source code. tqual.c is already complicated and fragile enough --- it doesn't need conditionally compiled "features" that we can't even explain the use of. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Not sure where this leads to, but it's not leading to an undocumented > >> one-line hack in tqual.c, and definitely not *that* one-line hack. > > > Sorry, here is the proper change I just applied: > > > /* This is to be used only for disaster recovery and requires serious analysis. */ > > #ifndef MAKE_ALL_TUPLES_VISIBLE > > return false; > > #else > > return true; > > #endif > > AFAICS this has no value whatsoever. Assuming that someone has a > disaster recovery problem on their hands, how likely is it that they > will know that that code is there, or be able to turn it on (most > users don't compile from source anymore), or be able to use it > effectively, given the complete lack of documentation? As is, this > is of value only to someone familiar with the code, and such a someone > could go in and modify the logic for themselves just as easily as > turn on a #define. > > I think the only real effect of this patch will be to confuse people > who are reading the source code. tqual.c is already complicated and > fragile enough --- it doesn't need conditionally compiled "features" > that we can't even explain the use of. I need a note somewhere to remember where to tell people to modify the code to recovery something. Do you have a better idea? You want just a comment rather than a define? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > Tom Lane wrote: >>I think the only real effect of this patch will be to confuse people >>who are reading the source code. tqual.c is already complicated and >>fragile enough --- it doesn't need conditionally compiled "features" >>that we can't even explain the use of. > > > I need a note somewhere to remember where to tell people to modify the > code to recovery something. Do you have a better idea? You want just a > comment rather than a define? A short guide to disaster recovery would be useful. There's a real shortage of people qualified to help users in this situation. I feel comfortable in telling them what *not* to do, but nothing more. -- Richard Huxton Archonet Ltd
On Sun, 2005-02-20 at 13:08 -0500, Bruce Momjian wrote: > Bruno Wolff III wrote: > > On Sun, Feb 20, 2005 at 09:43:11 -0500, > > Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > > > > > We regularly have people on IRC who delete data and then want to recover > > > it. That is (one of) the purpose(s) of PITR. > By having the define it makes it easier for us to help them without > > > them having to add actual C code. > > > > > > Does that make sense? > > > > You aren't going to get a consistant snapshot if you get back all of the > > deleted rows. With autovacuum it is going to get harder to do this, because > > accidentally making large changes in a table is going to trigger a vacuum. > > Right, but as far as I remember the vacuum isn't going to reuse the > rows. Rather, it is just going to mark them for later reuse. > > > It seems like the right way to do this is a recovery using the PITR > > system and putting effort into making that easier is the way to go. > > You are assuming someone is running PITR already, which they might not. I do hope you/we all recommend PITR as the supported route for recovery situations?? Otherwise it never will get easier. If PITR were more clearly recommended, then perhaps more people would think about about using it ahead of time, just like everybody does with commercial databases. Best Regards, Simon Riggs