Thread: Help me recover data
Hi folks, I ran into big trouble - it seems that my DB is lost. "select * from pg_database" gives me 0 rows, but I can still connect to databases with \c and even select from tables there, although they're also not visible with \dt.

After asking the guys in the #postgresql@irc.freenode.net channel, they told me that the reason is "Transaction ID wraparound", because I have never run VACUUM on the whole database. So they proposed to ask here for help. I have stopped the server, but what could I do in order to save the data, if it's possible at all?

You could also take a look at my pg_controldata output: http://rafb.net/paste/results/ghcIb880.html

Regards, Kouber Saparev
"Kouber Saparev" <postgresql@saparev.com> writes: > After asking the guys in the #postgresql@irc.freenode.net channel they told > me that the reason is the "Transaction ID wraparound", because I have never > ran VACUUM on the whole database. > So they proposed to ask here for help. I have stopped the server, but what > could I do in order to save the data if it's possible at all? I think you're pretty well screwed as far as getting it *all* back goes, but you could use pg_resetxlog to back up the NextXID counter enough to make your tables and databases reappear (and thereby lose the effects of however many recent transactions you back up over). Once you've found a NextXID setting you like, I'd suggest an immediate pg_dumpall/initdb/reload to make sure you have a consistent set of data. Don't VACUUM, or indeed modify the DB at all, until you have gotten a satisfactory dump. Then put in a cron job to do periodic vacuuming ;-) regards, tom lane
> Once you've found a NextXID setting you like, I'd suggest an immediate > pg_dumpall/initdb/reload to make sure you have a consistent set of data. > Don't VACUUM, or indeed modify the DB at all, until you have gotten a > satisfactory dump. > > Then put in a cron job to do periodic vacuuming ;-) Thank you, I just discovered in the mailing lists that I'm neither the first nor the last guy who has forgotten to VACUUM the database. ;-) Regards, Kouber Saparev
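As a concrete, minimal version of the cron job being suggested -- the schedule is arbitrary, and vacuumdb needs to run as a user allowed to connect to every database:

    # system crontab entry: vacuum and analyze every database nightly at 03:00
    0 3 * * *  postgres  vacuumdb --all --analyze --quiet

A database-wide vacuum like this also covers the system catalogs, which is exactly the part that hand-rolled per-table scripts tend to miss.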
> I think you're pretty well screwed as far as getting it *all* back goes, > but you could use pg_resetxlog to back up the NextXID counter enough to > make your tables and databases reappear (and thereby lose the effects of > however many recent transactions you back up over). > > Once you've found a NextXID setting you like, I'd suggest an immediate > pg_dumpall/initdb/reload to make sure you have a consistent set of data. > Don't VACUUM, or indeed modify the DB at all, until you have gotten a > satisfactory dump. > > Then put in a cron job to do periodic vacuuming ;-) This might seem like a stupid question, but since this is a massive data loss potential in PostgreSQL, what's so hard about having the checkpointer or something check the transaction counter when it runs and either issue a db-wide vacuum if it's about to wrap, or simply disallow any new transactions? I think people'd rather their db just stopped accepting new transactions rather than just losing data... Chris
It must be possible to create a tool based on the PostgreSQL sources that can read all the tuples in a database and dump them to a file stream. All the data remains in the files until it is overwritten with new data after a vacuum. It *should* be doable. If the data in the table is worth anything, then it would be worth extracting. It would, of course, be a tool of last resort. > "Kouber Saparev" <postgresql@saparev.com> writes: >> After asking the guys in the #postgresql@irc.freenode.net channel they >> told >> me that the reason is the "Transaction ID wraparound", because I have >> never >> ran VACUUM on the whole database. > >> So they proposed to ask here for help. I have stopped the server, but >> what >> could I do in order to save the data if it's possible at all? > > I think you're pretty well screwed as far as getting it *all* back goes, > but you could use pg_resetxlog to back up the NextXID counter enough to > make your tables and databases reappear (and thereby lose the effects of > however many recent transactions you back up over). > > Once you've found a NextXID setting you like, I'd suggest an immediate > pg_dumpall/initdb/reload to make sure you have a consistent set of data. > Don't VACUUM, or indeed modify the DB at all, until you have gotten a > satisfactory dump. > > Then put in a cron job to do periodic vacuuming ;-) > > regards, tom lane
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > This might seem like a stupid question, but since this is a massive > data loss potential in PostgreSQL, what's so hard about having the > checkpointer or something check the transaction counter when it runs > and either issue a db-wide vacuum if it's about to wrap, or simply > disallow any new transactions? I think autovac-in-backend is the preferred solution to this, and it's definitely on the TODO list... -Doug
>> I think you're pretty well screwed as far as getting it *all* back goes, >> but you could use pg_resetxlog to back up the NextXID counter enough to >> make your tables and databases reappear (and thereby lose the effects of >> however many recent transactions you back up over). >> >> Once you've found a NextXID setting you like, I'd suggest an immediate >> pg_dumpall/initdb/reload to make sure you have a consistent set of data. >> Don't VACUUM, or indeed modify the DB at all, until you have gotten a >> satisfactory dump. >> >> Then put in a cron job to do periodic vacuuming ;-) > > This might seem like a stupid question, but since this is a massive data > loss potential in PostgreSQL, what's so hard about having the > checkpointer or something check the transaction counter when it runs and > either issue a db-wide vacuum if it's about to wrap, or simply > disallow any new transactions? > > I think people'd rather their db just stopped accepting new transactions > rather than just losing data... > I would certainly prefer the system to issue an error and stop working rather than suffer complete data loss.
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > This might seem like a stupid question, but since this is a massive data > loss potential in PostgreSQL, what's so hard about having the > checkpointer or something check the transaction counter when it runs and > either issue a db-wide vacuum if it's about to wrap, or simply > disallow any new transactions? The checkpointer is entirely incapable of either detecting the problem (it doesn't have enough infrastructure to examine pg_database in a reasonable way) or preventing backends from doing anything if it did know there was a problem. > I think people'd rather their db just stopped accepting new transactions > rather than just losing data... Not being able to issue new transactions *is* data loss --- how are you going to get the system out of that state? autovacuum is the correct long-term solution to this, not some kind of automatic hara-kiri. regards, tom lane
> Not being able to issue new transactions *is* data loss --- how are you > going to get the system out of that state?

Yes, but I also would prefer the server to say something like "The database is full, please vacuum." - the same as when the hard disk is full and you try to record something on it - it's not exactly data loss, just an inability to continue the job.

The thing is that a warning is issued only when you start the vacuum itself:

play=# VACUUM;
WARNING: some databases have not been vacuumed in 1613770184 transactions
HINT: Better vacuum them within 533713463 transactions, or you may have a wraparound failure.
VACUUM

So, it's something like the chicken-and-egg problem: you have to vacuum in order to receive a message that you had to do it earlier, but sometimes it's just too late. As it was in my case, I have just discovered that almost all of my data is missing - not even a notice or a warning message to let me know that the end of the world is approaching. :)

Regards, Kouber Saparev
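One way around that chicken-and-egg problem is to ask the system catalogs directly how old each database's frozen-XID horizon is, which does not require running VACUUM at all. A minimal check, assuming a release new enough to carry datfrozenxid in pg_database (the 1.5 billion figure is only an illustrative alarm level, well short of the roughly 2 billion wrap point):

    psql -d template1 -c "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY age(datfrozenxid) DESC"
    # any database whose age is heading toward ~1,500,000,000 needs a database-wide VACUUM soon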
> The checkpointer is entirely incapable of either detecting the problem > (it doesn't have enough infrastructure to examine pg_database in a > reasonable way) or preventing backends from doing anything if it did > know there was a problem. Well, I guess I meant 'some regularly running process'... >>I think people'd rather their db just stopped accepting new transactions >>rather than just losing data... > > Not being able to issue new transactions *is* data loss --- how are you > going to get the system out of that state? Not allowing any transactions except a vacuum... > autovacuum is the correct long-term solution to this, not some kind of > automatic hara-kiri. Yeah, seems like it should really happen soon... Chris
>> The checkpointer is entirely incapable of either detecting the problem >> (it doesn't have enough infrastructure to examine pg_database in a >> reasonable way) or preventing backends from doing anything if it did >> know there was a problem. > > Well, I guess I meant 'some regularly running process'... > >>>I think people'd rather their db just stopped accepting new transactions >>>rather than just losing data... >> >> Not being able to issue new transactions *is* data loss --- how are you > going to get the system out of that state? > > Not allowing any transactions except a vacuum... > >> autovacuum is the correct long-term solution to this, not some kind of >> automatic hara-kiri. > > Yeah, seems like it should really happen soon... > > Chris Maybe I'm missing something, but shouldn't the prospect of data loss (even in the presence of admin ignorance) be something that should be unacceptable? Certainly within the realm of "normal PostgreSQL" operation.
pgsql@mohawksoft.com writes: > Maybe I'm missing something, but shouldn't the prospect of data loss (even > in the presense of admin ignorance) be something that should be > unacceptable? Certainly within the realm "normal PostgreSQL" operation. [ shrug... ] The DBA will always be able to find a way to shoot himself in the foot. We've seen several instances of people blowing away pg_xlog and pg_clog, for example, because they "don't need log files". Or how about failing to keep adequate backups? That's a sure way for an ignorant admin to lose data too. Once autovacuum gets to the point where it's used by default, this particular failure mode should be a thing of the past, but in the meantime I'm not going to panic about it. regards, tom lane
> pgsql@mohawksoft.com writes: >> Maybe I'm missing something, but shouldn't the prospect of data loss >> (even >> in the presense of admin ignorance) be something that should be >> unacceptable? Certainly within the realm "normal PostgreSQL" operation. > > [ shrug... ] The DBA will always be able to find a way to shoot himself > in the foot. We've seen several instances of people blowing away > pg_xlog and pg_clog, for example, because they "don't need log files". > Or how about failing to keep adequate backups? That's a sure way for an > ignorant admin to lose data too. There is a difference between actively doing something stupid and failing to realize a maintenance task is required. PostgreSQL should stop working. When the admin tries to understand why, they can read a troubleshooting FAQ and say "oops, I gotta run this vacuum thingy." That is a whole lot better than falling off a cliff you didn't even know was there. > > Once autovacuum gets to the point where it's used by default, this > particular failure mode should be a thing of the past, but in the > meantime I'm not going to panic about it. I don't know how to say this without sounding like a jerk (I guess that's my role sometimes), but would you go back and re-read this sentence? To paraphrase: "I know this causes a catastrophic data loss, and we have plans to fix it in the future, but for now, I'm not going to panic about it." What would you do if the FreeBSD group or Linux kernel group said this about a file system? If you failed to run fsck after 100 mounts, you lose your data? I thought PostgreSQL was about "protecting your data." How many times have we smugly said, "yea, you can use MySQL if you don't care about your data." Any data loss caused by PostgreSQL should be seen as unacceptable. It's funny: I've known about this for a while, and it has always seemed a sort of distant edge condition that is easily avoided. However, with today's faster machines and disks, it is easier to reach this limitation than ever before. All PostgreSQL needs is one or two VERY UPSET mainstream users who lose data to completely reverse the momentum that it is gaining. No amount of engineering discussion about it not being the fault of PostgreSQL will help, and rightfully so, IMHO. Sorry.
On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > > > > Once autovacuum gets to the point where it's used by default, this > > particular failure mode should be a thing of the past, but in the > > meantime I'm not going to panic about it. > > I don't know how to say this without sounding like a jerk, (I guess that's > my role sometimes) but would you go back and re-read this sentence? > > To paraphrase: "I know this causes a catestrophic data loss, and we have > plans to fix it in the future, but for now, I'm not going panic about it." Do you have a useful suggestion about how to fix it? "Stop working" is handwaving and merely basically saying, "one of you people should do something about this" is not a solution to the problem, it's not even an approach towards a solution to the problem.
>>in the foot. We've seen several instances of people blowing away >>pg_xlog and pg_clog, for example, because they "don't need log files". >>Or how about failing to keep adequate backups? That's a sure way for an >>ignorant admin to lose data too. >> >> > >There is a difference between actively doing something stupid and failing >to realize a maintenence task is required. > >PostgreSQL should stop working. When the admin tries to understand why, >they can read a troubleshooting FAQ and say "oops, I gotta run this vacuum >thingy." That is a whole lot better than falling off a cliff you didn't >even know was there. > > There is another way to look at this that lends itself to mohawksoft's argument. More often than not, DBAs and sysadmins are neither one. They are people that get shoved into the job because they happen to mention around the water cooler that they "once" installed linux/freebsd -- whatever. Maybe it is an executive that has some of his brains left after sitting behind a desk all day for the last 10 years. One day he/she gets a thought in his head to create a new project named "foo". He does not want to waste his internal resources so said executive decides he will do it himself as a hobby. For some reason, the project actually succeeds (I have seen this many times) and the company starts using it. Well guess what... it uses PostgreSQL. The guy isn't a DBA, heck, he isn't even really a programmer. He had no idea about this "vacuum" thing. He had never heard of other databases having to do it. So they run for a year, and then all of a sudden **BOOM** the world ends. Do you think they are going to care that we "documented" the issue? Uhmmm no they won't. Chances are they will drop-kick PostgreSQL and bad-mouth it to all their other executive friends. In short, this whole argument has the mark of irresponsibility on both parties, but it is the PostgreSQL project's responsibility to make a reasonable effort to produce a piece of software that doesn't break. We are not talking about a user who ran a query: delete from foo; At this point we have a known critical bug. Usually the PostgreSQL community is all over critical bugs. Why is this any different? It sounds to me like people are just annoyed that users don't RTFM. Get over it. Most won't. If users RTFM more often, it would put most support companies out of business. Sincerely, Joshua D. Drake -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
>Do you have a useful suggestion about how to fix it? "Stop working" is >handwaving and merely basically saying, "one of you people should do >something about this" is not a solution to the problem, it's not even an >approach towards a solution to the problem. > > I believe that the ability for PostgreSQL to stop accepting queries and to log to the file or STDERR why it stopped working and how to resolve it is appropriate. Also it is probably appropriate to warn ahead of time... WARNING: Only 50,000 transactions left before lock out or something like that. J -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
> At this point we have a known critical bug. Usually the PostgreSQL > community > is all over critical bugs. Why is this any different? > > It sounds to me that people are just annoyed that users don't RTFM. Get > over it. Most won't. If users RTFM more often, it would put most support > companies out of business. I wonder if I should point out that we just had 3 people suffering XID wraparound failure in 2 days in the IRC channel... Chris
Christopher Kings-Lynne wrote: >> At this point we have a known critical bug. Usually the PostgreSQL >> community >> is all over critical bugs. Why is this any different? >> >> It sounds to me that people are just annoyed that users don't RTFM. >> Get over it. Most won't. If users RTFM more often, it would put most >> support companies out of business. > > > I wonder if I should point out that we just had 3 people suffering XID > wraparound failure in 2 days in the IRC channel... I have had half a dozen new customers in the last six months that have had the same problem. Nothing like the phone call: Uhmmm I am a new customer, help I can't see my databases. Sincerely, Joshua D. Drake > > Chris -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
On Wed, 16 Feb 2005, Joshua D. Drake wrote: > > >Do you have a useful suggestion about how to fix it? "Stop working" is > >handwaving and merely basically saying, "one of you people should do > >something about this" is not a solution to the problem, it's not even an > >approach towards a solution to the problem. > > > > > I believe that the ability for PostgreSQL to stop accepting > queries and to log to the file or STDERR why it stopped working > and how to resolve it is appropriate. Right, but since the how to resolve it currently involves executing a query, simply stopping dead won't allow you to resolve it. Also, if we stop at the exact wraparound point, can we run into problems actually trying to do the vacuum if that's still the resolution technique? If so, how far in advance of wraparound must we stop to guarantee it will succeed? It's not rocket science, but figuring such things out is part of actually making a workable solution.
> On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > >> > >> > Once autovacuum gets to the point where it's used by default, this >> > particular failure mode should be a thing of the past, but in the >> > meantime I'm not going to panic about it. >> >> I don't know how to say this without sounding like a jerk, (I guess >> that's >> my role sometimes) but would you go back and re-read this sentence? >> >> To paraphrase: "I know this causes a catestrophic data loss, and we have >> plans to fix it in the future, but for now, I'm not going panic about >> it." > > Do you have a useful suggestion about how to fix it? "Stop working" is > handwaving and merely basically saying, "one of you people should do > something about this" is not a solution to the problem, it's not even an > approach towards a solution to the problem. Actually, it is not a solution to the problem of losing data. It is a drop dead last ditch failsafe that EVERY PRODUCT should have before losing data. >
On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > > On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > > > >> > > >> > Once autovacuum gets to the point where it's used by default, this > >> > particular failure mode should be a thing of the past, but in the > >> > meantime I'm not going to panic about it. > >> > >> I don't know how to say this without sounding like a jerk, (I guess > >> that's > >> my role sometimes) but would you go back and re-read this sentence? > >> > >> To paraphrase: "I know this causes a catestrophic data loss, and we have > >> plans to fix it in the future, but for now, I'm not going panic about > >> it." > > > > Do you have a useful suggestion about how to fix it? "Stop working" is > > handwaving and merely basically saying, "one of you people should do > > something about this" is not a solution to the problem, it's not even an > > approach towards a solution to the problem. > > Actually, it is not a solution to the problem of losing data. It is a drop > dead last ditch failsafe that EVERY PRODUCT should have before losing > data. Let's try again. Saying, "one of you people should do something about this" is not a solution to the problem or an approach thereto. "Stop working" is handwaving since I see no approach therein that allows the user to actually recover the data.
> > On Wed, 16 Feb 2005, Joshua D. Drake wrote: > >> >> >Do you have a useful suggestion about how to fix it? "Stop working" is >> >handwaving and merely basically saying, "one of you people should do >> >something about this" is not a solution to the problem, it's not even >> an >> >approach towards a solution to the problem. >> > >> > >> I believe that the ability for PostgreSQL to stop accepting >> queries and to log to the file or STDERR why it stopped working >> and how to resolve it is appropriate. > > Right, but since the how to resolve it currently involves executing a > query, simply stopping dead won't allow you to resolve it. Also, if we > stop at the exact wraparound point, can we run into problems actually > trying to do the vacuum if that's still the resolution technique? If so, > how far in advance of wraparound must we stop to guarantee it will > succeed? It's not rocket science, but figuring such things out is part of I would say, have a GUC parameter set at 1000 transactions. When fewer than this number are available, the postmaster will not run and will issue a message: "Transaction wrap-around error! You must run vacuum in single-user postgres mode to correct it; to avoid this message, run the vacuum command more frequently" Hell, why not block all the PostgreSQL processes and run vacuum? But, for now, versions of PostgreSQL should stop before losing data.
> Stephan Szabo <sszabo@megazone.bigpanda.com> writes: >> Right, but since the how to resolve it currently involves executing a >> query, simply stopping dead won't allow you to resolve it. Also, if we >> stop at the exact wraparound point, can we run into problems actually >> trying to do the vacuum if that's still the resolution technique? > > We'd have to do something with a fair amount of slop. The idea I was > toying with just now involved a forcible shutdown once we get within > say 100,000 transactions of a wrap failure; but apply this check only > when in interactive operation. This would allow the DBA to perform > the needed VACUUMing manually in a standalone backend. > > The real question here is exactly how large a cluestick do you want to > hit the DBA with. I don't think we can "guarantee" no data loss with > anything less than forced shutdown, but that's not so much a cluestick > as a clue howitzer. I think a DBA or accidental DBA would prefer stating in a meeting: "Yea, the database shut down because I didn't perform normal maintenance; it's fixed now and we have a script in place so it won't happen again" Over "Yea, the database lost all its data and we have to restore from our last backup because I didn't perform normal maintenance." One gets a "boy are you lucky" over a "you're fired." > > Maybe > > (a) within 200,000 transactions of wrap, every transaction start > delivers a WARNING message; > > (b) within 100,000 transactions, forced shutdown as above. I agree.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > Right, but since the how to resolve it currently involves executing a > query, simply stopping dead won't allow you to resolve it. Also, if we > stop at the exact wraparound point, can we run into problems actually > trying to do the vacuum if that's still the resolution technique? We'd have to do something with a fair amount of slop. The idea I was toying with just now involved a forcible shutdown once we get within say 100,000 transactions of a wrap failure; but apply this check only when in interactive operation. This would allow the DBA to perform the needed VACUUMing manually in a standalone backend. The real question here is exactly how large a cluestick do you want to hit the DBA with. I don't think we can "guarantee" no data loss with anything less than forced shutdown, but that's not so much a cluestick as a clue howitzer. Maybe (a) within 200,000 transactions of wrap, every transaction start delivers a WARNING message; (b) within 100,000 transactions, forced shutdown as above. regards, tom lane
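For readers who do end up at that forced-shutdown point, the standalone-backend route Tom mentions looks roughly like this on releases of that era (the data directory and database name are placeholders):

    # with the postmaster stopped, start a standalone backend directly
    postgres -D /usr/local/pgsql/data mydb

    backend> VACUUM;

Repeat for each database, exit the backend with Ctrl-D, and then restart the postmaster normally.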
> > On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > >> > On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: >> > >> >> > >> >> > Once autovacuum gets to the point where it's used by default, this >> >> > particular failure mode should be a thing of the past, but in the >> >> > meantime I'm not going to panic about it. >> >> >> >> I don't know how to say this without sounding like a jerk, (I guess >> >> that's >> >> my role sometimes) but would you go back and re-read this sentence? >> >> >> >> To paraphrase: "I know this causes a catestrophic data loss, and we >> have >> >> plans to fix it in the future, but for now, I'm not going panic about >> >> it." >> > >> > Do you have a useful suggestion about how to fix it? "Stop working" >> is >> > handwaving and merely basically saying, "one of you people should do >> > something about this" is not a solution to the problem, it's not even >> an >> > approach towards a solution to the problem. >> >> Actually, it is not a solution to the problem of losing data. It is a >> drop >> dead last ditch failsafe that EVERY PRODUCT should have before losing >> data. > > Let's try again. Saying, "one of you people should do something about > this" is not a solution to the problem or an approach thereto. "Stop > working" is handwaving since I see no approach therein that allows the > user to actually recover the data. > Well, it is sort of like the hockey strike: now that it seems like stopping normal operation is better than losing billions of rows of data, we can decide how to do it and how to correct it.
On Wed, 16 Feb 2005, Tom Lane wrote: > Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > > Right, but since the how to resolve it currently involves executing a > > query, simply stopping dead won't allow you to resolve it. Also, if we > > stop at the exact wraparound point, can we run into problems actually > > trying to do the vacuum if that's still the resolution technique? > > We'd have to do something with a fair amount of slop. The idea I was > toying with just now involved a forcible shutdown once we get within > say 100,000 transactions of a wrap failure; but apply this check only > when in interactive operation. This would allow the DBA to perform > the needed VACUUMing manually in a standalone backend. > > The real question here is exactly how large a cluestick do you want to > hit the DBA with. I don't think we can "guarantee" no data loss with > anything less than forced shutdown, but that's not so much a cluestick > as a clue howitzer. > > Maybe > > (a) within 200,000 transactions of wrap, every transaction start > delivers a WARNING message; > > (b) within 100,000 transactions, forced shutdown as above. This seems reasonable, although perhaps the former could be something configurable. I'm not sure there's a good reason to allow the latter to change unless there'd ever be a case where 100,000 transactions wasn't enough to vacuum or something like that. All in all, I figure that odds are very high that if someone isn't vacuuming in the rest of the transaction id space, either the transaction rate is high enough that 100,000 warning may not be enough or they aren't going to pay attention anyway and the howitzer might not be bad.
On Wed, Feb 16, 2005 at 09:38:31 -0800, Stephan Szabo <sszabo@megazone.bigpanda.com> wrote: > On Wed, 16 Feb 2005, Tom Lane wrote: > > > (a) within 200,000 transactions of wrap, every transaction start > > delivers a WARNING message; > > > > (b) within 100,000 transactions, forced shutdown as above. > > This seems reasonable, although perhaps the former could be something > configurable. I'm not sure there's a good reason to allow the latter to > change unless there'd ever be a case where 100,000 transactions wasn't > enough to vacuum or something like that. I don't think there is much point in making it configurable. If they knew to do that they would most likely know to vacuum as well. However, 100K out of 1G seems too small. Just to get wrap around there must be a pretty high transaction rate, so 100K may not give much warning. 1M or 10M seem to be better.
Stephan Szabo wrote: > On Wed, 16 Feb 2005, Tom Lane wrote: > >>Stephan Szabo <sszabo@megazone.bigpanda.com> writes: >> >>(a) within 200,000 transactions of wrap, every transaction start >>delivers a WARNING message; >> >>(b) within 100,000 transactions, forced shutdown as above. > > > This seems reasonable, although perhaps the former could be something > configurable. I'm not sure there's a good reason to allow the latter to > change unless there'd ever be a case where 100,000 transactions wasn't > enough to vacuum or something like that. > > All in all, I figure that odds are very high that if someone isn't > vacuuming in the rest of the transaction id space, either the transaction > rate is high enough that 100,000 warning may not be enough or they aren't > going to pay attention anyway and the howitzer might not be bad. How would people feel about stopping after the first 100 transactions too? Pro: Teaches the lesson straight away. Con: Irritating Con: Might not be enough time for automated installers -- Richard Huxton Archonet Ltd
Tom Lane wrote: >Maybe > >(a) within 200,000 transactions of wrap, every transaction start >delivers a WARNING message; > >(b) within 100,000 transactions, forced shutdown as above. > > > > This seems sound enough, but if the DBA and/or SA can't be bothered reading the docs where this topic features quite prominently, I suspect the warning messages won't have much effect either. Basically ISTM we're talking about people who *need* a clue howitzer. This will possibly hit us more now we have the Windows port (or maybe not, if the Windows servers are regularly rebooted ;-) ) cheers andrew
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > All in all, I figure that odds are very high that if someone isn't > vacuuming in the rest of the transaction id space, either the transaction > rate is high enough that 100,000 warning may not be enough or they aren't > going to pay attention anyway and the howitzer might not be bad.

Yeah. It's easy to imagine scenarios where the majority of the warnings go into the bit bucket (because they are going to noninteractive client applications that just ignore NOTICE messages). So I think it's appropriate to be delivering the warnings for a good long time, in hopes that someone at least occasionally fires up psql and happens to actually see them. Something like 100K or 1M transactions feels about right to me. Pulling the failure trigger with 100K transactions still to go is surely overly conservative, but compared to the size of the ID space it is not worth noticing.

As far as the actual implementation, I was envisioning adding a limiting XID variable and a database name variable to shared memory (protected by the same LWLock that protects the nextXID counter). These would be computed and loaded during the bootstrap process, right after we finish WAL replay if any. It would probably cost us one XID to do this (though maybe it could be done without running a real transaction? This ties in with my thoughts about replacing GetRawDatabaseInfo with a flat file...), but one XID per postmaster start attempt is hopefully not gonna kill us.

Subsequently, any VACUUM that updates a datfrozenxid entry in pg_database would update these variables to reflect the new safe limit and the name of the database with the currently oldest datfrozenxid. This would allow a very cheap comparison during GetNewTransactionId to see if we are near enough to generate a warning:

WARNING: database "foo" must be vacuumed within 58372 transactions

or past the limit and generate an error:

ERROR: database is shut down to avoid wraparound data loss in database "foo"
HINT: Stop the postmaster and use a standalone backend to VACUUM in "foo".

In the error case, we could error out *without* advancing nextXID, so that even automated clients continually retrying failed transactions couldn't blow past the safety margin.

regards, tom lane
Bruno Wolff III <bruno@wolff.to> writes: > I don't think there is much point in making it configurable. If they knew > to do that they would most likely know to vacuum as well. Agreed. > However, 100K out of 1G seems too small. Just to get wrap around there > must be a pretty high transaction rate, so 100K may not give much warning. > 1M or 10M seem to be better. Good point. Even 10M is less than 1% of the ID space. Dunno about you, but the last couple cars I've owned start flashing warnings when the gas tank is about 20% full, not 1% full... regards, tom lane
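Until a check like that exists inside the server, an external watchdog can approximate the same kind of margin. A rough sketch building on the age(datfrozenxid) check shown earlier -- the threshold, database, log destination, and mail address are arbitrary choices, not anything decided in this thread:

    #!/bin/sh
    # alert when any database's frozen-XID age passes ~1.5 billion,
    # i.e. roughly 70% of the ~2 billion wrap distance
    LIMIT=1500000000
    AGE=`psql -t -A -d template1 -c "SELECT max(age(datfrozenxid)) FROM pg_database"`
    if [ "$AGE" -gt "$LIMIT" ]; then
        echo "XID wraparound approaching: max age(datfrozenxid) = $AGE" \
            | mail -s "PostgreSQL wraparound warning" dba@example.com
    fi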
"Joshua D. Drake" <jd@commandprompt.com> writes: > Christopher Kings-Lynne wrote: > > > I wonder if I should point out that we just had 3 people suffering XID > > wraparound failure in 2 days in the IRC channel... > > I have had half a dozen new customers in the last six months that have > had the same problem. Nothing like the phone call: How are so many people doing so many transactions so soon after installing? To hit wraparound you have to do a billion transactions? ("With a `B'") That takes real work. If you did 1,000 txn/minute for every minute of every day it would still take a couple years to get there. And most databases get a mix of updates and selects. I would expect it would be pretty hard to go that long with any significant level of update activity and no vacuums and not notice the performance problems from the dead tuples. What am I missing. Is there a significant percentage of the user base that's doing nothing but loading huge static databases and then performing massive loads (like thousands of queries per second) of purely read-only queries against them? -- greg
Greg Stark <gsstark@mit.edu> writes: > How are so many people doing so many transactions so soon after installing? > To hit wraparound you have to do a billion transactions? ("With a `B'") That > takes real work. If you did 1,000 txn/minute for every minute of every day it > would still take a couple years to get there. > And most databases get a mix of updates and selects. I would expect it would > be pretty hard to go that long with any significant level of update activity > and no vacuums and not notice the performance problems from the dead tuples. I think the people who've managed to shoot themselves in the foot this way are those who decided to "optimize" their cron jobs to only vacuum their user tables, and forgot about the system catalogs. So it's probably more of a case of "a little knowledge is a dangerous thing" than never having heard of VACUUM at all. I too don't see that you could possibly get to 2 billion transactions without having found out that Postgres requires regular VACUUMing. regards, tom lane
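The "optimized" cron job Tom describes typically looks like the first form below; the table names are made up for illustration. The per-table commands never touch pg_class, pg_attribute, and the rest of the catalogs, while a bare database-wide VACUUM (or vacuumdb with no table list) does:

    # risky: only the named user tables ever get vacuumed
    psql -d mydb -c "VACUUM ANALYZE orders"
    psql -d mydb -c "VACUUM ANALYZE customers"

    # safer: a plain database-wide VACUUM also covers the system catalogs
    psql -d mydb -c "VACUUM ANALYZE"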
> I think the people who've managed to shoot themselves in the foot this > way are those who decided to "optimize" their cron jobs to only vacuum > their user tables, and forgot about the system catalogs. So it's > probably more of a case of "a little knowledge is a dangerous thing" > than never having heard of VACUUM at all. I too don't see that you > could possibly get to 2 billion transactions without having found out > that Postgres requires regular VACUUMing. I have had two new customers in the last year who didn't do any vacuums that had a rollover. The database sat in a backroom and did processing. It just worked so they didn't worry about it. It took one of them almost two years to get there but it does happen. Sincerely, Joshua D. Drake > > regards, tom lane -- Command Prompt, Inc., your source for PostgreSQL replication, professional support, programming, managed services, shared and dedicated hosting. Home of the Open Source Projects plPHP, plPerlNG, pgManage, and pgPHPtoolkit. Contact us now at: +1-503-667-4564 - http://www.commandprompt.com
Tom Lane wrote: >pgsql@mohawksoft.com writes: > > >>Maybe I'm missing something, but shouldn't the prospect of data loss (even >>in the presense of admin ignorance) be something that should be >>unacceptable? Certainly within the realm "normal PostgreSQL" operation. >> >> > >Once autovacuum gets to the point where it's used by default, this >particular failure mode should be a thing of the past, but in the >meantime I'm not going to panic about it. > Which I hope will be soon.
Greg Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > >>Christopher Kings-Lynne wrote: >> >> >>>I wonder if I should point out that we just had 3 people suffering XID >>>wraparound failure in 2 days in the IRC channel... >> >>I have had half a dozen new customers in the last six months that have >>had the same problem. Nothing like the phone call: > > > How are so many people doing so many transactions so soon after installing? > > To hit wraparound you have to do a billion transactions? ("With a `B'") That > takes real work. If you did 1,000 txn/minute for every minute of every day it > would still take a couple years to get there. We do ~4000 txn/minute, so in 6 months you are screwed... Regards Gaetano Mendola
Stephan Szabo wrote: > On Wed, 16 Feb 2005 pgsql@mohawksoft.com wrote: > > >>>Once autovacuum gets to the point where it's used by default, this >>>particular failure mode should be a thing of the past, but in the >>>meantime I'm not going to panic about it. >> >>I don't know how to say this without sounding like a jerk, (I guess that's >>my role sometimes) but would you go back and re-read this sentence? >> >>To paraphrase: "I know this causes a catestrophic data loss, and we have >>plans to fix it in the future, but for now, I'm not going panic about it." > > > Do you have a useful suggestion about how to fix it? "Stop working" is > handwaving and merely basically saying, "one of you people should do > something about this" is not a solution to the problem, it's not even an > approach towards a solution to the problem. It is not a solution, but between losing data and shutting down the postmaster, I prefer the shutdown. Regards Gaetano Mendola
Tom Lane wrote: > Bruno Wolff III <bruno@wolff.to> writes: > >>I don't think there is much point in making it configurable. If they knew >>to do that they would most likely know to vacuum as well. > > > Agreed. > > >>However, 100K out of 1G seems too small. Just to get wrap around there >>must be a pretty high transaction rate, so 100K may not give much warning. >>1M or 10M seem to be better. > > > Good point. Even 10M is less than 1% of the ID space. Dunno about you, > but the last couple cars I've owned start flashing warnings when the gas > tank is about 20% full, not 1% full... BTW, why not do an automatic vacuum instead of a shutdown? At least then the DB does not stop working until someone studies what the problem is and how to solve it. Regards Gaetano Mendola
Gaetano Mendola <mendola@bigfoot.com> writes: > BTW, why not do an automatic vacuum instead of shutdown ? At least the > DB do not stop working untill someone study what the problem is and > how solve it. No, the entire point of this discussion is to whup the DBA upside the head with a big enough cluestick to get him to install autovacuum. Once autovacuum is default, it won't matter anymore. regards, tom lane
Gaetano Mendola <mendola@bigfoot.com> writes: > We do ~4000 txn/minute so in 6 month you are screewd up... Sure, but if you ran without vacuuming for 6 months, wouldn't you notice the huge slowdowns from all those dead tuples before that? -- greg
On 17 Feb 2005, Greg Stark wrote: > Gaetano Mendola <mendola@bigfoot.com> writes: > > > We do ~4000 txn/minute so in 6 month you are screewd up... > > Sure, but if you ran without vacuuming for 6 months, wouldn't you notice the > huge slowdowns from all those dead tuples before that? Most people that we have seen on IRC who run into the problem do vacuum some tables, but forget to vacuum them all. Then their tables work fine, but suddenly some system tables like pg_database and pg_shadow become empty since they were never vacuumed... -- /Dennis Björklund
>>And most databases get a mix of updates and selects. I would expect it would >>be pretty hard to go that long with any significant level of update activity >>and no vacuums and not notice the performance problems from the dead tuples. > > > I think the people who've managed to shoot themselves in the foot this > way are those who decided to "optimize" their cron jobs to only vacuum > their user tables, and forgot about the system catalogs. That's certainly the case with one of the people we helped in IRC - 3 user tables only being vacuumed. Chris
> Gaetano Mendola <mendola@bigfoot.com> writes: > >> We do ~4000 txn/minute so in 6 month you are screewd up... > > Sure, but if you ran without vacuuming for 6 months, wouldn't you notice > the > huge slowdowns from all those dead tuples before that? > > I would think that only applies to databases where UPDATE and DELETE are done often. What about databases that are 99.999% inserts? A DBA lightly going over the docs may not even know that vacuum needs to be run.
Greg Stark wrote: > Gaetano Mendola <mendola@bigfoot.com> writes: > > >>We do ~4000 txn/minute so in 6 month you are screewd up... > > > Sure, but if you ran without vacuuming for 6 months, wouldn't you notice the > huge slowdowns from all those dead tuples before that? > In my applications, yes - for sure I see the huge slowdown after 2 days without it - but given the fact that someone crossed the limit, I imagine that it is possible without a performance loss. Regards Gaetano Mendola
Tom Lane wrote: > Gaetano Mendola <mendola@bigfoot.com> writes: > > BTW, why not do an automatic vacuum instead of shutdown ? At least the > > DB do not stop working untill someone study what the problem is and > > how solve it. > > No, the entire point of this discussion is to whup the DBA upside the > head with a big enough cluestick to get him to install autovacuum. > > Once autovacuum is default, it won't matter anymore. I have a concern about this that I hope is just based on some misunderstanding on my part. My concern is: suppose that a database is modified extremely infrequently? So infrequently, in fact, that over a billion read transactions occur before the next write transaction. Once that write transaction occurs, you're hosed, right? Autovacuum won't catch this because it takes action based on the write activity that occurs in the tables. So: will autovacuum be coded to explicitly look for transaction wraparound, or to automatically vacuum every N number of transactions (e.g., 500 million)? -- Kevin Brown kevin@sysexperts.com
Just wondering after this discussion: Is transaction wraparound limited to a database or to an installation? I.e., can heavy traffic in one db affect another db in the same installation?
On Fri, 18 Feb 2005 04:38 pm, Kevin Brown wrote: > Tom Lane wrote: > > Gaetano Mendola <mendola@bigfoot.com> writes: > > > BTW, why not do an automatic vacuum instead of shutdown ? At least the > > > DB do not stop working untill someone study what the problem is and > > > how solve it. > > > > No, the entire point of this discussion is to whup the DBA upside the > > head with a big enough cluestick to get him to install autovacuum. > > > > Once autovacuum is default, it won't matter anymore. > > I have a concern about this that I hope is just based on some > misunderstanding on my part. > > My concern is: suppose that a database is modified extremely > infrequently? So infrequently, in fact, that over a billion read > transactions occur before the next write transaction. Once that write > transaction occurs, you're hosed, right? Autovacuum won't catch this > because it takes action based on the write activity that occurs in the > tables. > > So: will autovacuum be coded to explicitly look for transaction > wraparound, or to automatically vacuum every N number of transactions > (e.g., 500 million)? > autovacuum already checks for both Transaction wraparound, and table updates. It vacuums individual tables as they need it, from a free space/recovery point of view. It also does checks to ensure that no database is nearing transaction wraparound, if it is, it initiates a database wide vacuum to resolve that issue. Regards Russell Smith > >
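At the time of this thread, autovacuum is still the contrib/pg_autovacuum daemon rather than a built-in default, so it has to be started separately against a running postmaster. A minimal sketch -- the log path is arbitrary, and the exact flags may differ between versions, so check the contrib README for your release:

    # start the contrib autovacuum daemon (checks both table activity and XID wraparound)
    pg_autovacuum -D -L /var/log/pg_autovacuum.log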
On Fri, 18 Feb 2005 08:53 pm, Jürgen Cappel wrote: > Just wondering after this discussion: > > Is transaction wraparound limited to a database or to an installation ? > i.e. can heavy traffic in one db affect another db in the same installation ? > XIDs are global to the pg cluster, or installation. So even a database that is not being used can still hit XID wraparound. Regards Russell Smith.
On Thursday 17 February 2005 07:47, pgsql@mohawksoft.com wrote: > > Gaetano Mendola <mendola@bigfoot.com> writes: > >> We do ~4000 txn/minute so in 6 month you are screewd up... > > > > Sure, but if you ran without vacuuming for 6 months, wouldn't you notice > > the > > huge slowdowns from all those dead tuples before that? > > I would think that only applies to databases where UPDATE and DELETE are > done often. What about databases that are 99.999% inserts? A DBA lightly > going over the docs may not even know that vacuum needs to be run. > Yup... I don't vacuum at least 100 of the tables in my schema because they are continuous-insert with big deletions once every 6 months or so. Generally speaking, it isn't worth the performance hit to vacuum these big tables regularly, so I only do it when I have to.... like every six months when wraparound gets close. -- Robert Treat Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Russell Smith wrote: >On Fri, 18 Feb 2005 04:38 pm, Kevin Brown wrote: > > >>Tom Lane wrote: >> >> >>> No, the entire point of this discussion is to whup the DBA upside the >>> >>>head with a big enough cluestick to get him to install autovacuum. >>> >>>Once autovacuum is default, it won't matter anymore. >>> >>> >>I have a concern about this that I hope is just based on some >>misunderstanding on my part. >> >>My concern is: suppose that a database is modified extremely >>infrequently? So infrequently, in fact, that over a billion read >>transactions occur before the next write transaction. Once that write >>transaction occurs, you're hosed, right? Autovacuum won't catch this >>because it takes action based on the write activity that occurs in the >>tables. >> >>So: will autovacuum be coded to explicitly look for transaction >>wraparound, or to automatically vacuum every N number of transactions >>(e.g., 500 million)? >> >> >autovacuum already checks for both Transaction wraparound, and table updates. >It vacuums individual tables as they need it, from a free space/recovery point of view. > >It also does checks to ensure that no database is nearing transaction wraparound, if it >is, it initiates a database wide vacuum to resolve that issue. > Right, the check that autovacuum does for wraparound is totally separate from the monitoring of inserts updates and deletes.
Matthew T. O'Connor wrote: > Tom Lane wrote: > > >pgsql@mohawksoft.com writes: > > > > > >>Maybe I'm missing something, but shouldn't the prospect of data loss (even > >>in the presense of admin ignorance) be something that should be > >>unacceptable? Certainly within the realm "normal PostgreSQL" operation. > >> > >> > > > >Once autovacuum gets to the point where it's used by default, this > >particular failure mode should be a thing of the past, but in the > >meantime I'm not going to panic about it. > > > Which I hope will be soon. I am ready to help you implement integrated autovacuum in 8.1. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Does auto_vacuum vacuum the system tables? -tfo -- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-260-0005 On Feb 16, 2005, at 5:42 PM, Matthew T. O'Connor wrote: > Tom Lane wrote: > >> pgsql@mohawksoft.com writes: >> >>> Maybe I'm missing something, but shouldn't the prospect of data loss >>> (even >>> in the presense of admin ignorance) be something that should be >>> unacceptable? Certainly within the realm "normal PostgreSQL" >>> operation. >>> >> >> Once autovacuum gets to the point where it's used by default, this >> particular failure mode should be a thing of the past, but in the >> meantime I'm not going to panic about it. >> > Which I hope will be soon.
Thomas F.O'Connell wrote: > Does auto_vacuum vacuum the system tables? Yes