Thread: Re: Safely Killing Backends (Was: Applications that leak connections)
> On Fri, Feb 04, 2005 at 01:44:10PM -0600, Thomas F.O\'Connell wrote: > > Is there any stronger medicine that\'s available (for instance, when a > > backend won\'t respond to SIGTERM) and has no unfortunate side effects? > > I just ran into this situation the other day (and made the unfortunate > > discovery that SIGABRT is as bad as SIGKILL as far as a postmaster is > > concerned). > > As soon as a backend dies a unnatural death, postmaster will rightly > consider that it may have corrupted the shared state. In turn > postmaster will kill all its children mercilessly so they don\'t spread > the disease. > > Even SIGTERM can have bad consequences if it arrives at the wrong time. > (That\'s why a function to close a remote connection was rejected.) > > So, short answer: no. This could be better than what is however. Management would be easier if there was a way to trigger a series of behaviors on a given signal to a child: The child (1) cancels and rollbacks any transactions it has open, (2) enters a mode where it attempts to communicate with the client and failing so does an orderly connection close. I would never go back to them, but I can say that the Sybase SQL Studio servers where much easier to manage in this regard. If you are not very careful about how you handle orphaned connections in Postgres you will likely lose data....not "maybe" like a long shot...but "likely". I suppose if your data is fairly static (e.g. website cms) then this would not happen often, but anything with a lot of tansactions it will. The best protection is to do extensive testing with any application you use or develop, but that\'s not possible for everyone to do a sufficient amount of testing to avoid some of these issues. If I was submitting patches for Postgres I\'d push a little harder, and if I were, this problem would be at the top of my list as things to fix in Postgres. Best regards, Jim Wilson
Jim Wilson <jimw@kelcomaine.com> writes: > If you are not very careful about how you handle orphaned connections > in Postgres you will likely lose data....not "maybe" like a long > shot...but "likely". [ raised eyebrow ... ] Say again? I don't know of any reason why a lost connection would cause loss of (successfully committed) transactions. Not even if a DBA with an itchy "kill -9" trigger finger is in charge of cleaning up the lost connections. Please describe the scenarios you've had problems with. regards, tom lane
> Jim Wilson <jimw@kelcomaine.com> writes: > > If you are not very careful about how you handle orphaned connections > > in Postgres you will likely lose data....not "maybe" like a long > > shot...but "likely". > > [ raised eyebrow ... ] Say again? I don\'t know of any reason why a > lost connection would cause loss of (successfully committed) transactions. > Not even if a DBA with an itchy "kill -9" trigger finger is in charge of > cleaning up the lost connections. Please describe the scenarios you\'ve > had problems with. > > regards, tom lane > Well a couple things... One is I\\\'m talking about 7.3.x. We\\\'ll be moving our servers up to 7.4.x before the spring, but that\\\'s where these observations have been and maybe there are certain issues at that level. It is sometimes difficult to track transaction related issues down anyway, but I can say that in testing and earlier deployment we saw some things that did not look good and in practice we are very carefull dealing with lost connection issues. No kill -9 trigger fingers. Rather than getting into the raised eyebrow thing ;-), I\\\'d suggest checking your "qualifiers". Consider that with Postgres, if killing a single connection brings the whole server down, you will loose _all_ uncommitted data. If you did not, then I would call that a bug. The weakness is not in the data integrity (directly), it is in the integrity of the server processes and their managability. Best regards, Jim
On Fri, Feb 04, 2005 at 05:01:43PM -0500, Jim Wilson wrote: > Rather than getting into the raised eyebrow thing ;-), I\\\'d suggest > checking your "qualifiers". Consider that with Postgres, if killing a > single connection brings the whole server down, you will loose _all_ > uncommitted data. If you did not, then I would call that a bug. The > weakness is not in the data integrity (directly), it is in the > integrity of the server processes and their managability. Are you saying that your applications regularly leave uncommitted transactions for long periods of time? That sounds like bugs in your applications to me. Maybe I didn't get the part about lost connections. Do you mean that you applications lose conectivity to the server, and thus the transaction they were working with are lost? If that's the case, then it certainly sounds dangerous to commit whatever was there; what if the transaction was incomplete? Of course, if you can't commit it, the only way to proceed is to roll it back. What's with the backslashes anyway? -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) Y dijo Dios: "Que sea Satanás, para que la gente no me culpe de todo a mí." "Y que hayan abogados, para que la gente no culpe de todo a Satanás"
Jim Wilson <jimw@kelcomaine.com> writes: > Rather than getting into the raised eyebrow thing ;-), I\\\'d suggest > checking your "qualifiers". Consider that with Postgres, if killing a > single connection brings the whole server down, you will loose _all_ > uncommitted data. If you did not, then I would call that a bug. I would too. So what's your complaint exactly? That kill -9'ing one backend takes out all your current uncommitted transactions and not just the one? I don't regard that as data loss: uncommitted is uncommitted. regards, tom lane
> Jim Wilson <jimw@kelcomaine.com> writes: > > Rather than getting into the raised eyebrow thing , I\\\\\\\'d suggest > > checking your "qualifiers". Consider that with Postgres, if killing a > > single connection brings the whole server down, you will loose _all_ > > uncommitted data. If you did not, then I would call that a bug. > > I would too. So what\'s your complaint exactly? That kill -9\'ing one > backend takes out all your current uncommitted transactions and not just > the one? I don\'t regard that as data loss: uncommitted is uncommitted. > > regards, tom lane We\'d like to be able to take out a connection without risking postmaster going down and thus losing all uncommitted data. Unfortunately the whole world isn\'t encapsulated in Postgres transactions or we would never have to worry about uncommitted data. Just because your tables are in sync, and your applications are designed in a way that coordinates with real world events and/or multiple entity transactions (e.g. electronic payments) as well as possible, doesn\'t mean that data integrity isn\'t put at risk by the loss of uncommitted information. This "problem", which is always there for the application designer, is exasperated when the whole server goes down because of an admin who, as delicately as possible, is just trying to remove an orphaned connection that\'s blocking transactions. Would I stop using Postgres if this was never changed? No way. Do I think it could be better than it is? Absolutely. For our purposes, it would be more important than some things like further query optimization, etc., that would be at the top of other user\'s lists. Best regards, Jim
> On Fri, Feb 04, 2005 at 05:01:43PM -0500, Jim Wilson wrote: > > > Rather than getting into the raised eyebrow thing , I\\\\\\\'d suggest > > checking your "qualifiers". Consider that with Postgres, if killing a > > single connection brings the whole server down, you will loose _all_ > > uncommitted data. If you did not, then I would call that a bug. The > > weakness is not in the data integrity (directly), it is in the > > integrity of the server processes and their managability. > > Are you saying that your applications regularly leave uncommitted > transactions for long periods of time? That sounds like bugs in your > applications to me. > I never said that. <snip> > > What\'s with the backslashes anyway? > Well, I\'m beta testing an email client. Good question! :-) Best, Jim Wilson
On Fri, 4 Feb 2005, Jim Wilson wrote: >> regards, tom lane > > We\'d like to be able to take out a connection without risking > postmaster going > down and thus losing all uncommitted data. > > Unfortunately the whole world isn\'t encapsulated in Postgres > transactions or we > would never have to worry about uncommitted data. Just because your > tables are > in sync, and your applications are designed in a way that coordinates > with real > world events and/or multiple entity transactions (e.g. electronic > payments) as well > as possible, doesn\'t mean that data integrity isn\'t put at risk by the > loss of > uncommitted information. I'm sorry, but I cannot make any sense out of the last sentence. By definition, data integrity refers only to committed data. Uncommitted data is not data at all, for both the parties involved. The server may receive a rollback anytime (which is not a delete!) and the client may get an error from the server, even at commit time (the last step). The network may go down in the middle of it. _Anything_ may happen. If your application is not able to handle a failure _before_ commit time, you should fix it. > This "problem", which is always there for the application designer, is > exasperated > when the whole server goes down because of an admin who, as delicately > as possible, > is just trying to remove an orphaned connection that\'s blocking > transactions. Would > I stop using Postgres if this was never changed? No way. Do I think it > could be > better than it is? Absolutely. For our purposes, it would be more > important than > some things like further query optimization, etc., that would be at the > top of other > user\'s lists. I think that the most important thing, for your purpose, as an "application designer", is to fix your application. There's no such a thing as "uncommitted data", it's not an intermediate state, it's not some kind of "less important" data. Let T be the commit time, before T there's _nothing_ and after T there's _everything_ (all of it). That's the A in ACID. Your application should handle failures in the middle of a transaction, connection failures included, in a graceful but correct way. I see your point (being able to safely shut a connection down on the server side), but it's at the _bottom_ of any list. .TM. -- ____/ ____/ / / / / Marco Colombo ___/ ___ / / Technical Manager / / / ESI s.r.l. _____/ _____/ _/ Colombo@ESI.it
> > Your application should handle failures in the middle of a transaction, > connection failures included, in a graceful but correct way. It does very well, until the next bug is discovered. > > I see your point (being able to safely shut a connection down on the > server side), but it\'s at the _bottom_ of any list. > > .TM. > -- > / / / > / / / Marco Colombo That\'s unfortunate. I\'ve tried to explain my position off list to Marco, but it really isn\'t worth debating. FWIW I think this thread was started by someone with application issues. The fact is, such things happen. Unfortunately Marco choses speaks for "any list" and I\'ll just repeat that I find this instability issue the most significant drawback for Postgres installations. This doesn\'t mean that there aren\'t other areas of priority for other users. And no, I do not want to debate the meaning of the word "instability". :-) Best regards, Jim Wilson
On Tue, 8 Feb 2005, Jim Wilson wrote: >> >> Your application should handle failures in the middle of a > transaction, >> connection failures included, in a graceful but correct way. > > It does very well, until the next bug is discovered. > >> >> I see your point (being able to safely shut a connection down on the >> server side), but it\'s at the _bottom_ of any list. >> >> .TM. >> -- >> / / / >> / / / Marco Colombo > > That\'s unfortunate. I\'ve tried to explain my position off list to > Marco, > but it really isn\'t worth debating. FWIW I think this thread was > started > by someone with application issues. The fact is, such things happen. > > Unfortunately Marco choses speaks for "any list" and I\'ll just > repeat that I find this instability issue the most significant drawback > > for Postgres installations. This doesn\'t mean that there aren\'t other > areas > of priority for other users. And no, I do not want to debate the > meaning > of the word "instability". :-) > > Best regards, > > Jim Wilson As I wrote in private mail, authenticated clients have many means to perform a DoS attack (whether intentionally or not). Most of cases can be handled only with a server restart. To put simply, PostgreSQL is not designed to handle hostile clients well. IMHO, a friendly enviroment (client behaviour) is a safe assumption for a RDBMS. It's not its job to paperbag over application bugs. Anyway, I agree in ending this thread. I recognize we have different meanings for "instability" and "data loss". .TM. -- ____/ ____/ / / / / Marco Colombo ___/ ___ / / Technical Manager / / / ESI s.r.l. _____/ _____/ _/ Colombo@ESI.it
Re: Safely Killing Backends (Was: Applications that leak connections)
From
Martijn van Oosterhout
Date:
On Tue, Feb 08, 2005 at 07:31:13AM -0500, Jim Wilson wrote: > That\'s unfortunate. I\'ve tried to explain my position off list to > Marco, but it really isn\'t worth debating. FWIW I think this thread > was started by someone with application issues. The fact is, such > things happen. Well, I read the thread on pg-hackers [1] about this being a bad idea currently and the issue seems to be: 1. The SIGTERM is the same as a FATAL error and this code path has not been very well tested. Are locks, etc all correctly removed? The only cases that *are* well tested are cases where these things don't matter. In other words, it will probably work fine, but it's not so well tested that the pg hackers are willing to bless a backend function implementing it. 2. If the backend is so stuck that SIGTERM isn't working, then I guess that's a bug but not enough examples have been collected to work out the problem. In this case you probably can't exit without considering the shared memory corrupt. 3. In theory it would be nice to have a "cancel then exit" signal, but we're clean out of signal numbers. 4. It appears the original person had a problem with not tracking used resources properly in a language that neither garbage-collects nor reference-counts. If you know you only ever want to open one connection you can solve this problem by creating an open_connection function which checks a global variable to see if a connection has already been opened and returns the same one if it has. > Unfortunately Marco choses speaks for "any list" and I\'ll just > repeat that I find this instability issue the most significant > drawback for Postgres installations. This doesn\'t mean that there > aren\'t other areas of priority for other users. And no, I do not > want to debate the meaning of the word "instability". :-) I guess it appears on the list of anybody who regularly deals with this problem. That list appears to be mutally exclusive with anyone who can fix it... I wonder how one would test the SIGTERM path anyway... To quote Tom Lane on chances of corruption [2]: > Not only wouldn't I give you those odds today, but I don't think we > could ever get to the point of saying that session kill is that > reliable, at least not from our ordinary methods of field testing. > It'd require significant focused code review and testing to acquire > such confidence, and continuing effort to make sure we didn't break > it again in the future. > > If we had infinite manpower I'd be happy to delegate a developer or > three to stay on top of this particular issue. But we don't :-( I don't know if PostgreSQL has ever had the concept of bounties for stuff. It's an interesting idea... [1] http://archives.postgresql.org/pgsql-patches/2004-07/msg00457.php [2] http://archives.postgresql.org/pgsql-patches/2004-07/msg00480.php Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.