Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart? - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Date
Msg-id CAHyXU0w-7r6e1rL4wqz5=z2Jg9=YDLBZLA=iHKqdv1jVuKFP4g@mail.gmail.com
In response to Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?  (yuansong <yyuansong@126.com>)
List pgsql-hackers
On Mon, Nov 13, 2023 at 3:14 AM yuansong <yyuansong@126.com> wrote:

> Enhancing the overall fault tolerance of the entire system for this feature is quite important. No one can avoid bugs, and I don't believe that this approach is a more advanced one. It might be worth considering adding it to the roadmap so that interested parties can conduct relevant research.
>
> The current major issue is that when one process crashes, resetting all connections has a significant impact on other connections. Is it possible to only disconnect the crashed connection and have the other connections simply roll back the current transaction without reconnecting? Perhaps this problem can be mitigated through the use of a connection pool.
>
> If we want to implement this feature, would it be sufficient to clean up or restore the shared memory and disk changes caused by the crashed backend? Besides clearing various known locks, what else needs to be changed? Does anyone have any insights? Please help me. Thank you.


One thing that's really key to understand about postgres is that there is a different set of rules regarding what is the database's job to solve vs. what is left to supporting libraries and frameworks. It isn't that hard to wait and retry a query in most applications, and it is up to you to do that. There are also various connection poolers that might implement retry logic, and not having to work through those concerns in the server keeps the code lean and has other benefits. While postgres might implement things like a built-in connection pooler, 'o_direct' type memory management, and things like that, there are long-term costs to doing them.
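
To make the "wait and retry" point concrete, here is a minimal Python sketch of application-side retry with a doubling delay. It is only an illustration under assumptions: `ConnectionLost` and `run_with_retry` are hypothetical names, not part of postgres or any driver, and in a real application you would catch whatever connection-failure error your driver raises (for example `psycopg2.OperationalError`).

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's connection-failure error."""


def run_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionLost, wait and then try again.

    The wait doubles after each failure. If the final attempt also
    fails, the error is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionLost:
            if attempt == attempts - 1:
                raise
            # Give the server time to finish crash recovery.
            sleep(base_delay * (2 ** attempt))
```

Since a crash reset terminates every session, `operation` would need to open a fresh connection and rerun the whole transaction, not just the last statement.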

There's another side to this. Suppose I had to choose between a hypothetical postgres that had some kind of process-local crash recovery and the current implementation. I might still choose the current implementation because, in general, crashes are good, and the full reset has a much better chance of clearing the underlying issue that caused the problem vs. merely managing its symptoms.

merlin
