Re: How to solve the problem of one backend process crashing and causing other processes to restart? - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Date	November 13, 2023 02:55:48
Msg-id	1879095.1699844148@sss.pgh.pa.us
In response to	How to solve the problem of one backend process crashing and causing other processes to restart? (yuansong <yyuansong@126.com>)
Responses	Re: How to solve the problem of one backend process crashing and causing other processes to restart?
List	pgsql-hackers

yuansong <yyuansong@126.com> writes:
> In PostgreSQL, when a backend process crashes, it can cause other
> backend processes to also require a restart, primarily to ensure data
> consistency. I understand that the correct approach is to analyze and
> identify the cause of the crash and resolve it. However, it is also
> important to be able to handle a backend process crash without
> affecting the operation of other processes, thus minimizing the scope
> of negative impact and improving availability. To achieve this goal,
> could we mimic the Oracle process by introducing a "pmon" process
> dedicated to rolling back crashed process transactions and performing
> resource cleanup? I wonder if anyone has attempted such a strategy or
> if there have been previous discussions on this topic.

The reason we force a database-wide restart is that there's no way to
be certain that the crashed process didn't corrupt anything in shared
memory. (Even with the forced restart, there's a window where bad
data could reach disk before we kill off the other processes that
might write it. But at least it's a short window.) "Corruption"
here doesn't just involve bad data placed into disk buffers; more
often it's things like unreleased locks, which would block other
processes indefinitely.
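
To make the unreleased-lock case concrete, here is a minimal Python
sketch of the idea. It is not PostgreSQL's actual postmaster code: the
names SharedState, Supervisor, acquire and backend_exited are
hypothetical, and shared memory is reduced to a toy lock table. It
assumes the supervisor cannot tell which entries a crashed backend left
behind, so on an unclean exit it terminates the survivors and
reinitializes everything.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Toy stand-in for shared memory: lock name -> pid of the holder."""
    locks: dict = field(default_factory=dict)


@dataclass
class Supervisor:
    """Hypothetical postmaster-like parent (illustration only)."""
    shared: SharedState = field(default_factory=SharedState)
    backends: set = field(default_factory=set)
    restarts: int = 0

    def start_backend(self, pid):
        self.backends.add(pid)

    def acquire(self, pid, lock):
        """Return True if pid now holds the lock, False if it would block."""
        holder = self.shared.locks.get(lock)
        if holder is not None and holder != pid:
            return False
        self.shared.locks[lock] = pid
        return True

    def backend_exited(self, pid, clean):
        self.backends.discard(pid)
        if clean:
            # A clean exit is assumed to have run its own cleanup.
            self.shared.locks = {
                name: holder
                for name, holder in self.shared.locks.items()
                if holder != pid
            }
            return
        # Crash: nothing in shared state can be trusted, so terminate
        # every surviving backend and start over from a fresh state.
        self.backends.clear()
        self.shared = SharedState()
        self.restarts += 1


sup = Supervisor()
sup.start_backend(1)
sup.start_backend(2)
sup.acquire(1, "relation_a")

# Backend 2 would wait on the lock; if backend 1 died without releasing
# it and nothing reset the table, that wait would never end.
blocked = not sup.acquire(2, "relation_a")

sup.backend_exited(1, clean=False)   # crash forces the full reset

sup.start_backend(3)
acquired_after_restart = sup.acquire(3, "relation_a")
```

The sketch deliberately makes no attempt to release only the crashed
backend's entries on the crash path: doing that safely is the part
that would require trusting state the dead process may have damaged.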

I seriously doubt that anything like what you're describing
could be made reliable enough to be acceptable. "Oracle does
it like this" isn't a counter-argument: they have a much different
(and non-extensible) architecture, and they also have an army of
programmers to deal with minutiae like undoing resource acquisition.
Even with that, you'd have to wonder about the number of bugs
existing in such necessarily-poorly-tested code paths.

regards, tom lane
