Re: [Incident report]Backend process crashed when executing 2pc transaction - Mailing list pgsql-hackers

From Marco Slot
Subject Re: [Incident report]Backend process crashed when executing 2pc transaction
Date
Msg-id CANNhMLAjdTUzdwL50f8LX09je1jh+bZ6C4i=iZh8hgDEH0i0QA@mail.gmail.com
In response to Re: [Incident report]Backend process crashed when executing 2pc transaction  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: [Incident report]Backend process crashed when executing 2pc transaction  (Amit Langote <amitlangote09@gmail.com>)
List pgsql-hackers
On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Interesting.  Still, I think you'd be in better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump).  It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator which
contains the "distributed tables" and one or more workers which
contain the data. All are PostgreSQL servers with the citus extension.
The coordinator uses every available hook in PostgreSQL to make the
distributed tables behave like regular tables. Any crash on the
coordinator is likely to be attributable to Citus, because most of the
code that is exercised is Citus code. The workers are used as regular
PostgreSQL servers with the coordinator acting as a regular client. On
the worker, the ProcessUtility hook will just pass on the arguments to
standard_ProcessUtility without any processing. The crash happened on
a worker.

One interesting thing is the prepared transaction name generated by
the coordinator, which follows the form: citus_<coordinator node
id>_<pid>_<server-wide transaction number>_<prepared transaction
number in session>. The server-wide transaction number is a 64-bit
counter that is kept in shared memory and starts at 1. That means that
over 4 billion (4207001212) transactions happened on the coordinator
since the server started, which quite possibly resulted in 4 billion
prepared transactions on this particular server. I'm wondering if some
counter is overflowing.
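
To make the overflow suspicion concrete, here is a rough Python sketch
that splits a name of the form described above and compares the
server-wide transaction number against the 32-bit limits. The function
names are made up for illustration, and the node id, pid and
per-session number in the sample name are hypothetical; only
4207001212 comes from the incident. It does not show that any
particular PostgreSQL counter is 32 bits wide, only how close this
value is to the usual limits.

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value


def parse_citus_gid(gid):
    """Split citus_<node id>_<pid>_<txn number>_<session txn number>.

    Hypothetical helper: the field layout follows the description in
    this mail, not any Citus source code.
    """
    parts = gid.split("_")
    if len(parts) != 5 or parts[0] != "citus":
        raise ValueError(f"unexpected prepared transaction name: {gid!r}")
    node_id, pid, txn_number, session_txn_number = (int(p) for p in parts[1:])
    return {
        "node_id": node_id,
        "pid": pid,
        "txn_number": txn_number,
        "session_txn_number": session_txn_number,
    }


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & UINT32_MAX
    return low - 2**32 if low > INT32_MAX else low


if __name__ == "__main__":
    # Node id 0, pid 12345 and session number 1 are placeholders.
    fields = parse_citus_gid("citus_0_12345_4207001212_1")
    txn = fields["txn_number"]
    print("exceeds signed 32-bit range:  ", txn > INT32_MAX)
    print("exceeds unsigned 32-bit range:", txn > UINT32_MAX)
    print("left before unsigned wrap:    ", UINT32_MAX - txn)
    print("value if stored in an int32:  ", as_int32(txn))
```

The value is already past the signed 32-bit range and roughly 88
million short of the unsigned 32-bit limit, so a signed 32-bit counter
tracking something per prepared transaction would have wrapped a while
ago, whereas an unsigned one would not have wrapped yet.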

cheers,
Marco


