Re: [Incident report] Backend process crashed when executing 2pc transaction - Mailing list pgsql-hackers

From Amit Langote
Subject Re: [Incident report] Backend process crashed when executing 2pc transaction
Date
Msg-id CA+HiwqGxmSxu8e07sNLEmKJqFm7-69QhidjA+huA1ifm0n1CnA@mail.gmail.com
In response to Re: [Incident report] Backend process crashed when executing 2pc transaction  (Marco Slot <marco@citusdata.com>)
Responses RE: [Incident report] Backend process crashed when executing 2pc transaction  (Ranier Vilela <ranier_gyn@hotmail.com>)
List pgsql-hackers
Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot <marco@citusdata.com> wrote:
>
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Interesting.  Still, I think you'd be in better position than anyone
> > else to come up with reproduction steps for vanilla PostgreSQL by
> > analyzing the stack trace if and when the crash next occurs (or using
> > the existing core dump).  It's hard to tell by only guessing what may
> > have gone wrong when there is external code involved, especially
> > something like Citus that hooks into many points within vanilla
> > PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_<coordinator node
> id>_<pid>_<server-wide transaction number >_<prepared transaction
> number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Interesting.  This does get us somewhat closer to figuring out what
might have gone wrong, but it's hard to tell without the core dump at hand.

Thanks,
Amit


