Re: What to do when dynamic shared memory control segment is corrupt - Mailing list pgsql-general

From Tom Lane
Subject Re: What to do when dynamic shared memory control segment is corrupt
Date
Msg-id 13596.1529379612@sss.pgh.pa.us
In response to Re: What to do when dynamic shared memory control segment is corrupt  (Sherrylyn Branchaw <sbranchaw@gmail.com>)
Responses Re: What to do when dynamic shared memory control segment is corrupt
List pgsql-general
Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
>> Hm ... were these installations built with --enable-cassert?  If not,
>> an abort trap seems pretty odd.

> The packages are installed directly from the yum repos for RHEL. I'm not
> aware that --enable-cassert is being used, and we're certainly not
> installing from source.

OK, I'm pretty sure nobody builds production RPMs with --enable-cassert.
But your extensions (as listed below) don't include any C++ code, so
that still leaves us wondering where the abort trap came from.  A stack
trace would almost certainly help clear that up.
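
If you haven't pulled a backtrace out of a core file before, something
along these lines usually does it.  This is only a sketch: the binary
path and the debuginfo package name are guesses based on the PGDG 9.6
RPMs on RHEL, so adjust them to whatever you actually have installed.

    # install debugging symbols so the trace shows function names
    sudo debuginfo-install postgresql96-server

    # open the core against the postgres executable and dump the stack
    gdb /usr/pgsql-9.6/bin/postgres /path/to/core
    (gdb) bt full

The same commands work whether the core came from a backend or from the
postmaster itself.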

>> Those "incomplete data" messages are quite unexpected and disturbing.

> We're using the stock initd script from the yum repo, but I dug into this
> issue, and it looks like we're passing the path to the postmaster.pid as
> the $pidfile variable in our sysconfig file, meaning the initd script is
> managing the postmaster.pid file, and specifically is overwriting it with a
> single line containing just the pid. I'm not sure why it's set up like
> this, and I'm thinking we should change it, but it seems harmless and
> unrelated to the crash. In particular, manual initd actions such as stop,
> start, restart, and status all work fine.

This is bad; a normal postmaster.pid file contains half a dozen lines
besides the PID proper.  You might get away with this for now, but it'll
break pg_ctl as of v10 or so, and might confuse other external tools
sooner than that.  Still, it doesn't seem related to your crash problem.
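
For reference, a healthy postmaster.pid on a pre-v10 server looks
roughly like this (the values are made up, and the annotations on the
right are mine, not part of the file):

    12345                       postmaster PID
    /var/lib/pgsql/9.6/data     data directory
    1529379612                  start time (Unix epoch)
    5432                        port
    /var/run/postgresql         Unix-socket directory
    *                           first listen_addresses entry
      5432001  58982401         shared memory key

If the init script truncates that down to just the first line, anything
that reads the later lines will be confused.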

>> No, that looks like fairly typical crash recovery to me: corrupt shared
>> memory contents are expected and recovered from after a crash.

> That's reassuring. But if it's safe for us to immediately start the server
> back up, why did Postgres not automatically start the server up like it did
> the first time?

Yeah, I'd like to know that too.  The complaint about corrupt shared
memory may be an unrelated red herring, or it might be a separate
effect of whatever the primary failure was ... but I think it was likely
not the direct cause of the failure to restart.  At this point, though,
we've got no real evidence as to what that direct cause was.

> At any rate, if it's safe, we can write a script to detect this failure
> mode and automatically restart, although it would be less error-prone if
> Postgres restarted automatically.

I realize that you're most focused on minimizing downtime, but from my
perspective it'd be good to worry about collecting evidence as to
what happened exactly.  Capturing core files is a good start --- and
don't forget the possibility that there's more than one.  A plausible
guess as to why the system didn't restart is that the postmaster crashed
too, so we'd need to see its core to figure out why.
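
On a stock RHEL box there are a couple of things that commonly prevent
cores from being written, so it's worth checking them up front.  The
paths below assume the usual RHEL/PGDG locations; adjust to taste.

    # the postmaster inherits its core-size limit from whatever starts it,
    # so make sure the limit isn't zero --- e.g. add this to the init
    # script (or the postgres user's limits) before the daemon is launched
    ulimit -c unlimited

    # see where the kernel writes core files; on RHEL this is often a pipe
    # to abrt, in which case look under /var/spool/abrt instead
    cat /proc/sys/kernel/core_pattern

    # by default, cores from backends land in the data directory
    find /var/lib/pgsql -name 'core*' 2>/dev/null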

Anyway, I would not be afraid to try restarting the postmaster manually
if it died.  Maybe don't do that repeatedly without human intervention;
but PG is pretty robust against crashes.  We developers crash it all the
time, and we don't lose data.
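
If you do restart it by hand, do it the same way the machine normally
does so the environment matches.  The service name and paths here are
guesses based on the PGDG packaging; substitute your own.

    # via the init script
    sudo service postgresql-9.6 start

    # or directly, as the postgres user (-w waits for startup to finish)
    /usr/pgsql-9.6/bin/pg_ctl -w -D /var/lib/pgsql/9.6/data start

Either way, watch the server log afterwards to confirm that crash
recovery completed cleanly.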

            regards, tom lane

