Re: What to do when dynamic shared memory control segment is corrupt - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: What to do when dynamic shared memory control segment is corrupt |
Date | |
Msg-id | 13596.1529379612@sss.pgh.pa.us |
In response to | Re: What to do when dynamic shared memory control segment is corrupt (Sherrylyn Branchaw <sbranchaw@gmail.com>) |
Responses | Re: What to do when dynamic shared memory control segment is corrupt |
List | pgsql-general |
Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
>> Hm ... were these installations built with --enable-cassert? If not,
>> an abort trap seems pretty odd.

> The packages are installed directly from the yum repos for RHEL. I'm not
> aware that --enable-cassert is being used, and we're certainly not
> installing from source.

OK, I'm pretty sure nobody builds production RPMs with --enable-cassert.
But your extensions (as listed below) don't include any C++ code, so that
still leaves us wondering where the abort trap came from. A stack trace
would almost certainly help clear that up.

>> Those "incomplete data" messages are quite unexpected and disturbing.

> We're using the stock initd script from the yum repo, but I dug into this
> issue, and it looks like we're passing the path to the postmaster.pid as
> the $pidfile variable in our sysconfig file, meaning the initd script is
> managing the postmaster.pid file, and specifically is overwriting it with
> a single line containing just the pid. I'm not sure why it's set up like
> this, and I'm thinking we should change it, but it seems harmless and
> unrelated to the crash. In particular, manual initd actions such as stop,
> start, restart, and status all work fine.

This is bad; a normal postmaster.pid file contains half a dozen lines
besides the PID proper. You might get away with this for now, but it'll
break pg_ctl as of v10 or so, and might confuse other external tools
sooner than that. Still, it doesn't seem related to your crash problem.

>> No, that looks like fairly typical crash recovery to me: corrupt shared
>> memory contents are expected and recovered from after a crash.

> That's reassuring. But if it's safe for us to immediately start the
> server back up, why did Postgres not automatically start the server up
> like it did the first time?

Yeah, I'd like to know that too. The complaint about corrupt shared memory
may be just an unrelated red herring, or it might be a separate effect of
whatever the primary failure was ... but I think it was likely not the
direct cause of the failure-to-restart. But we've got no real evidence as
to what that direct cause was.

> At any rate, if it's safe, we can write a script to detect this failure
> mode and automatically restart, although it would be less error-prone if
> Postgres restarted automatically.

I realize that you're most focused on minimizing downtime, but from my
perspective it'd be good to worry about collecting evidence as to what
happened exactly. Capturing core files is a good start --- and don't
forget the possibility that there's more than one. A plausible guess as to
why the system didn't restart is that the postmaster crashed too, so we'd
need to see its core to figure out why.

Anyway, I would not be afraid to try restarting the postmaster manually if
it died. Maybe don't do that repeatedly without human intervention; but PG
is pretty robust against crashes. We developers crash it all the time, and
we don't lose data.

			regards, tom lane
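For the stack-trace suggestion above, a minimal sketch of pulling backtraces out of any core files left after a crash. The postgres binary path and the core file location are assumptions, not from the thread; it also assumes gdb is installed (server debuginfo packages make the trace far more useful).

```python
#!/usr/bin/env python3
"""Sketch: print a backtrace for each core file found near the data directory.

Assumed paths below are illustrative; adjust for the actual installation.
"""
import glob
import subprocess

POSTGRES_BIN = "/usr/pgsql-9.6/bin/postgres"     # assumed server binary path
CORE_PATTERN = "/var/lib/pgsql/9.6/data/core*"   # assumed core file location

for core in glob.glob(CORE_PATTERN):
    print(f"=== backtrace for {core} ===")
    # gdb -batch runs the given commands and exits; 'bt' prints the stack trace
    subprocess.run(["gdb", "-batch", "-ex", "bt", POSTGRES_BIN, core])
```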
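On the clobbered postmaster.pid point, a small sketch of detecting the single-line file the init script leaves behind. The file path and the line-count threshold are assumptions; the exact number of lines in a healthy postmaster.pid varies by PostgreSQL version.

```python
#!/usr/bin/env python3
"""Sketch: warn if postmaster.pid has been reduced to a single line."""
import sys

PIDFILE = "/var/lib/pgsql/9.6/data/postmaster.pid"  # assumed data directory
MIN_LINES = 2  # a healthy file has several lines; a clobbered one has only the PID

def main() -> int:
    try:
        with open(PIDFILE) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        print("postmaster.pid not found; server may not be running")
        return 1
    if len(lines) < MIN_LINES:
        print(f"WARNING: {PIDFILE} has only {len(lines)} line(s); "
              "it may have been overwritten by the init script")
        return 2
    print(f"{PIDFILE} looks normal ({len(lines)} lines)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```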
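And a rough sketch of the detect-and-restart script mentioned in the message, keeping the caveat about not retrying unattended. The pg_ctl path, data directory, and single-retry limit are assumptions to adjust for the actual installation; it relies on pg_ctl status exiting zero only when a postmaster is running.

```python
#!/usr/bin/env python3
"""Sketch: restart the postmaster once if it is down, then hand off to a human."""
import subprocess
import sys
import time

PG_CTL = "/usr/pgsql-9.6/bin/pg_ctl"   # assumed pg_ctl path
PGDATA = "/var/lib/pgsql/9.6/data"     # assumed data directory
MAX_RESTARTS = 1                        # don't loop without human intervention

def postmaster_running() -> bool:
    # pg_ctl status exits 0 when a postmaster is running in PGDATA
    result = subprocess.run([PG_CTL, "status", "-D", PGDATA],
                            capture_output=True, text=True)
    return result.returncode == 0

def main() -> int:
    restarts = 0
    while not postmaster_running():
        if restarts >= MAX_RESTARTS:
            print("postmaster still down after restart attempt; alerting a human")
            return 1
        print("postmaster not running; attempting a single restart")
        subprocess.run([PG_CTL, "start", "-D", PGDATA, "-w"])
        restarts += 1
        time.sleep(10)  # give crash recovery a moment before re-checking
    print("postmaster is running")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```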