Re: What to do when dynamic shared memory control segment is corrupt - Mailing list pgsql-general

From Sherrylyn Branchaw
Subject Re: What to do when dynamic shared memory control segment is corrupt
Date
Msg-id CAB_myF5EaCVsBQ24rb4gLeLSau+Gv0otY9Y6nk5xnpw5LvYv7Q@mail.gmail.com
In response to Re: What to do when dynamic shared memory control segment is corrupt  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: What to do when dynamic shared memory control segment is corrupt  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
> Hm ... were these installations built with --enable-cassert?  If not,
> an abort trap seems pretty odd.

The packages are installed directly from the yum repos for RHEL. I'm not aware that --enable-cassert is being used, and we're certainly not installing from source.
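For what it's worth, here's how I plan to double-check that; a quick sketch, assuming the PGDG packages put pg_config and psql on the PATH:

    # Show the flags the binaries were configured with; --enable-cassert
    # would appear here if assertions were compiled in.
    pg_config --configure

    # Or ask a running server directly; this read-only setting reports "on"
    # only in assert-enabled builds.
    psql -Atc 'SHOW debug_assertions;'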

> Those "incomplete data" messages are quite unexpected and disturbing.
> I don't know of any mechanism within Postgres proper that would result
> in corruption of the postmaster.pid file that way.  (I wondered briefly
> if trying to start a conflicting postmaster would result in such a
> situation, but experimentation here says not.)  I'm suspicious that
> this may indicate a bug or unwarranted assumption in whatever scripts
> you use to start/stop the postmaster.  Whether that is at all related
> to your crash issue is hard to say, but it bears looking into.

We're using the stock init.d script from the yum repo, but I dug into this issue, and it looks like we're passing the path to postmaster.pid as the $pidfile variable in our sysconfig file. That means the init.d script is managing the postmaster.pid file, and specifically is overwriting it with a single line containing just the pid. I'm not sure why it's set up like this, and I'm thinking we should change it, but it seems harmless and unrelated to the crash. In particular, manual init.d actions such as stop, start, restart, and status all work fine.
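To illustrate what I mean (a sketch of the relevant lines, not a verbatim copy of our file; the version in the paths is assumed):

    # /etc/sysconfig/pgsql/postgresql-9.6  -- sketch, version is assumed
    PGDATA=/var/lib/pgsql/9.6/data

    # Problematic: pointing the init script's pid file at Postgres's own
    # postmaster.pid makes the script rewrite a file the postmaster expects
    # to own and format itself (it normally holds several lines, not one).
    pidfile=/var/lib/pgsql/9.6/data/postmaster.pid

    # Presumably safer: give the init script its own pid file, e.g.
    # pidfile=/var/run/postmaster-9.6.pid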

> No, that looks like fairly typical crash recovery to me: corrupt shared
> memory contents are expected and recovered from after a crash. 

That's reassuring. But if it's safe for us to immediately start the server back up, why did Postgres not start itself back up automatically, like it did the first time? I was assuming it was due to the presence of the corrupt memory segment, since that was the only difference in the logs, although I could be wrong. Automatic restart would have saved us a great deal of downtime: in the first case we had total recovery within 30 seconds, while in the second case we had many minutes of downtime while someone got paged, troubleshot the issue, and eventually decided to try starting the database back up.

At any rate, if it's safe, we can write a script to detect this failure mode and restart the server, although it would be less error-prone if Postgres did the restart itself.
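Roughly what I have in mind, as a sketch; the service name, log handling, and alert address are assumptions for our environment:

    #!/bin/bash
    # Watchdog sketch: if the postmaster is down and no server process is
    # actually running, start it back up and let the on-call person know.
    SERVICE=postgresql-9.6          # assumed service name
    ONCALL=oncall@example.com       # assumed alert address

    if ! pg_isready -q; then
        # Only act if no postmaster process is running at all.
        if ! pgrep -x postgres >/dev/null && ! pgrep -x postmaster >/dev/null; then
            if service "$SERVICE" start; then
                echo "postgres was down; watchdog restarted it" \
                    | mail -s "postgres watchdog: restarted" "$ONCALL"
            else
                echo "postgres is down and automatic restart failed" \
                    | mail -s "postgres watchdog: restart FAILED" "$ONCALL"
            fi
        fi
    fi

We'd run something like that from cron every minute or two.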

> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.

No PANICs. The log lines I pasted were the only ones that looked relevant in the Postgres logs. I can try to dig through the application logs, but I was planning to wait until the next time this happens, since we should have core dumps fixed by then, and that might make things easier.
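By "core dumps fixed" I mean roughly the following; a sketch, and the details assume a sysvinit RHEL setup (a systemd install would use LimitCORE= in the unit file instead):

    # Let the postmaster write core files. This goes in the init script's
    # environment (or via DAEMON_COREFILE_LIMIT in the sysconfig file if the
    # script starts the server through /etc/init.d/functions).
    ulimit -c unlimited

    # Write cores somewhere predictable with enough free space.
    sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p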

> What extensions are installed, if any?

In the first database, the one without the corrupt memory segment and that restarted automatically: plpgsql and postgres_fdw.

In the second database, the one where the memory segment got corrupted and that didn't restart automatically: dblink, hstore, pg_trgm, pgstattuple, plpgsql, and tablefunc.

I forgot to mention that the queries that got killed were innocuous-looking SELECTs that completed successfully for me in less than a second when I ran them manually. In other words, the problem was not reproducible.

Sherrylyn
