Thread: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
I got a crash on a production server this morning following during very heavy load. This is postgres 9.1.3 on Linux (going to 9.1.4 asap). I didn't catch a core dump it doesn't look like one would help anyways. Server is virtualized quad on vmware. Here is the log I have: 2012-06-25 09:07:45 CDT [postgres@ysanalysis]: STATEMENT: CREATE PROCEDURAL LANGUAGE plpgsql; 2012-06-25 09:07:45 CDT [postgres@ysanalysis]: ERROR: language "plpgsql" already exists 2012-06-25 09:07:45 CDT [postgres@ysanalysis]: STATEMENT: CREATE PROCEDURAL LANGUAGE plpgsql; 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: could not send data to client: Broken pipe 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: unexpected EOF on client connection 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: process 22821 releasing ProcSignal slot 32, but it contains 0 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: failed to find proc 0x7f48617e2ab0 in ProcArray 2012-06-25 09:08:20 CDT [postgres@ysanalysis]: ERROR: language "plpgsql" already exists 2012-06-25 09:08:20 CDT [postgres@ysanalysis]: STATEMENT: CREATE PROCEDURAL LANGUAGE plpgsql; 2012-06-25 09:08:21 CDT [postgres@ysanalysis]: ERROR: language "plpgsql" already exists 2012-06-25 09:08:21 CDT [postgres@ysanalysis]: STATEMENT: CREATE PROCEDURAL LANGUAGE plpgsql; 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: latch already owned 2012-06-25 09:08:24 CDT [@]: LOG: server process (PID 23323) exited with exit code 1 2012-06-25 09:08:24 CDT [@]: LOG: terminating any other active server processes 2012-06-25 09:08:24 CDT [postgres@postgres]: WARNING: terminating connection because of crash of another server process 2012-06-25 09:08:24 CDT [postgres@postgres]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. The plpgsql errors are expected. The background here is I have four threads constantly loading small databases in bash and there was one long running ETL process in the background as well. merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes: > 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: could not > send data to client: Broken pipe > 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: unexpected > EOF on client connection > 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: process 22821 > releasing ProcSignal slot 32, but it contains 0 > 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: failed to > find proc 0x7f48617e2ab0 in ProcArray > [and a bit later] > 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: latch already owned I think what we're looking at here is a screw-up in the process shutdown sequence. Perhaps caused by bad recovery from an attempt to send an error message to the already-disconnected client; but that's just speculation, and it's hard to see how to get more info without a core dump. I wonder whether we shouldn't promote some or all of these three error cases to PANIC, as they certainly suggest shared-memory corruption. And if it did panic, we could hope to get a core dump for debugging purposes. regards, tom lane
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: =A0could not >> send data to client: Broken pipe >> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0unexpected >> EOF on client connection >> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0process 22821 >> releasing ProcSignal slot 32, but it contains 0 >> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0failed to >> find proc 0x7f48617e2ab0 in ProcArray >> [and a bit later] >> 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: =A0latch alrea= dy owned > > I think what we're looking at here is a screw-up in the process shutdown > sequence. =A0Perhaps caused by bad recovery from an attempt to send an > error message to the already-disconnected client; but that's just > speculation, and it's hard to see how to get more info without a core > dump. > > I wonder whether we shouldn't promote some or all of these three error > cases to PANIC, as they certainly suggest shared-memory corruption. > And if it did panic, we could hope to get a core dump for debugging > purposes. Ok, I'll look into reproducing the crash conditions. Unfortunately this is a critical server and it crashed during a time sensitive process. I can schedule a maintenance window though but it will have to wait a bit. merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
On Mon, Jun 25, 2012 at 10:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Merlin Moncure <mmoncure@gmail.com> writes: >>> 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: =A0could not >>> send data to client: Broken pipe >>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0unexpected >>> EOF on client connection >>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0process 22821 >>> releasing ProcSignal slot 32, but it contains 0 >>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0failed to >>> find proc 0x7f48617e2ab0 in ProcArray >>> [and a bit later] >>> 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: =A0latch alre= ady owned >> >> I think what we're looking at here is a screw-up in the process shutdown >> sequence. =A0Perhaps caused by bad recovery from an attempt to send an >> error message to the already-disconnected client; but that's just >> speculation, and it's hard to see how to get more info without a core >> dump. >> >> I wonder whether we shouldn't promote some or all of these three error >> cases to PANIC, as they certainly suggest shared-memory corruption. >> And if it did panic, we could hope to get a core dump for debugging >> purposes. > > Ok, I'll look into reproducing the crash conditions. =A0Unfortunately > this is a critical server and it crashed during a time sensitive > process. I can schedule a maintenance window though but it will have > to wait a bit. > > merlin I have some good news: this was reproduce and i I believe it to be operator invoked: 2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: ERROR: index "idx_lease_expiremonth2" does not exist 2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: STATEMENT: DROP INDEX idx_Lease_ExpireMonth2; 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: unexpected EOF on client connection 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: process 10340 releasing ProcSignal slot 5, but it contains 0 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: failed to find proc 0x7f48617e6310 in ProcArray 2012-06-26 09:16:48 CDT [rms@ysanalysis]: FATAL: latch already owned 2012-06-26 09:16:48 CDT [@]: LOG: server process (PID 10928) exited with exit code 1 2012-06-26 09:16:48 CDT [@]: LOG: terminating any other active server proc= esses 2012-06-26 09:16:48 CDT [postgres@postgres]: WARNING: terminating connection because of crash of another server process ...investigating... merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
On Tue, Jun 26, 2012 at 9:19 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Ok, I'll look into reproducing the crash conditions. =A0Unfortunately >> this is a critical server and it crashed during a time sensitive >> process. I can schedule a maintenance window though but it will have >> to wait a bit. I suspect (but haven't had time to prove and may not for several days -- unfortunately going on vacation momentarily) that this might be caused by pl/sh. In particular, we have a routine that was inadvertently applied to the database in with windows cr/lf instead of the normal linux newline. This is an older version of plgplsh (1.3) and is maybe minus some relevant bug fixes. Just wanted to give a heads up so that you didn't waste time investigating. I will follow up on this eventually. merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes: > I suspect (but haven't had time to prove and may not for several days > -- unfortunately going on vacation momentarily) that this might be > caused by pl/sh. Hm. The reported symptoms might be explainable if something had caused multiple threads to become active within the backend process --- then it would be plausible for it to try to do proc_exit cleanup twice. Which would explain the first two errors, though I'm not sure how that leads to failing to disown the process latch, as the third error suggests must have happened. But I don't know enough about pl/sh to know if it could cause threading activation. > In particular, we have a routine that was > inadvertently applied to the database in with windows cr/lf instead of > the normal linux newline. This doesn't seem real promising as an explanation ... regards, tom lane
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> I suspect (but haven't had time to prove and may not for several days >> -- unfortunately going on vacation momentarily) that this might be >> caused by pl/sh. > > Hm. =A0The reported symptoms might be explainable if something had caused > multiple threads to become active within the backend process --- then > it would be plausible for it to try to do proc_exit cleanup twice. > Which would explain the first two errors, though I'm not sure how that > leads to failing to disown the process latch, as the third error > suggests must have happened. =A0But I don't know enough about pl/sh to > know if it could cause threading activation. > >> In particular, we have a routine that was >> inadvertently applied to the database in with windows cr/lf instead of >> the normal linux newline. > > This doesn't seem real promising as an explanation ... right -- just a suspicion. maybe the relevant point was that it immediately failed. operator invoking the busted routine (which I had to fix) and the crash were highly correlated, although it does not always crash. yesterday was very heavy load and today not so much. merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Merlin Moncure
Date:
On Tue, Jun 26, 2012 at 12:09 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Merlin Moncure <mmoncure@gmail.com> writes: >>> I suspect (but haven't had time to prove and may not for several days >>> -- unfortunately going on vacation momentarily) that this might be >>> caused by pl/sh. >> >> Hm. The reported symptoms might be explainable if something had caused >> multiple threads to become active within the backend process --- then >> it would be plausible for it to try to do proc_exit cleanup twice. >> Which would explain the first two errors, though I'm not sure how that >> leads to failing to disown the process latch, as the third error >> suggests must have happened. But I don't know enough about pl/sh to >> know if it could cause threading activation. >> >>> In particular, we have a routine that was >>> inadvertently applied to the database in with windows cr/lf instead of >>> the normal linux newline. >> >> This doesn't seem real promising as an explanation ... > > right -- just a suspicion. maybe the relevant point was that it > immediately failed. operator invoking the busted routine (which I had > to fix) and the crash were highly correlated, although it does not > always crash. yesterday was very heavy load and today not so much. Follow up on this. It is pl/sh and it is a newline issue: one of the developers is using a tool (I think pgadmin?) that is sticking \r characters at the end of every line which is throwing off pl/sh's shebang parsing. The issuing query gets an error along the lines of 'could not exec' and the server goes belly up if there is significant concurrent load when that's issued. This is an out of date pl/sh, so I'm going to upgrade it and try and reproduce. If I still can, I'll supply a test case. merlin
Re: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"
From
Peter Eisentraut
Date:
On Thu, 2012-08-09 at 16:26 -0500, Merlin Moncure wrote: > Follow up on this. It is pl/sh and it is a newline issue: one of the > developers is using a tool (I think pgadmin?) that is sticking \r > characters at the end of every line which is throwing off pl/sh's > shebang parsing. The issuing query gets an error along the lines of > 'could not exec' and the server goes belly up if there is significant > concurrent load when that's issued. This is an out of date pl/sh, so > I'm going to upgrade it and try and reproduce. If I still can, I'll > supply a test case. I had received an independent report of this cr/lf issue and fixed it now. But that doesn't explain why the server would crash.