Thread: server crash with "process 22821 releasing ProcSignal slot 32, but it contains 0"

I got a crash on a production server this morning following during
very heavy load.   This is postgres 9.1.3 on Linux (going to 9.1.4
asap).  I didn't catch a core dump it doesn't look like one would help
anyways.  Server is virtualized quad on vmware.  Here is the log I
have:

2012-06-25 09:07:45 CDT [postgres@ysanalysis]: STATEMENT:  CREATE
PROCEDURAL LANGUAGE plpgsql;
2012-06-25 09:07:45 CDT [postgres@ysanalysis]: ERROR:  language
"plpgsql" already exists
2012-06-25 09:07:45 CDT [postgres@ysanalysis]: STATEMENT:  CREATE
PROCEDURAL LANGUAGE plpgsql;
2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG:  could not
send data to client: Broken pipe
2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  unexpected
EOF on client connection
2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  process 22821
releasing ProcSignal slot 32, but it contains 0
2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  failed to
find proc 0x7f48617e2ab0 in ProcArray
2012-06-25 09:08:20 CDT [postgres@ysanalysis]: ERROR:  language
"plpgsql" already exists
2012-06-25 09:08:20 CDT [postgres@ysanalysis]: STATEMENT:  CREATE
PROCEDURAL LANGUAGE plpgsql;
2012-06-25 09:08:21 CDT [postgres@ysanalysis]: ERROR:  language
"plpgsql" already exists
2012-06-25 09:08:21 CDT [postgres@ysanalysis]: STATEMENT:  CREATE
PROCEDURAL LANGUAGE plpgsql;
2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL:  latch already owned
2012-06-25 09:08:24 CDT [@]: LOG:  server process (PID 23323) exited
with exit code 1
2012-06-25 09:08:24 CDT [@]: LOG:  terminating any other active server processes
2012-06-25 09:08:24 CDT [postgres@postgres]: WARNING:  terminating
connection because of crash of another server process
2012-06-25 09:08:24 CDT [postgres@postgres]: DETAIL:  The postmaster
has commanded this server process to roll back the current transaction
and exit, because another server process exited abnormally and
possibly corrupted shared memory.

The plpgsql errors are expected.  The background here is I have four
threads constantly loading small databases in bash and there was one
long running ETL process in the background as well.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG:  could not
> send data to client: Broken pipe
> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  unexpected
> EOF on client connection
> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  process 22821
> releasing ProcSignal slot 32, but it contains 0
> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  failed to
> find proc 0x7f48617e2ab0 in ProcArray
> [and a bit later]
> 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL:  latch already owned

I think what we're looking at here is a screw-up in the process shutdown
sequence.  Perhaps caused by bad recovery from an attempt to send an
error message to the already-disconnected client; but that's just
speculation, and it's hard to see how to get more info without a core
dump.

I wonder whether we shouldn't promote some or all of these three error
cases to PANIC, as they certainly suggest shared-memory corruption.
And if it did panic, we could hope to get a core dump for debugging
purposes.

            regards, tom lane
On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: =A0could not
>> send data to client: Broken pipe
>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0unexpected
>> EOF on client connection
>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0process 22821
>> releasing ProcSignal slot 32, but it contains 0
>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0failed to
>> find proc 0x7f48617e2ab0 in ProcArray
>> [and a bit later]
>> 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: =A0latch alrea=
dy owned
>
> I think what we're looking at here is a screw-up in the process shutdown
> sequence. =A0Perhaps caused by bad recovery from an attempt to send an
> error message to the already-disconnected client; but that's just
> speculation, and it's hard to see how to get more info without a core
> dump.
>
> I wonder whether we shouldn't promote some or all of these three error
> cases to PANIC, as they certainly suggest shared-memory corruption.
> And if it did panic, we could hope to get a core dump for debugging
> purposes.

Ok, I'll look into reproducing the crash conditions.  Unfortunately
this is a critical server and it crashed during a time sensitive
process. I can schedule a maintenance window though but it will have
to wait a bit.

merlin
On Mon, Jun 25, 2012 at 10:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: =A0could not
>>> send data to client: Broken pipe
>>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0unexpected
>>> EOF on client connection
>>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0process 22821
>>> releasing ProcSignal slot 32, but it contains 0
>>> 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: =A0failed to
>>> find proc 0x7f48617e2ab0 in ProcArray
>>> [and a bit later]
>>> 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: =A0latch alre=
ady owned
>>
>> I think what we're looking at here is a screw-up in the process shutdown
>> sequence. =A0Perhaps caused by bad recovery from an attempt to send an
>> error message to the already-disconnected client; but that's just
>> speculation, and it's hard to see how to get more info without a core
>> dump.
>>
>> I wonder whether we shouldn't promote some or all of these three error
>> cases to PANIC, as they certainly suggest shared-memory corruption.
>> And if it did panic, we could hope to get a core dump for debugging
>> purposes.
>
> Ok, I'll look into reproducing the crash conditions. =A0Unfortunately
> this is a critical server and it crashed during a time sensitive
> process. I can schedule a maintenance window though but it will have
> to wait a bit.
>
> merlin

I have some good news: this was reproduce and i I believe it to be
operator invoked:


2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: ERROR:  index
"idx_lease_expiremonth2" does not exist
2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: STATEMENT:  DROP
INDEX idx_Lease_ExpireMonth2;
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  unexpected EOF on
client connection
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  process 10340
releasing ProcSignal slot 5, but it contains 0
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  failed to find proc
0x7f48617e6310 in ProcArray
2012-06-26 09:16:48 CDT [rms@ysanalysis]: FATAL:  latch already owned
2012-06-26 09:16:48 CDT [@]: LOG:  server process (PID 10928) exited
with exit code 1
2012-06-26 09:16:48 CDT [@]: LOG:  terminating any other active server proc=
esses
2012-06-26 09:16:48 CDT [postgres@postgres]: WARNING:  terminating
connection because of crash of another server process

...investigating...

merlin
On Tue, Jun 26, 2012 at 9:19 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Ok, I'll look into reproducing the crash conditions. =A0Unfortunately
>> this is a critical server and it crashed during a time sensitive
>> process. I can schedule a maintenance window though but it will have
>> to wait a bit.

I suspect (but haven't had time to prove and may not for several days
-- unfortunately going on vacation momentarily) that this might be
caused by pl/sh.  In particular, we have a routine that was
inadvertently applied to the database in with windows cr/lf instead of
the normal linux newline.  This is an older version of plgplsh (1.3)
and is maybe minus some relevant bug fixes.  Just wanted to give a
heads up so that you didn't waste time investigating.  I will follow
up on this eventually.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> I suspect (but haven't had time to prove and may not for several days
> -- unfortunately going on vacation momentarily) that this might be
> caused by pl/sh.

Hm.  The reported symptoms might be explainable if something had caused
multiple threads to become active within the backend process --- then
it would be plausible for it to try to do proc_exit cleanup twice.
Which would explain the first two errors, though I'm not sure how that
leads to failing to disown the process latch, as the third error
suggests must have happened.  But I don't know enough about pl/sh to
know if it could cause threading activation.

> In particular, we have a routine that was
> inadvertently applied to the database in with windows cr/lf instead of
> the normal linux newline.

This doesn't seem real promising as an explanation ...

            regards, tom lane
On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> I suspect (but haven't had time to prove and may not for several days
>> -- unfortunately going on vacation momentarily) that this might be
>> caused by pl/sh.
>
> Hm. =A0The reported symptoms might be explainable if something had caused
> multiple threads to become active within the backend process --- then
> it would be plausible for it to try to do proc_exit cleanup twice.
> Which would explain the first two errors, though I'm not sure how that
> leads to failing to disown the process latch, as the third error
> suggests must have happened. =A0But I don't know enough about pl/sh to
> know if it could cause threading activation.
>
>> In particular, we have a routine that was
>> inadvertently applied to the database in with windows cr/lf instead of
>> the normal linux newline.
>
> This doesn't seem real promising as an explanation ...

right -- just a suspicion.  maybe the relevant point was that it
immediately failed.  operator invoking the busted routine (which I had
to fix) and the crash were highly correlated, although it does not
always crash.  yesterday  was very heavy load and today not so much.

merlin
On Tue, Jun 26, 2012 at 12:09 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> I suspect (but haven't had time to prove and may not for several days
>>> -- unfortunately going on vacation momentarily) that this might be
>>> caused by pl/sh.
>>
>> Hm.  The reported symptoms might be explainable if something had caused
>> multiple threads to become active within the backend process --- then
>> it would be plausible for it to try to do proc_exit cleanup twice.
>> Which would explain the first two errors, though I'm not sure how that
>> leads to failing to disown the process latch, as the third error
>> suggests must have happened.  But I don't know enough about pl/sh to
>> know if it could cause threading activation.
>>
>>> In particular, we have a routine that was
>>> inadvertently applied to the database in with windows cr/lf instead of
>>> the normal linux newline.
>>
>> This doesn't seem real promising as an explanation ...
>
> right -- just a suspicion.  maybe the relevant point was that it
> immediately failed.  operator invoking the busted routine (which I had
> to fix) and the crash were highly correlated, although it does not
> always crash.  yesterday  was very heavy load and today not so much.

Follow up on this.  It is pl/sh and it is a newline issue: one of the
developers is using a tool (I think pgadmin?) that is sticking \r
characters at the end of every line which is throwing off pl/sh's
shebang parsing.  The issuing query gets an error along the lines of
'could not exec' and the server goes belly up if there is significant
concurrent load when that's issued.  This is an out of date pl/sh, so
I'm going to upgrade it and try and reproduce. If I still can, I'll
supply a test case.

merlin
On Thu, 2012-08-09 at 16:26 -0500, Merlin Moncure wrote:
> Follow up on this.  It is pl/sh and it is a newline issue: one of the
> developers is using a tool (I think pgadmin?) that is sticking \r
> characters at the end of every line which is throwing off pl/sh's
> shebang parsing.  The issuing query gets an error along the lines of
> 'could not exec' and the server goes belly up if there is significant
> concurrent load when that's issued.  This is an out of date pl/sh, so
> I'm going to upgrade it and try and reproduce. If I still can, I'll
> supply a test case.

I had received an independent report of this cr/lf issue and fixed it
now.  But that doesn't explain why the server would crash.