Thread: BUG #13643: Should a process dying bring postgresql down, or not?

BUG #13643: Should a process dying bring postgresql down, or not?

From
amir.rohan@mail.com
Date:
The following bug has been logged on the website:

Bug reference:      13643
Logged by:          Amir Rohan
Email address:      amir.rohan@mail.com
PostgreSQL version: 9.5alpha2
Operating system:   Linux
Description:

Seen on my box:

postgres     2181  0.0  0.1 134468  9504 pts/0    T    03:34   0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
postgres     2183  0.0  0.0 134576  4168 ?        Ss   03:34   0:00 postgres: checkpointer process
postgres     2184  0.0  0.0 134604  2844 ?        Ss   03:34   0:00 postgres: writer process
postgres     2185  0.0  0.0 134468  2780 ?        Ss   03:34   0:00 postgres: wal writer process
postgres     2186  0.0  0.0      0     0 ?        Zs   03:34   0:00 [postgres] <defunct>         <<<<<<<<<<<<<<< dead process
postgres     2187  0.0  0.0 127300  2204 ?        Ss   03:34   0:00 postgres: stats collector process
postgres     2193  0.0  0.0 118164  2696 pts/0    T    03:34   0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
postgres     2194  0.0  0.0 134916  6016 ?        Ss   03:34   0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"

Not sure if this is a real problem or not, but it was my understanding that
pg panics when a subprocess dies, for safety reasons.

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
Alvaro Herrera
Date:
amir.rohan@mail.com wrote:

> postgres     2181  0.0  0.1 134468  9504 pts/0    T    03:34   0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
> postgres     2183  0.0  0.0 134576  4168 ?        Ss   03:34   0:00 postgres: checkpointer process
> postgres     2184  0.0  0.0 134604  2844 ?        Ss   03:34   0:00 postgres: writer process
> postgres     2185  0.0  0.0 134468  2780 ?        Ss   03:34   0:00 postgres: wal writer process
> postgres     2186  0.0  0.0      0     0 ?        Zs   03:34   0:00 [postgres] <defunct>         <<<<<<<<<<<<<<< dead process
> postgres     2187  0.0  0.0 127300  2204 ?        Ss   03:34   0:00 postgres: stats collector process
> postgres     2193  0.0  0.0 118164  2696 pts/0    T    03:34   0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
> postgres     2194  0.0  0.0 134916  6016 ?        Ss   03:34   0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>
> Not sure if this is a real problem or not, but it was my understanding that
> pg panics when a subprocess dies, for safety reasons.

A zombie process is a process that died and hasn't been collected by its
parent process.  In this case, postmaster is stopped ("T" above), so it
cannot call wait() to collect the dead process.  Once you signal
postmaster to run again, it will either discover that the process died
cleanly (and clean up state and all is well), or that it died uncleanly
(in which case it will cause all other processes to stop).

That postmaster is in STOPped mode is the issue here.  That doesn't
happen unless you take specific action to do that.
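
As a rough illustration of the zombie state described above (this is not
PostgreSQL code): a minimal Python sketch, assuming Linux with /proc mounted.
The helper names proc_state, wait_for_state, describe_exit and zombie_demo
are made up for this example. It forks a child that exits at once, observes
it in state "Z" while the parent has not yet called wait(), then collects it
and decodes the raw exit status. How postmaster decides what counts as a
clean exit is not modelled here.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process state is one of `wanted`; return True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) in wanted:
            return True
        time.sleep(0.01)
    return False


def describe_exit(status):
    """Turn a raw wait status into a short description."""
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return "unknown"


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Parent: the child is dead but not yet collected, so it shows up as "Z".
    became_zombie = wait_for_state(pid, ("Z",))
    _, status = os.waitpid(pid, 0)  # collect the child; the zombie entry goes away
    return became_zombie, proc_state(pid), describe_exit(status)


if __name__ == "__main__":
    print(zombie_demo())
```

The "Z" entry only lasts until the parent calls wait(); a parent that is
stopped never gets that far, which matches the ps listing in the report.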

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
Amir Rohan
Date:
On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
> amir.rohan@mail.com wrote:
>
>> postgres     2181  0.0  0.1 134468  9504 pts/0    T    03:34   0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>> postgres     2183  0.0  0.0 134576  4168 ?        Ss   03:34   0:00 postgres: checkpointer process
>> postgres     2184  0.0  0.0 134604  2844 ?        Ss   03:34   0:00 postgres: writer process
>> postgres     2185  0.0  0.0 134468  2780 ?        Ss   03:34   0:00 postgres: wal writer process
>> postgres     2186  0.0  0.0      0     0 ?        Zs   03:34   0:00 [postgres] <defunct>         <<<<<<<<<<<<<<< dead process
>> postgres     2187  0.0  0.0 127300  2204 ?        Ss   03:34   0:00 postgres: stats collector process
>> postgres     2193  0.0  0.0 118164  2696 pts/0    T    03:34   0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>> postgres     2194  0.0  0.0 134916  6016 ?        Ss   03:34   0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>
> That postmaster is in STOPped mode is the issue here.  That doesn't
> happen unless you take specific action to do that.
>


I hadn't noticed that.  That looks like I suspended pg_ctl during start,
but with the backup in progress already, it's not clear how I managed
that state. There was no kill -SIGSTOP involved...

After killing some subprocesses at random I do see postgres
restarting the whole group once one goes down, if/once it's
running/unsuspended.
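
The stopped-parent case can be reproduced outside PostgreSQL. A hedged
sketch, assuming Linux with /proc and Python 3; the SUPERVISOR script and the
helper names are hypothetical stand-ins, not postmaster. A supervisor forks a
worker and waits on it. While the supervisor is stopped, a killed worker
stays a zombie; it is only collected once the supervisor receives SIGCONT.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a supervising parent: fork one worker, report its
# pid, then block in waitpid() so the worker is collected as soon as it dies.
SUPERVISOR = """
import os, time
pid = os.fork()
if pid == 0:
    time.sleep(60)
    os._exit(0)
print(pid, flush=True)
os.waitpid(pid, 0)
"""


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process state is one of `wanted`; return True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) in wanted:
            return True
        time.sleep(0.01)
    return False


def stopped_parent_demo():
    sup = subprocess.Popen([sys.executable, "-c", SUPERVISOR],
                           stdout=subprocess.PIPE, text=True)
    try:
        worker = int(sup.stdout.readline())
        os.kill(sup.pid, signal.SIGSTOP)   # suspend the supervisor
        wait_for_state(sup.pid, ("T",))
        os.kill(worker, signal.SIGKILL)    # worker dies while its parent is stopped
        zombie_while_stopped = wait_for_state(worker, ("Z",))
        os.kill(sup.pid, signal.SIGCONT)   # let the supervisor run again
        reaped_after_cont = wait_for_state(worker, (None,))
    finally:
        sup.kill()
        sup.wait()
        sup.stdout.close()
    return zombie_while_stopped, reaped_after_cont


if __name__ == "__main__":
    print(stopped_parent_demo())
```

Unlike postmaster, this supervisor simply exits after collecting its worker;
it does not restart anything.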

Excuse the noise.

Amir

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
Alvaro Herrera
Date:
Amir Rohan wrote:
> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
> > amir.rohan@mail.com wrote:
> >
> >> postgres     2181  0.0  0.1 134468  9504 pts/0    T    03:34   0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
> >> postgres     2183  0.0  0.0 134576  4168 ?        Ss   03:34   0:00 postgres: checkpointer process
> >> postgres     2184  0.0  0.0 134604  2844 ?        Ss   03:34   0:00 postgres: writer process
> >> postgres     2185  0.0  0.0 134468  2780 ?        Ss   03:34   0:00 postgres: wal writer process
> >> postgres     2186  0.0  0.0      0     0 ?        Zs   03:34   0:00 [postgres] <defunct>         <<<<<<<<<<<<<<< dead process
> >> postgres     2187  0.0  0.0 127300  2204 ?        Ss   03:34   0:00 postgres: stats collector process
> >> postgres     2193  0.0  0.0 118164  2696 pts/0    T    03:34   0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
> >> postgres     2194  0.0  0.0 134916  6016 ?        Ss   03:34   0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
> >
> > That postmaster is in STOPped mode is the issue here.  That doesn't
> > happen unless you take specific action to do that.
>
> I hadn't noticed that.  That looks like I suspended pg_ctl during start,
>  but with the backup in progress already, it's not clear how I managed
> that state. There was no kill -SIGSTOP involved...

Suspending a process *is* sending sigstop.  You may not have sent
sigstop explicitly, but the shell would have done it if you suspended
the process.

Since pg_ctl is not normally long-lived, I'm not sure how you ended up
suspending it.
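
For reference, a small sketch of what a stop signal does to the process
state, assuming Linux with /proc and Python 3 (the helper names are
illustrative only). It sends SIGSTOP explicitly; a shell's Ctrl-Z normally
delivers SIGTSTP instead, which by default leaves the process in the same
stopped "T" state. SIGCONT resumes it in either case.

```python
import os
import signal
import subprocess
import sys
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process state is one of `wanted`; return True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) in wanted:
            return True
        time.sleep(0.01)
    return False


def stop_and_continue():
    # A throwaway child that just sleeps; it stands in for any long-lived process.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        stopped = wait_for_state(child.pid, ("T",))            # "T" as in the ps listing
        os.kill(child.pid, signal.SIGCONT)
        resumed = wait_for_state(child.pid, ("R", "S", "D"))   # running or sleeping again
    finally:
        child.kill()
        child.wait()
    return stopped, resumed


if __name__ == "__main__":
    print(stop_and_continue())
```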

> After killing some subprocesses at random I do see postgres
> restarting the whole group once one goes down, if/once it's
> running/unsuspended.

Well, doing things randomly is unlikely to teach you much ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
"Amir Rohan"
Date:
>
>
> Sent: Monday, September 28, 2015 at 12:06 AM
> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> To: "Amir Rohan" <amir.rohan@mail.com>
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: BUG #13643: Should a process dying bring postgresql down, or not?
> Amir Rohan wrote:
>> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
>> > amir.rohan@mail.com wrote:
>> >
>> >> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>> >> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
>> >> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
>> >> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
>> >> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
>> >> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
>> >> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>> >> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>> >
>> > That postmaster is in STOPped mode is the issue here. That doesn't
>> > happen unless you take specific action to do that.
>>
>> I hadn't noticed that. That looks like I suspended pg_ctl during start,
>> but with the backup in progress already, it's not clear how I managed
>> that state. There was no kill -SIGSTOP involved...
>
> Suspending a process *is* sending sigstop. You may not have sent
> sigstop explicitly, but the shell would have done it if you suspended
> the process.
>

I *know*. But as you can see that backup process is already underway.
That means pg_ctl had returned by then, and I had issued the
pg_basebackup command. Since I didn't manually send a SIGSTOP,
and postgres was already detached by then, I don't know how it
could have gotten suspended.

> Since pg_ctl is not normally long-lived, I'm not sure how you ended up
> suspending it.
>

exactly.

>> After killing some subprocesses at random I do see postgres
>> restarting the whole group once one goes down, if/once it's
>> running/unsuspended.
>
> Well, doing things randomly is unlikely to teach you much ...
>

Well, it can teach you which electric socket will
electrocute you when poked with a fork. That's useful data.

Amir
 

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
Amir Rohan
Date:
On 09/28/2015 12:06 AM, Alvaro Herrera wrote:
> Amir Rohan wrote:
>> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
>>> amir.rohan@mail.com wrote:
>>>
>>>> postgres     2181  0.0  0.1 134468  9504 pts/0    T    03:34   0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>>>> postgres     2183  0.0  0.0 134576  4168 ?        Ss   03:34   0:00 postgres: checkpointer process
>>>> postgres     2184  0.0  0.0 134604  2844 ?        Ss   03:34   0:00 postgres: writer process
>>>> postgres     2185  0.0  0.0 134468  2780 ?        Ss   03:34   0:00 postgres: wal writer process
>>>> postgres     2186  0.0  0.0      0     0 ?        Zs   03:34   0:00 [postgres] <defunct>         <<<<<<<<<<<<<<< dead process
>>>> postgres     2187  0.0  0.0 127300  2204 ?        Ss   03:34   0:00 postgres: stats collector process
>>>> postgres     2193  0.0  0.0 118164  2696 pts/0    T    03:34   0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>>>> postgres     2194  0.0  0.0 134916  6016 ?        Ss   03:34   0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>>>
>>> That postmaster is in STOPped mode is the issue here.  That doesn't
>>> happen unless you take specific action to do that.
>>
>> I hadn't noticed that.  That looks like I suspended pg_ctl during start,
>>  but with the backup in progress already, it's not clear how I managed
>> that state. There was no kill -SIGSTOP involved...
>
> Suspending a process *is* sending sigstop.  You may not have sent
> sigstop explicitly, but the shell would have done it if you suspended
> the process.
>
> Since pg_ctl is not normally long-lived, I'm not sure how you ended up
> suspending it.
>
>> After killing some subprocesses at random I do see postgres
>> restarting the whole group once one goes down, if/once it's
>> running/unsuspended.
>
> Well, doing things randomly is unlikely to teach you much ...
>

Pardon my earlier HTML response, I had to use the webmail interface at
the time. Sending again as text.

>
>
> Sent: Monday, September 28, 2015 at 12:06 AM
> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> To: "Amir Rohan" <amir.rohan@mail.com>
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: BUG #13643: Should a process dying bring postgresql down, or not?

> Amir Rohan wrote:
>> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
>> > amir.rohan@mail.com wrote:
>> >
>> >> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>> >> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
>> >> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
>> >> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
>> >> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
>> >> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
>> >> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>> >> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>> >
>> > That postmaster is in STOPped mode is the issue here. That doesn't
>> > happen unless you take specific action to do that.
>>
>> I hadn't noticed that. That looks like I suspended pg_ctl during start,
>> but with the backup in progress already, it's not clear how I managed
>> that state. There was no kill -SIGSTOP involved...
>
> Suspending a process *is* sending sigstop. You may not have sent
> sigstop explicitly, but the shell would have done it if you suspended
> the process.
>

I *know*. But as you can see that backup process is already underway.
That means pg_ctl had returned by then, and I had issued the
pg_basebackup command. Since I didn't manually send a SIGSTOP,
and postgres was already detached by then, I don't know how it
could have gotten suspended.

> Since pg_ctl is not normally long-lived, I'm not sure how you ended up
> suspending it.
>

exactly.

>> After killing some subprocesses at random I do see postgres
>> restarting the whole group once one goes down, if/once it's
>> running/unsuspended.
>
> Well, doing things randomly is unlikely to teach you much ...
>

Well, it can teach you which electric socket will
electrocute you when poked with a fork. That's useful data.

Amir

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
Alvaro Herrera
Date:
Amir Rohan wrote:
> On 09/28/2015 12:06 AM, Alvaro Herrera wrote:
> > Amir Rohan wrote:

> >> > That postmaster is in STOPped mode is the issue here. That doesn't
> >> > happen unless you take specific action to do that.
> >>
> >> I hadn't noticed that. That looks like I suspended pg_ctl during start,
> >> but with the backup in progress already, it's not clear how I managed
> >> that state. There was no kill -SIGSTOP involved...
> >
> > Suspending a process *is* sending sigstop. You may not have sent
> > sigstop explicitly, but the shell would have done it if you suspended
> > the process.
>
> I *know*. But as you can see that backup process is already underway.
> That means pg_ctl had returned by then, and I had issued the
> pg_basebackup command. Since I didn't manually send a SIGSTOP,
> and postgres was already detached by then, I don't know how it
> could have gotten suspended.

Maybe if you do pg_ctl in a terminal and it remains there as an
unfinished job, then close the terminal, it will get sent a SIGSTOP.
I have vague recollections that stuff worked in this way.

> >> After killing some subprocesses at random I do see postgres
> >> restarting the whole group once one goes down, if/once it's
> >> running/unsuspended.
>
> > Well, doing things randomly is unlikely to teach you much ...
>
> Well, it can teach you which electric socket will
> electrocute you when poked with a fork. That's useful data.

If you *learn* which one it was, you weren't doing it randomly but
systematically trying them all.  That's what I wanted to point out.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13643: Should a process dying bring postgresql down, or not?

From
"Amir Rohan"
Date:
> Sent: Tuesday, September 29, 2015 at 12:53 AM
> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> To: "Amir Rohan" <amir.rohan@mail.com>
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #13643: Should a process dying bring postgresql down, or not?

> > > Well, doing things randomly is unlikely to teach you much ...
> >
> > Well, it can teach you which electric socket will
> > electrocute you when poked with a fork. That's useful data.
>
> If you *learn* which one it was, you weren't doing it randomly but
> systematically trying them all.  That's what I wanted to point out.
>

On average, I would have to poke a fork randomly in precisely one socket,
if I paid the bill. But point taken.