Thread: BUG #13643: Should a process dying bring postgresql down, or not?
The following bug has been logged on the website:

Bug reference: 13643
Logged by: Amir Rohan
Email address: amir.rohan@mail.com
PostgreSQL version: 9.5alpha2
Operating system: Linux
Description:

Seen on my box:

postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"

Not sure if this is a real problem or not, but it was my understanding that pg panics when a subprocess dies, for safety reasons.
amir.rohan@mail.com wrote:

> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>
> Not sure if this is a real problem or not, but it was my understanding that
> pg panics when a subprocess dies, for safety reasons.

A zombie process is a process that died and hasn't been collected by its parent process. In this case, postmaster is stopped ("T" above), so it cannot call wait() to collect the dead process. Once you signal postmaster to run again, it will either discover that the process died cleanly (and clean up state and all is well), or that it died uncleanly (in which case it will cause all other processes to stop).

That postmaster is in STOPped mode is the issue here. That doesn't happen unless you take specific action to do that.

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
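The zombie state described in the reply above can be reproduced outside PostgreSQL. What follows is only a minimal Python sketch, assuming Linux (it reads /proc). It is not PostgreSQL code: the "supervisor" process is a hypothetical stand-in for the postmaster, and process_state, wait_for_state and demo are names made up for this illustration. The sketch stops the supervisor with SIGSTOP, kills its child, and checks that the child stays in state "Z" until the supervisor is continued and can call wait.

```python
import os
import signal
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until pid is in one of the wanted states; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) in wanted:
            return True
        time.sleep(0.01)
    return False


def demo():
    read_end, write_end = os.pipe()
    supervisor = os.fork()
    if supervisor == 0:
        # Hypothetical stand-in for the postmaster: start one worker, reap it.
        try:
            os.close(read_end)
            worker = os.fork()
            if worker == 0:
                os.close(write_end)
                while True:
                    signal.pause()  # idle until killed
            os.write(write_end, str(worker).encode())
            os.close(write_end)
            os.waitpid(worker, 0)  # collects the worker once it has died
            os._exit(0)
        finally:
            os._exit(1)  # only reached if something above failed

    os.close(write_end)
    worker = int(os.read(read_end, 32).decode())
    os.close(read_end)

    os.kill(supervisor, signal.SIGSTOP)   # the parent is now suspended ("T")
    stopped = wait_for_state(supervisor, "T")
    os.kill(worker, signal.SIGKILL)       # the worker dies, nobody reaps it
    zombie = wait_for_state(worker, "Z")

    os.kill(supervisor, signal.SIGCONT)   # the parent runs again and reaps
    _, status = os.waitpid(supervisor, 0)
    reaped = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return {
        "parent_stopped": stopped,
        "child_zombie": zombie,
        "reaped_after_cont": reaped,
    }


if __name__ == "__main__":
    print(demo())
```

The worker cannot leave the "Z" state while its parent is stopped, which matches the "T" and "Zs" entries in the ps listing. As soon as the parent receives SIGCONT, its pending waitpid call returns and the zombie disappears.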
On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
> amir.rohan@mail.com wrote:
>
>> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
>> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
>> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
>> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
>> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
>> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>
> That postmaster is in STOPped mode is the issue here. That doesn't
> happen unless you take specific action to do that.
>

I hadn't noticed that. That looks like I suspended pg_ctl during start, but with the backup in progress already, it's not clear how I managed that state. There was no kill -SIGSTOP involved...

After killing some subprocesses at random I do see postgres restarting the whole group once one goes down, if/once it's running/unsuspended.

Excuse the noise.

Amir
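The behaviour observed above (one child going down takes the rest of the group with it) can be sketched as a toy supervisor in Python. This is an illustration under loud assumptions, not the postmaster's logic: spawn, classify and run_scenario are hypothetical names, the rule "only exit code 0 is clean" is a simplification, and the sketch only terminates the siblings rather than restarting them.

```python
import os
import signal


def spawn(behaviour):
    """Fork a child that exits cleanly, crashes, or idles (made-up roles)."""
    pid = os.fork()
    if pid == 0:
        if behaviour == "exit0":
            os._exit(0)                           # clean exit
        if behaviour == "crash":
            os.kill(os.getpid(), signal.SIGKILL)  # simulated crash
        while True:                               # "idle": wait for a signal
            signal.pause()
    return pid


def classify(status):
    """Simplified rule for this sketch: only exit code 0 counts as clean."""
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
        return "clean"
    return "unclean"


def run_scenario(behaviour, n_idle=2):
    """Watch one child; if it dies uncleanly, take its siblings down too."""
    siblings = [spawn("idle") for _ in range(n_idle)]
    watched = spawn(behaviour)
    _, status = os.waitpid(watched, 0)
    verdict = classify(status)

    terminated = 0
    if verdict == "unclean":
        # The sketch uses SIGKILL so the siblings cannot ignore the request.
        for pid in siblings:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
            terminated += 1
        siblings = []

    # Demo teardown only: do not leave idle children behind after a clean exit.
    for pid in siblings:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return verdict, terminated


if __name__ == "__main__":
    print(run_scenario("crash"))
    print(run_scenario("exit0"))
```

Note that the supervisor can only react once it is running: a stopped supervisor never returns from waitpid, which is why nothing happened in the ps listing while the postmaster was in state "T".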
Amir Rohan wrote:
> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
> > amir.rohan@mail.com wrote:
> >
> >> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
> >> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
> >> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
> >> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
> >> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
> >> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
> >> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
> >> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
> >
> > That postmaster is in STOPped mode is the issue here. That doesn't
> > happen unless you take specific action to do that.
>
> I hadn't noticed that. That looks like I suspended pg_ctl during start,
> but with the backup in progress already, it's not clear how I managed
> that state. There was no kill -SIGSTOP involved...

Suspending a process *is* sending sigstop. You may not have sent sigstop explicitly, but the shell would have done it if you suspended the process.

Since pg_ctl is not normally long-lived, I'm not sure how you ended up suspending it.

> After killing some subprocesses at random I do see postgres
> restarting the whole group once one goes down, if/once it's
> running/unsuspended.

Well, doing things randomly is unlikely to teach you much ...

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
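The stop/continue cycle discussed above can be observed from a parent process with waitpid. The following is a small Python sketch, assuming Linux (os.WCONTINUED is not available everywhere); stop_and_resume is a name invented for this example. One hedge on terminology: when a job is suspended with Ctrl-Z in an interactive shell, the terminal driver normally delivers SIGTSTP rather than SIGSTOP; the default effect is the same (state "T" in ps), and the sketch uses SIGSTOP because it cannot be caught or ignored.

```python
import os
import signal


def stop_and_resume():
    """Stop, continue and finally kill a child, recording what waitpid reports."""
    pid = os.fork()
    if pid == 0:
        while True:
            signal.pause()  # child: idle until a signal ends it

    events = []

    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)   # also report stopped children
    if os.WIFSTOPPED(status):
        events.append(("stopped", signal.Signals(os.WSTOPSIG(status)).name))

    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)  # also report continued children
    if os.WIFCONTINUED(status):
        events.append(("continued", None))

    os.kill(pid, signal.SIGKILL)                # clean up the demo child
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        events.append(("killed", signal.Signals(os.WTERMSIG(status)).name))
    return events


if __name__ == "__main__":
    for event in stop_and_resume():
        print(event)
```

A stopped process does nothing at all until it receives SIGCONT, so a stopped postmaster neither reaps dead children nor reacts to their exit status.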
On 09/28/2015 12:06 AM, Alvaro Herrera wrote:
> Amir Rohan wrote:
>> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
>>> amir.rohan@mail.com wrote:
>>>
>>>> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>>>> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
>>>> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
>>>> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
>>>> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
>>>> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
>>>> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>>>> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>>>
>>> That postmaster is in STOPped mode is the issue here. That doesn't
>>> happen unless you take specific action to do that.
>>
>> I hadn't noticed that. That looks like I suspended pg_ctl during start,
>> but with the backup in progress already, it's not clear how I managed
>> that state. There was no kill -SIGSTOP involved...
>
> Suspending a process *is* sending sigstop. You may not have sent
> sigstop explicitly, but the shell would have done it if you suspended
> the process.
>
> Since pg_ctl is not normally long-lived, I'm not sure how you ended up
> suspending it.
>
>> After killing some subprocesses at random I do see postgres
>> restarting the whole group once one goes down, if/once it's
>> running/unsuspended.
>
> Well, doing things randomly is unlikely to teach you much ...
>

Pardon my earlier HTML response, I had to use the webmail interface at the time. Sending again as text.
> Sent: Monday, September 28, 2015 at 12:06 AM
> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> To: "Amir Rohan" <amir.rohan@mail.com>
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: BUG #13643: Should a process dying bring postgresql down, or not?
>
> Amir Rohan wrote:
>> On 09/27/2015 09:59 PM, Alvaro Herrera wrote:
>> > amir.rohan@mail.com wrote:
>> >
>> >> postgres 2181 0.0 0.1 134468 9504 pts/0 T 03:34 0:00 /usr/local/pgsql/bin/postgres -D /home/local/pg/s1
>> >> postgres 2183 0.0 0.0 134576 4168 ? Ss 03:34 0:00 postgres: checkpointer process
>> >> postgres 2184 0.0 0.0 134604 2844 ? Ss 03:34 0:00 postgres: writer process
>> >> postgres 2185 0.0 0.0 134468 2780 ? Ss 03:34 0:00 postgres: wal writer process
>> >> postgres 2186 0.0 0.0 0 0 ? Zs 03:34 0:00 [postgres] <defunct> <<<<<<<<<<<<<<< dead process
>> >> postgres 2187 0.0 0.0 127300 2204 ? Ss 03:34 0:00 postgres: stats collector process
>> >> postgres 2193 0.0 0.0 118164 2696 pts/0 T 03:34 0:00 pg_basebackup -D /home/local/pg/backup -p 57833 --format=t -x
>> >> postgres 2194 0.0 0.0 134916 6016 ? Ss 03:34 0:00 postgres: wal sender process user1 [local] sending backup "pg_basebackup base backup"
>> >
>> > That postmaster is in STOPped mode is the issue here. That doesn't
>> > happen unless you take specific action to do that.
>>
>> I hadn't noticed that. That looks like I suspended pg_ctl during start,
>> but with the backup in progress already, it's not clear how I managed
>> that state. There was no kill -SIGSTOP involved...
>
> Suspending a process *is* sending sigstop. You may not have sent
> sigstop explicitly, but the shell would have done it if you suspended
> the process.
>

I *know*. But as you can see, that backup process is already underway. That means pg_ctl had returned by then, and I had issued the pg_basebackup command. Since I didn't manually send a SIGSTOP, and postgres was already detached by then, I don't know how it could have gotten suspended.

> Since pg_ctl is not normally long-lived, I'm not sure how you ended up
> suspending it.
>

Exactly.

>> After killing some subprocesses at random I do see postgres
>> restarting the whole group once one goes down, if/once it's
>> running/unsuspended.
>
> Well, doing things randomly is unlikely to teach you much ...
>

Well, it can teach you which electric socket will electrocute you when poked with a fork. That's useful data.

Amir
Amir Rohan wrote:
> On 09/28/2015 12:06 AM, Alvaro Herrera wrote:
> > Amir Rohan wrote:
> >> > That postmaster is in STOPped mode is the issue here. That doesn't
> >> > happen unless you take specific action to do that.
> >>
> >> I hadn't noticed that. That looks like I suspended pg_ctl during start,
> >> but with the backup in progress already, it's not clear how I managed
> >> that state. There was no kill -SIGSTOP involved...
> >
> > Suspending a process *is* sending sigstop. You may not have sent
> > sigstop explicitly, but the shell would have done it if you suspended
> > the process.
>
> I *know*. But as you can see, that backup process is already underway.
> That means pg_ctl had returned by then, and I had issued the
> pg_basebackup command. Since I didn't manually send a SIGSTOP,
> and postgres was already detached by then, I don't know how it
> could have gotten suspended.

Maybe if you do pg_ctl in a terminal and it remains there as an unfinished job, then close the terminal, it will get sent a SIGSTOP. I have vague recollections that stuff worked in this way.

> >> After killing some subprocesses at random I do see postgres
> >> restarting the whole group once one goes down, if/once it's
> >> running/unsuspended.
> >
> > Well, doing things randomly is unlikely to teach you much ...
>
> Well, it can teach you which electric socket will
> electrocute you when poked with a fork. That's useful data.

If you *learn* which one it was, you weren't doing it randomly but systematically trying them all. That's what I wanted to point out.

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Sent: Tuesday, September 29, 2015 at 12:53 AM
> From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> To: "Amir Rohan" <amir.rohan@mail.com>
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #13643: Should a process dying bring postgresql down, or not?
>
> > > Well, doing things randomly is unlikely to teach you much ...
> >
> > Well, it can teach you which electric socket will
> > electrocute you when poked with a fork. That's useful data.
>
> If you *learn* which one it was, you weren't doing it randomly but
> systematically trying them all. That's what I wanted to point out.
>

On average, I would have to poke a fork randomly in precisely one socket, if I paid the bill. But point taken.