Thread: BUG #13900: stop standby failed with writer process hang(happen 3 times in 2 days)

From
amutu@amutu.com
Date:
The following bug has been logged on the website:

Bug reference:      13900
Logged by:          Jov
Email address:      amutu@amutu.com
PostgreSQL version: 9.3.7
Operating system:   FreeBSD 10.2 amd64
Description:

I am upgrading my 3 databases from pg9.3 to pg9.5, and I may have found a bug
in the bgwriter of pg9.3. I cannot stop all of the standby processes: even
after an immediate-mode stop and kill -9, the writer process is still there,
with ps state "Ds" (D marks a process in disk (or other short term,
uninterruptible) wait). According to Google, the only way to clear a "Ds"
process is to reboot the system. truss shows no output for the process, and
procstat shows that the process is in the poll system call in the kernel.

Here are the details:
pg_ctl -D ./slave stop -m fast
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

psql postgres
psql: FATAL:  the database system is shutting down

pg_ctl -D ./slave stop -m immediate
waiting for server to shut down.... done
server stopped

ps auxwww | grep postgres
jovz   976  0.0  0.3  28840  5232  -  Is   17 116       0:00.04 postgres: logger process    (postgres)
jovz   979  0.0  0.7 196940 13552  -  Ds   17 116       0:06.03 postgres: writer process    (postgres)
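
To pick out processes in the "D" state from a listing like the one above, a minimal sketch follows. It assumes the column layout shown in this ps output, where the state is the 8th whitespace-separated field and the PID is the 2nd. The function name d_state_pids is only illustrative.

```shell
#!/bin/sh
# Print the PID of every process whose state field begins with "D"
# (disk or other uninterruptible wait).
# Assumption: input is in the `ps auxwww` layout shown above, so the
# state is field 8 and the PID is field 2. A header line, if present,
# is skipped naturally because its 8th field ("STAT") does not start with D.
d_state_pids() {
    awk '$8 ~ /^D/ { print $2 }'
}
```

Usage would be along the lines of `ps auxwww | d_state_pids`; on the listing above it should report only the writer process.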

log:
2016-01-30 14:23:22.350 CST,,,947,,569b1bc2.3b3,3,,2016-01-17 12:42:42 CST,,0,LOG,00000,"received fast shutdown request",,,,,,,,,""
2016-01-30 14:23:22.350 CST,,,947,,569b1bc2.3b3,4,,2016-01-17 12:42:42 CST,,0,LOG,00000,"aborting any active transactions",,,,,,,,,""
2016-01-30 14:25:35.271 CST,,,64815,"",56ac575f.fd2f,1,"",2016-01-30 14:25:35 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2016-01-30 14:25:35.274 CST,"jovz","f",64815,"[local]",56ac575f.fd2f,2,"",2016-01-30 14:25:35 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""
2016-01-30 14:25:38.324 CST,,,64817,"",56ac5762.fd31,1,"",2016-01-30 14:25:38 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2016-01-30 14:25:38.324 CST,"jovz","f",64817,"[local]",56ac5762.fd31,2,"",2016-01-30 14:25:38 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""
2016-01-30 14:47:36.727 CST,,,65457,"",56ac5c88.ffb1,1,"",2016-01-30 14:47:36 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2016-01-30 14:47:36.727 CST,"jovz","postgres",65457,"[local]",56ac5c88.ffb1,2,"",2016-01-30 14:47:36 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""
2016-01-30 14:50:04.564 CST,,,947,,569b1bc2.3b3,5,,2016-01-17 12:42:42 CST,,0,LOG,00000,"received immediate shutdown request",,,,,,,,,""


truss -p 979

^Ctruss: Unexpect stop in waitpid: Interrupted system call
root@fblax:~ # procstat -kk 979
  PID    TID COMM             TDNAME           KSTACK
  979 100688 postgres         -                mi_switch+0xe1 sleepq_timedwait_sig+0x8b _cv_timedwait_sig_sbt+0x18b seltdwait+0xa4 kern_poll+0x464 sys_poll+0x61 amd64_syscall+0x357 Xfast_syscall+0xfb
root@fb:~ # kill -9 979
root@fb:~ # procstat -kk 979
  PID    TID COMM             TDNAME           KSTACK
  979 100688 postgres         -                mi_switch+0xe1 sleepq_timedwait_sig+0x8b _cv_timedwait_sig_sbt+0x18b seltdwait+0xa4 kern_poll+0x464 sys_poll+0x61 amd64_syscall+0x357 Xfast_syscall+0xfb
On Sat, Jan 30, 2016 at 8:13 AM,  <amutu@amutu.com> wrote:

> I am updating my 3 database from pg9.3 to pg9.5,but may find a bug for the
> bgwriter of pg9.3.I can't stop all the stand by process,even for immediate
> stop mode and kill -9,the writer process still there,with ps state "Ds"  (D
> Marks a process in disk (or other short term, uninterruptible) wait) .google
> say the only method to clean the "Ds" process is rebooting the system.
> truss say no info for the process,and procstat say the process is calling
> the poll system call in the kernel.

There is no way for PostgreSQL to cause this.  In all cases where I have
personally seen such behavior there was either failing storage hardware or a
driver with a bug (which I was always able to fix with the appropriate firmware
or driver upgrade).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
This happens on 2 machines; both use ZFS, and scrub reports no errors.

So far it has occurred on 2 machines, across 3 standby instances. I will try
to reproduce it on Linux.
On Jan 30, 2016 at 3:22 PM, "Kevin Grittner" <kgrittn@gmail.com> wrote:

> On Sat, Jan 30, 2016 at 8:13 AM,  <amutu@amutu.com> wrote:
>
> > I am updating my 3 database from pg9.3 to pg9.5,but may find a bug for the
> > bgwriter of pg9.3.I can't stop all the stand by process,even for immediate
> > stop mode and kill -9,the writer process still there,with ps state "Ds"  (D
> > Marks a process in disk (or other short term, uninterruptible) wait) .google
> > say the only method to clean the "Ds" process is rebooting the system.
> > truss say no info for the process,and procstat say the process is calling
> > the poll system call in the kernel.
>
> There is no way for PostgreSQL to cause this.  In all cases where I have
> personally seen such behavior there was either failing storage hardware or a
> driver with a bug (which I was always able to fix with the appropriate firmware
> or driver upgrade).
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
On 1/29/2016 11:35 PM, Jov wrote:
> this happens on 2 machines,and both use zfs,scrub say no error.


I've used various versions of Postgres on both Solaris 10 and FreeBSD
9.3 (actually FreeNAS, with PostgreSQL in a jail) using ZFS, without any such
problems.

Do you have a reproducible case?



--
john r pierce, recycling bits in santa cruz