Re: Autovacuum seems to block database: WARNING worker took too long to start - Mailing list pgsql-admin

From Pablo Delgado Díaz-Pache
Subject Re: Autovacuum seems to block database: WARNING worker took too long to start
Date
Msg-id AANLkTi=TFc1a37h_RepkeRH9WcqCzn3PP=XuJvRWycu6@mail.gmail.com
Whole thread Raw
In response to Autovacuum seems to block database: WARNING worker took too long to start  (Pablo Delgado Díaz-Pache <delgadop@gmail.com>)
List pgsql-admin
Just noticed the mysql database (we just used it for wordpress) that we had in the same server was also "stucked" and had to restart it again. Checking the "netstat -an" I noticed the following ...

# netstat -an | grep 3306
tcp        0      0 0.0.0.0:3306                0.0.0.0:*                   LISTEN
tcp        0      0 10.19.0.47:3306             10.19.0.46:33691            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33745            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33710            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33743            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33751            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33690            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33758            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33725            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33739            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33730            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33744            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33688            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33741            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33757            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33712            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33734            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33719            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33738            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33682            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33702            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33701            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33742            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33708            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33703            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33731            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33723            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33695            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33724            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33728            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33689            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33729            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33694            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33720            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33754            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33732            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33705            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33748            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33749            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33733            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33753            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33755            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33726            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33727            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33737            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33693            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33700            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33740            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33747            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33716            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33709            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33697            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33717            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33736            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33715            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33704            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33711            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33746            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33718            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33692            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33707            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33696            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33750            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33681            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33699            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33756            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33714            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33706            SYN_RECV
tcp        1      0 10.19.0.47:3306             10.19.0.46:49437            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49436            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49439            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49438            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49433            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49432            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49435            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49434            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49429            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49428            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49431            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49430            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49427            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49453            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49452            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49455            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49454            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49449            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49448            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49451            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49450            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49445            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49444            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49447            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49446            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49441            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49440            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49443            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49442            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49457            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49456            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30397            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30396            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30399            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30398            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30393            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30392            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30395            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30394            TIME_WAIT
tcp      100      0 10.19.0.47:3306             10.19.0.46:30391            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30409            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30408            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30411            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30410            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30405            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30404            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30407            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30406            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30401            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30400            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30403            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30402            TIME_WAIT

Didn't do the netstat -an for port 5432, but I guess all max_connections for postgres were busy. Just a guess ...

All connections are done from internal servers, and there are no firewall rules in that interface. Just in case here it goes the iptables -L -n -v

# iptables -L -n -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 219M   56G mirai      all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain mirai (1 references)
 pkts bytes target     prot opt in     out     source               destination
  12M 6207M ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
 6186  509K ACCEPT     icmp --  *      *       0.0.0.0/0            0.0.0.0/0           icmp type 255
 205M   50G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
   30  1800 ACCEPT     all  --  eth0   *       x.x.x.x        0.0.0.0/0
1023K   61M ACCEPT     all  --  eth1   *       10.0.0.0/8           0.0.0.0/0
 6224  373K ACCEPT     tcp  --  eth0   *       0.0.0.0/0            0.0.0.0/0           tcp multiport dports 80,443
 308K  183M REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           reject-with icmp-port-unreachable


2010/11/18 Pablo Delgado Díaz-Pache <delgadop@gmail.com>
Well, we had the problem again. This is what we did

1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (by shell script) queries hanged (they are scripts to monitor state of database. We use big-sister to do so).

The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.

This the result when we were having the problem. I've excluded irrelevant processes.

postgres   892  3889  0 09:07 ?        00:01:05 postgres: postgres international 10.19.0.51(49264) idle
postgres   934  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   935  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   936  3889  0 12:00 ?        00:00:00 [postmaster] <defunct>
postgres   938  3889  0 12:00 ?        00:00:01 [postmaster] <defunct>
postgres  1244  3889  0 12:01 ?        00:00:01 [postmaster] <defunct>
postgres  1453  3889  0 12:02 ?        00:00:02 [postmaster] <defunct>
postgres  1563  3889  0 12:03 ?        00:00:10 [postmaster] <defunct>
postgres  1862  3889  0 12:04 ?        00:00:04 [postmaster] <defunct>
postgres  2275  3889  0 12:06 ?        00:00:02 [postmaster] <defunct>
root      2287    49  0 Oct29 ?        00:00:00 [kondemand/0]
root      2288    49  0 Oct29 ?        00:00:00 [kondemand/1]
postgres  2485  3889  0 12:07 ?        00:00:05 postgres: postgres international 10.19.0.51(59066) idle
postgres  2490  3889  1 12:07 ?        00:00:26 postgres: postgres international 10.19.0.51(59070) idle
postgres  2568  3889  0 12:07 ?        00:00:00 postgres: postgres international 10.19.0.50(1249) idle
postgres  3544  3889  0 12:12 ?        00:00:07 [postmaster] <defunct>
postgres  3875  3889  0 12:14 ?        00:00:03 [postmaster] <defunct>
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  3896  3889  0 Nov16 ?        00:00:00 postgres: logger process
postgres  3913  3889  0 Nov16 ?        00:00:05 postgres: writer process
postgres  3914  3889  0 Nov16 ?        00:00:00 postgres: wal writer process
postgres  3915  3889  0 Nov16 ?        00:00:01 postgres: autovacuum launcher process
postgres  3916  3889  0 Nov16 ?        00:01:42 postgres: stats collector process
postgres  4169  3889  0 12:15 ?        00:00:02 [postmaster] <defunct>
postgres  4170  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4171  3889  0 12:15 ?        00:00:07 [postmaster] <defunct>
postgres  4281  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4672  3889  0 12:17 ?        00:00:01 [postmaster] <defunct>
postgres  5038  3889  0 12:19 ?        00:00:01 [postmaster] <defunct>
postgres  5203  3889  0 12:20 ?        00:00:00 [postmaster] <defunct>
postgres  5484  3889  0 12:21 ?        00:00:01 [postmaster] <defunct>
postgres  5694  3889  0 12:22 ?        00:00:01 postgres: postgres international 10.19.0.49(29113) idle
postgres  5769  3889  0 12:23 ?        00:00:04 [postmaster] <defunct>
postgres  5832  3889  0 12:23 ?        00:00:03 postgres: postgres international 10.19.0.49(29129) idle
postgres  5835  3889  0 12:23 ?        00:00:02 [postmaster] <defunct>
postgres  5979  3889  0 12:24 ?        00:00:01 [postmaster] <defunct>
postgres  6243  3889  0 12:25 ?        00:00:00 [postmaster] <defunct>
postgres  6424  3889  0 12:26 ?        00:00:01 postgres: postgres international 10.19.0.49(45160) idle
postgres  6425  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6429  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6599  3889  0 12:27 ?        00:00:02 [postmaster] <defunct>
postgres  6635  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6653  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6654  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6655  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6663  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6664  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6665  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6666  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6667  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6668  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6669  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6670  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6671  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6672  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6673  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6773  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6774  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6778  3889  0 12:28 ?        00:00:01 [postmaster] <defunct>
postgres  6779  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6785  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6787  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6788  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6796  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6797  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6837  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6838  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6839  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6840  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6841  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6842  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6843  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6844  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6852  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6859  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6860  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6862  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6872  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6873  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6874  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6910  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6970  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6971  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6972  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6973  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6978  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6990  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6991  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6992  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6993  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7000  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7088  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7089  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7090  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7091  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7094  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7095  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7096  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7122  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7126  3889  0 12:29 ?        00:00:03 postgres: postgres mrs 10.19.0.51(7717) idle
postgres  7127  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
root      7152  7150  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7157  7155  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7168  7166  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7174  7172  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7188  7186  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7196  7194  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7258  7256  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7263  7261  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7308  7306  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7313  7311  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7325  7323  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7330  7328  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7371  7369  0 12:37 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959  3889  0 11:09 ?        00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961  3889  0 11:09 ?        00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225  3889  0 11:10 ?        00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844  3889  0 11:23 ?        00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942  3889  0 11:53 ?        00:00:00 [postmaster] <defunct>
postgres 31973  3889  0 11:53 ?        00:00:16 [postmaster] <defunct>

2) We did a strace to the postmaster pid. However we had 2 postmasters not dead

# ps -fea |grep -i postmaster
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  7601  3889  0 12:37 ?        00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data

As soon as we did a "strace" to the 3889 pid everything started to work again. Not sure it was a coincidence but that was the way it was.

# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---

I also straced the other postmaster pid

# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8,  <unfinished ...>
Process 7601 detached

3) Checked the logs. We could see the same error lines we had before, but they dissapeared after just a few minutes.

2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG:  unexpected EOF on client connection

Any Ideas?

Thanks for your help and feedback!

Pablo

2010/11/16 Alvaro Herrera <alvherre@commandprompt.com>
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:

> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).

Yeah.  This means that the autovacuum launcher daemon is still running
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process.  This is the part that's not working.

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on.  I have never seen this problem.  If you
see it happen again, please get a stacktrace from postmaster; and/or
strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


pgsql-admin by date:

Previous
From: Mark Felder
Date:
Subject: Re: ZFS-FreeBSD + postgresql performance
Next
From: Pablo Delgado Díaz-Pache
Date:
Subject: Re: Autovacuum seems to block database: WARNING worker took too long to start