Thread: Weirdness with the stats collector process
One of our instances has been behaving -- oddly. Most queries are blazing fast. It appears to just be some of the stat views that are slow.
Queries against the following views are quick: pg_stat_activity, pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, pg_statio_sys_sequences (no system sequences), pg_stat_xact_user_functions, and pg_stat_replication.
When I query one of the other statistics views, the queries take roughly 10 seconds when they used to take milliseconds. Here's an explain of one such query: https://explain.depesz.com/s/IVEu
Logs show 'using stale statistics instead of current ones because stats collector is not responding' when I query the stats views.
CPU usage is very low. Disk I/O looks okay to me.
-bash-4.1$ iostat
Linux 2.6.32-573.12.1.el6.x86_64 (hostname) 07/25/2016 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.80 0.00 0.15 0.01 0.00 97.04
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.45 4.85 7.45 33142948 50914838
sdb 31.48 1403.57 713.55 9597023154 4878965216
dm-0 0.14 0.58 0.56 3949720 3847592
dm-1 1.00 4.26 6.83 29155482 46727688
Most of the files in $PGDATA/pg_stat_tmp haven't been updated by PostgreSQL since July 7th at 14:16. The one exception is pgss_query_texts.stat. I shouldn't have but I touched $PGDATA/pg_stat_tmp/db* on July 12th at 15:14 (doing so did trigger an autovacuum though).
I set track_activities and track_counts to off and reloaded config (pg_ctl reload). The log indicated that it noticed the change. I created a directory /dev/shm/pg_stat_tmp, set track_activities and track_counts on, set stats_temp_directory to '/dev/shm/pg_stat_tmp' and reloaded config. Again, the log indicated that it noticed the change but nothing happened with stats. The new directory is empty.
I notified the statistics collector to quit (kill -QUIT). It did and the postmaster restarted it. The new stats directory is still empty. The only updates to the old stats directory is the pgss_query_texts.stat file.
I tried getting stack traces and the like with gdb and strace but it wasn't really helpful.
Most of what I have tried was the result of help by Jim Nasby on postgresteam.slack.com. He and Lukas Fittl suggested that I post about my problem. There's one thing that we all agree on... It should not be so hard to figure out what is going on with the statistics collector.
I tried turning stats off and on again today. Here's what the two pg_stat_tmp directories look like today:
so now the directory looks like:
-bash-4.1$ ls -alh 9.4/data/pg_stat_tmp/
total 4.3M
drwx------ 2 postgres postgres 4.0K Jul 7 14:16 .
drwx------ 19 postgres postgres 4.0K Jul 25 11:20 ..
-rw------- 1 postgres postgres 3.4K Jul 12 15:14 db_0.stat
-rw------- 1 postgres postgres 26K Jul 12 15:14 db_13003.stat
-rw------- 1 postgres postgres 13K Jul 12 15:14 db_1.stat
-rw------- 1 postgres postgres 140K Jul 12 15:14 db_2473933.stat
-rw------- 1 postgres postgres 203K Jul 12 15:14 db_2476698.stat
-rw------- 1 postgres postgres 27K Jul 12 15:14 db_2595334.stat
-rw------- 1 postgres postgres 28K Jul 12 15:14 db_2631762.stat
-rw------- 1 postgres postgres 1.5K Jul 7 14:16 global.stat
-rw------- 1 postgres postgres 3.9M Jul 25 13:39 pgss_query_texts.stat
-bash-4.1$ ls -alh /dev/shm/pg_stat_tmp/
total 0
drwxr-xr-x 2 postgres postgres 40 Jul 12 17:19 .
drwxrwxrwt 3 root root 80 Jul 11 17:31 ..
Does anyone have any suggestions on how to:
- see what the statistics collector is doing?
- tell the postmaster to give the statistics collector the up to date config?
- get stats working without restarting the instance? (I am not yet sure when that can happen but it would have to be scheduled [in advance] to occur during an after hours maintenance window.
Thanks in advance,
Matt
Queries against the following views are quick: pg_stat_activity, pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, pg_statio_sys_sequences (no system sequences), pg_stat_xact_user_functions, and pg_stat_replication.
When I query one of the other statistics views, the queries take roughly 10 seconds when they used to take milliseconds. Here's an explain of one such query: https://explain.depesz.com/s/IVEu
Logs show 'using stale statistics instead of current ones because stats collector is not responding' when I query the stats views.
CPU usage is very low. Disk I/O looks okay to me.
-bash-4.1$ iostat
Linux 2.6.32-573.12.1.el6.x86_64 (hostname) 07/25/2016 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.80 0.00 0.15 0.01 0.00 97.04
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.45 4.85 7.45 33142948 50914838
sdb 31.48 1403.57 713.55 9597023154 4878965216
dm-0 0.14 0.58 0.56 3949720 3847592
dm-1 1.00 4.26 6.83 29155482 46727688
Most of the files in $PGDATA/pg_stat_tmp haven't been updated by PostgreSQL since July 7th at 14:16. The one exception is pgss_query_texts.stat. I shouldn't have but I touched $PGDATA/pg_stat_tmp/db* on July 12th at 15:14 (doing so did trigger an autovacuum though).
I set track_activities and track_counts to off and reloaded config (pg_ctl reload). The log indicated that it noticed the change. I created a directory /dev/shm/pg_stat_tmp, set track_activities and track_counts on, set stats_temp_directory to '/dev/shm/pg_stat_tmp' and reloaded config. Again, the log indicated that it noticed the change but nothing happened with stats. The new directory is empty.
I notified the statistics collector to quit (kill -QUIT). It did and the postmaster restarted it. The new stats directory is still empty. The only updates to the old stats directory is the pgss_query_texts.stat file.
I tried getting stack traces and the like with gdb and strace but it wasn't really helpful.
Most of what I have tried was the result of help by Jim Nasby on postgresteam.slack.com. He and Lukas Fittl suggested that I post about my problem. There's one thing that we all agree on... It should not be so hard to figure out what is going on with the statistics collector.
I tried turning stats off and on again today. Here's what the two pg_stat_tmp directories look like today:
so now the directory looks like:
-bash-4.1$ ls -alh 9.4/data/pg_stat_tmp/
total 4.3M
drwx------ 2 postgres postgres 4.0K Jul 7 14:16 .
drwx------ 19 postgres postgres 4.0K Jul 25 11:20 ..
-rw------- 1 postgres postgres 3.4K Jul 12 15:14 db_0.stat
-rw------- 1 postgres postgres 26K Jul 12 15:14 db_13003.stat
-rw------- 1 postgres postgres 13K Jul 12 15:14 db_1.stat
-rw------- 1 postgres postgres 140K Jul 12 15:14 db_2473933.stat
-rw------- 1 postgres postgres 203K Jul 12 15:14 db_2476698.stat
-rw------- 1 postgres postgres 27K Jul 12 15:14 db_2595334.stat
-rw------- 1 postgres postgres 28K Jul 12 15:14 db_2631762.stat
-rw------- 1 postgres postgres 1.5K Jul 7 14:16 global.stat
-rw------- 1 postgres postgres 3.9M Jul 25 13:39 pgss_query_texts.stat
-bash-4.1$ ls -alh /dev/shm/pg_stat_tmp/
total 0
drwxr-xr-x 2 postgres postgres 40 Jul 12 17:19 .
drwxrwxrwt 3 root root 80 Jul 11 17:31 ..
Does anyone have any suggestions on how to:
- see what the statistics collector is doing?
- tell the postmaster to give the statistics collector the up to date config?
- get stats working without restarting the instance? (I am not yet sure when that can happen but it would have to be scheduled [in advance] to occur during an after hours maintenance window.
Thanks in advance,
Matt
--
Matthew Musgrove
Senior Software Engineer
Assurant Mortgage Solutions
817.788.4482
mmusgrove@emortgagelogic.com
matthew.musgrove@assurant.com
Senior Software Engineer
Assurant Mortgage Solutions
817.788.4482
mmusgrove@emortgagelogic.com
matthew.musgrove@assurant.com
Matthew Musgrove <mmusgrove@emortgagelogic.com> writes: > One of our instances has been behaving -- oddly. Most queries are blazing fast. It appears to just be some of the statviews that are slow. It sounds like requests for stats updates are not getting through to the collector. I wonder if your kernel is blocking those packets for some reason (probably ill-considered iptables filters?). There is a test for packet filtering conducted during postmaster startup, but we don't repeat it after that. > Does anyone have any suggestions on how to: > - see what the statistics collector is doing? strace? You could at least confirm whether it's receiving anything. regards, tom lane
On 07/25/2016 11:50 AM, Matthew Musgrove wrote: > One of our instances has been behaving -- oddly. Most queries are > blazing fast. It appears to just be some of the stat views that are slow. > > > Does anyone have any suggestions on how to: > - see what the statistics collector is doing? > - tell the postmaster to give the statistics collector the up to date > config? > - get stats working without restarting the instance? (I am not yet sure > when that can happen but it would have to be scheduled [in advance] to > occur during an after hours maintenance window. Anything in the system logs that might help? > > Thanks in advance, > Matt > > -- -- Adrian Klaver adrian.klaver@aklaver.com
On 07/25/2016 03:20 PM, Tom Lane wrote:
I started going through the log files under /var/log again. I didn't see anything until I made it back to /var/log/messages for July 7th. I noticed some entries for system-config-network at 14:05.
I looked at ifconfig and the files in /etc/sysconfig/network-scripts but everything looked fine. I checked what ports the postmaster and stats collector were using ('lsof -i -P | grep <port>'). I'm not sure how I missed this before but they were using an IPv6 localhost port.
I asked what network changes were made on July 7th around 14:05 and was told that they disabled IPv6! I had them turn it back on and stats started working immediately.
I have since changed the config from listen_addresses = '*' to listen_addresses = '0.0.0.0' but we won't be restarting until our next maintenance window (in about a week) when they will disable IPv6 again.
Tom and Adrian, thank you both! :)
Today was the first day that I have had any significant amount of time to investigate since probably the 12th. I had checked iptables before. When I saw this message I checked, double-checked and even re-ordered the iptables rules but it didn't make any difference.Matthew Musgrove <mmusgrove@emortgagelogic.com> writes:One of our instances has been behaving -- oddly. Most queries are blazing fast. It appears to just be some of the stat views that are slow.It sounds like requests for stats updates are not getting through to the collector. I wonder if your kernel is blocking those packets for some reason (probably ill-considered iptables filters?). There is a test for packet filtering conducted during postmaster startup, but we don't repeat it after that.
I started going through the log files under /var/log again. I didn't see anything until I made it back to /var/log/messages for July 7th. I noticed some entries for system-config-network at 14:05.
I looked at ifconfig and the files in /etc/sysconfig/network-scripts but everything looked fine. I checked what ports the postmaster and stats collector were using ('lsof -i -P | grep <port>'). I'm not sure how I missed this before but they were using an IPv6 localhost port.
I asked what network changes were made on July 7th around 14:05 and was told that they disabled IPv6! I had them turn it back on and stats started working immediately.
I have since changed the config from listen_addresses = '*' to listen_addresses = '0.0.0.0' but we won't be restarting until our next maintenance window (in about a week) when they will disable IPv6 again.
Tom and Adrian, thank you both! :)
--
Matthew Musgrove
Matthew Musgrove <mmusgrove@emortgagelogic.com> writes: > I asked what network changes were made on July 7th around 14:05 and was told that they disabled IPv6! I had them turn itback on and stats started working immediately. Ooops ... > I have since changed the config from listen_addresses = '*' to listen_addresses = '0.0.0.0' but we won't be restartinguntil our next maintenance window (in about a week) when they will disable IPv6 again. AFAIR, listen_addresses doesn't have any impact on our choice of stats collector loopback port: we just look up "localhost" at postmaster start, and use whatever that resolves as. It's fairly common these days for that to resolve as both "::1" and "127.0.0.1", and we'll use whichever of those is first found to work. So this change would probably have been fine if it were made when PG wasn't running, but it was pretty foolish to do it to a running system without even looking to see if any IPV6 connections were open :-( regards, tom lane