Weirdness with the stats collector process

From: Matthew Musgrove
One of our instances has been behaving oddly. Most queries are blazing fast; it appears that only some of the stats views are slow.

Queries against the following views are quick: pg_stat_activity, pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, pg_statio_sys_sequences (no system sequences), pg_stat_xact_user_functions, and pg_stat_replication.

When I query one of the other statistics views, the queries take roughly 10 seconds when they used to take milliseconds. Here's an explain of one such query: https://explain.depesz.com/s/IVEu

Logs show 'using stale statistics instead of current ones because stats collector is not responding' when I query the stats views.
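
The message is easy to spot in the server log. For example (assuming the default logging collector layout under $PGDATA; the log filename pattern is a guess):

-bash-4.1$ # count occurrences per log file since the stats went stale
-bash-4.1$ grep -c 'stale statistics' 9.4/data/pg_log/postgresql-*.log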

CPU usage is very low. Disk I/O looks okay to me.

-bash-4.1$ iostat
Linux 2.6.32-573.12.1.el6.x86_64 (hostname)         07/25/2016      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.80    0.00    0.15    0.01    0.00   97.04

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.45         4.85         7.45   33142948   50914838
sdb              31.48      1403.57       713.55 9597023154 4878965216
dm-0              0.14         0.58         0.56    3949720    3847592
dm-1              1.00         4.26         6.83   29155482   46727688

Most of the files in $PGDATA/pg_stat_tmp haven't been updated by PostgreSQL since July 7th at 14:16. The one exception is pgss_query_texts.stat. I shouldn't have, but I touched $PGDATA/pg_stat_tmp/db* on July 12th at 15:14 (doing so did trigger an autovacuum, though).

I set track_activities and track_counts to off and reloaded the config (pg_ctl reload). The log indicated that it noticed the change. I then created a directory /dev/shm/pg_stat_tmp, set track_activities and track_counts back on, set stats_temp_directory to '/dev/shm/pg_stat_tmp', and reloaded again. Again, the log indicated that it noticed the change, but nothing happened with stats. The new directory is empty. Roughly, the steps were as sketched below.
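
(A sketch of those steps; the SHOW at the end is just my way of confirming the reload took:)

-bash-4.1$ mkdir /dev/shm/pg_stat_tmp
-bash-4.1$ # edit postgresql.conf: stats_temp_directory = '/dev/shm/pg_stat_tmp'
-bash-4.1$ pg_ctl reload -D 9.4/data
-bash-4.1$ psql -c 'SHOW stats_temp_directory;'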

I signaled the statistics collector to quit (kill -QUIT). It did, and the postmaster restarted it. The new stats directory is still empty; the only updates to the old stats directory are to the pgss_query_texts.stat file.
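
(For anyone repeating this, finding the collector's PID looks something like the following; the exact process title is whatever your ps shows:)

-bash-4.1$ # the postmaster labels the process in its title
-bash-4.1$ PID=$(ps -U postgres -o pid=,cmd= | awk '/[s]tats collector/{print $1}')
-bash-4.1$ kill -QUIT "$PID"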

I tried getting stack traces and the like with gdb and strace, but they weren't really helpful.
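
(Reconstructed from memory, this is roughly what I ran against the collector's PID found above:)

-bash-4.1$ strace -tt -e trace=network -p "$PID"   # watch for recvfrom() activity
-bash-4.1$ gdb -batch -ex bt -p "$PID"             # one-shot backtrace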

Most of what I have tried was the result of help from Jim Nasby on postgresteam.slack.com. He and Lukas Fittl suggested that I post about my problem. There's one thing that we all agree on: it should not be this hard to figure out what is going on with the statistics collector.

I tried turning stats off and on again today. Here's what the two pg_stat_tmp directories look like now:

-bash-4.1$ ls -alh 9.4/data/pg_stat_tmp/
total 4.3M
drwx------  2 postgres postgres 4.0K Jul  7 14:16 .
drwx------ 19 postgres postgres 4.0K Jul 25 11:20 ..
-rw-------  1 postgres postgres 3.4K Jul 12 15:14 db_0.stat
-rw-------  1 postgres postgres  26K Jul 12 15:14 db_13003.stat
-rw-------  1 postgres postgres  13K Jul 12 15:14 db_1.stat
-rw-------  1 postgres postgres 140K Jul 12 15:14 db_2473933.stat
-rw-------  1 postgres postgres 203K Jul 12 15:14 db_2476698.stat
-rw-------  1 postgres postgres  27K Jul 12 15:14 db_2595334.stat
-rw-------  1 postgres postgres  28K Jul 12 15:14 db_2631762.stat
-rw-------  1 postgres postgres 1.5K Jul  7 14:16 global.stat
-rw-------  1 postgres postgres 3.9M Jul 25 13:39 pgss_query_texts.stat
-bash-4.1$ ls -alh /dev/shm/pg_stat_tmp/
total 0
drwxr-xr-x 2 postgres postgres 40 Jul 12 17:19 .
drwxrwxrwt 3 root     root     80 Jul 11 17:31 ..

Does anyone have any suggestions on how to:
 - see what the statistics collector is doing?
 - tell the postmaster to give the statistics collector the up to date config?
 - get stats working without restarting the instance? (I am not yet sure when that could happen, but it would have to be scheduled in advance to occur during an after-hours maintenance window.)

Thanks in advance,
Matt

--
Matthew Musgrove
Senior Software Engineer
Assurant Mortgage Solutions
817.788.4482
mmusgrove@emortgagelogic.com
matthew.musgrove@assurant.com

Re: Weirdness with the stats collector process

From: Tom Lane
Matthew Musgrove <mmusgrove@emortgagelogic.com> writes:
> One of our instances has been behaving oddly. Most queries are blazing fast. It appears to just be some of the
> stat views that are slow.

It sounds like requests for stats updates are not getting through to the
collector.  I wonder if your kernel is blocking those packets for some
reason (probably ill-considered iptables filters?).  There is a test
for packet filtering conducted during postmaster startup, but we don't
repeat it after that.

> Does anyone have any suggestions on how to:
>  - see what the statistics collector is doing?

strace?  You could at least confirm whether it's receiving anything.
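
For instance (a sketch; the pgrep -f match against the collector's process title is an assumption about your system):

$ strace -e trace=recvfrom -p $(pgrep -f 'stats collector')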

            regards, tom lane


Re: Weirdness with the stats collector process

From: Adrian Klaver
On 07/25/2016 11:50 AM, Matthew Musgrove wrote:
> One of our instances has been behaving oddly. Most queries are
> blazing fast. It appears to just be some of the stats views that are slow.
>

>
> Does anyone have any suggestions on how to:
>  - see what the statistics collector is doing?
>  - tell the postmaster to give the statistics collector the up to date
> config?
>  - get stats working without restarting the instance? (I am not yet sure
> when that could happen, but it would have to be scheduled in advance to
> occur during an after-hours maintenance window.)


Anything in the system logs that might help?

>
> Thanks in advance,
> Matt
>
> --


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Weirdness with the stats collector process

From: Matthew Musgrove
On 07/25/2016 03:20 PM, Tom Lane wrote:
> Matthew Musgrove <mmusgrove@emortgagelogic.com> writes:
>> One of our instances has been behaving oddly. Most queries are blazing fast. It appears to just be some of the stats views that are slow.
> It sounds like requests for stats updates are not getting through to the
> collector.  I wonder if your kernel is blocking those packets for some
> reason (probably ill-considered iptables filters?).  There is a test
> for packet filtering conducted during postmaster startup, but we don't
> repeat it after that.

Today was the first day I have had any significant amount of time to investigate since probably the 12th. I had checked iptables before. When I saw this message I checked, double-checked, and even re-ordered the iptables rules, but it didn't make any difference.
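
(The checks were along these lines, run as root, for both the IPv4 and IPv6 tables:)

-bash-4.1$ iptables -L -n -v    # -v adds packet counters, handy for spotting drops
-bash-4.1$ ip6tables -L -n -v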

I started going through the log files under /var/log again. I didn't see anything until I made it back to /var/log/messages for July 7th. I noticed some entries for system-config-network at 14:05.

I looked at ifconfig and the files in /etc/sysconfig/network-scripts but everything looked fine. I checked what ports the postmaster and stats collector were using ('lsof -i -P | grep <port>'). I'm not sure how I missed this before, but they were using an IPv6 localhost port.
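
(A more targeted variant of that check, restricted to UDP sockets owned by postgres; the collector showed up bound to [::1]:<port>:)

-bash-4.1$ lsof -nP -iUDP -a -u postgres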

I asked what network changes were made on July 7th around 14:05 and was told that they disabled IPv6! I had them turn it back on and stats started working immediately.
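
(How exactly they disabled it I don't know, but the usual RHEL 6 knob, and a quick way to verify the current state, is the sysctl below; 1 means disabled:)

-bash-4.1$ sysctl net.ipv6.conf.all.disable_ipv6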

I have since changed the config from listen_addresses = '*' to listen_addresses = '0.0.0.0', but we won't be restarting until our next maintenance window (in about a week), when they will disable IPv6 again.

Tom and Adrian, thank you both! :)


--
Matthew Musgrove

Re: Weirdness with the stats collector process

From: Tom Lane
Matthew Musgrove <mmusgrove@emortgagelogic.com> writes:
> I asked what network changes were made on July 7th around 14:05 and was told that they disabled IPv6! I had them turn
> it back on and stats started working immediately.

Ooops ...

> I have since changed the config from listen_addresses = '*' to listen_addresses = '0.0.0.0', but we won't be
> restarting until our next maintenance window (in about a week), when they will disable IPv6 again.

AFAIR, listen_addresses doesn't have any impact on our choice of stats
collector loopback port: we just look up "localhost" at postmaster start,
and use whatever that resolves as.  It's fairly common these days for that
to resolve as both "::1" and "127.0.0.1", and we'll use whichever of those
is first found to work.  So this change would probably have been fine if
it were made when PG wasn't running, but it was pretty foolish to do it to
a running system without even looking to see if any IPv6 connections were
open :-(
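
For instance, on a typical modern box:

$ getent ahosts localhost   # glibc's view; expect both ::1 and 127.0.0.1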

            regards, tom lane