Thread: pgstat wait timeout

pgstat wait timeout

From
Steve Crawford
Date:
I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is 
currently in test/dev mode. I'm currently seeing the following messages 
occurring every few seconds:

...
Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:43:59 foo postgres[23697]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:04 foo postgres[27324]: [71402-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:44:09 foo postgres[23715]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:17 foo postgres[27324]: [71403-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:44:22 foo postgres[23716]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:27 foo postgres[27324]: [71404-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:44:33 foo postgres[23718]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:54 foo postgres[27324]: [71405-1] : WARNING:  pgstat wait 
timeout
Dec 27 17:44:59 foo postgres[23824]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:45:04 foo postgres[27324]: [71406-1] : WARNING:  pgstat wait 
timeout

I can't correlate events exactly, but the messages seem to have started 
shortly after I dropped a pgbench user and database. My Googling turned 
up various requests for debugging info on "hackers". Since the system 
isn't live, I haven't touched it in case anyone wants me to collect 
debugging info.

Otherwise, I plan on just blowing the install away and replacing it with 9.1

Cheers,
Steve



Re: pgstat wait timeout

From
Alvaro Herrera
Date:
Excerpts from Steve Crawford's message of mar dic 27 22:51:06 -0300 2011:
> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
> currently in test/dev mode. I'm currently seeing the following messages
> occurring every few seconds:
>
> ...
> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
> timeout
> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
> timeout

Hm, so can you strace the stats collector to see what it's doing?  Maybe
grab a backtrace with GDB from it before anything else.

My guess is 27324 is the autovac launcher and the others are autovac
workers just as they die.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: pgstat wait timeout

From
Steve Crawford
Date:
On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
> Excerpts from Steve Crawford's message of mar dic 27 22:51:06 -0300 2011:
>> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
>> currently in test/dev mode. I'm currently seeing the following messages
>> occurring every few seconds:
>>
>> ...
>> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
>> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
>> timeout
>> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
>> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
>> timeout
> Hm, so can you strace the stats collector to see what it's doing?  Maybe
> grab a backtrace with GDB from it before anything else.
>
> My guess is 27324 is the autovac launcher and the others are autovac
> workers just as they die.
>
You are correct. 27324 is the launcher and the others are autovac 
workers. Here's the strace of the stats collector process:

getppid()                               = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
getppid()                               = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
getppid()                               = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
....rinse...lather...repeat...ad nauseum...

And the backtrace:

#0  0x00007ff4d2e80f58 in poll () from /lib/libc.so.6
#1  0x00007ff4d4e6f465 in ?? ()
#2  0x00007ff4d4e6fd83 in pgstat_start ()
#3  0x00007ff4d4e73475 in ?? ()
#4 <signal handler called>
#5  0x00007ff4d2e85fd3 in select () from /lib/libc.so.6
#6  0x00007ff4d4e71b93 in ?? ()
#7  0x00007ff4d4e74b01 in PostmasterMain ()
#8  0x00007ff4d4e193b3 in main ()

Cheers,
Steve



Re: pgstat wait timeout

From
Alvaro Herrera
Date:
Excerpts from Steve Crawford's message of mié dic 28 13:24:37 -0300 2011:
> On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
> > Excerpts from Steve Crawford's message of mar dic 27 22:51:06 -0300 2011:
> >> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
> >> currently in test/dev mode. I'm currently seeing the following messages
> >> occurring every few seconds:
> >>
> >> ...
> >> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> >> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
> >> timeout
> >> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> >> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
> >> timeout
> > Hm, so can you strace the stats collector to see what it's doing?  Maybe
> > grab a backtrace with GDB from it before anything else.
> >
> > My guess is 27324 is the autovac launcher and the others are autovac
> > workers just as they die.
> >
> You are correct. 27324 is the launcher and the others are autovac
> workers. Here's the strace of the stats collector process:
>
> getppid()                               = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> getppid()                               = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> getppid()                               = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> ....rinse...lather...repeat...ad nauseum...

Weird ... even across more "pgstat wait timeout" messages?  It's like
it's not getting the "inquiry" messages that would tell it to write the
file ... something wrong with the UDP socket perhaps?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: pgstat wait timeout

From
Steve Crawford
Date:
On 12/28/2011 09:34 AM, Alvaro Herrera wrote:
> Excerpts from Steve Crawford's message of mié dic 28 13:24:37 -0300 2011:
>> On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
>>> Excerpts from Steve Crawford's message of mar dic 27 22:51:06 -0300 2011:
>>>> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
>>>> currently in test/dev mode. I'm currently seeing the following messages
>>>> occurring every few seconds:
>>>>
>>>> ...
>>>> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
>>>> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
>>>> timeout
>>>> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
>>>> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
>>>> timeout
>>> Hm, so can you strace the stats collector to see what it's doing?  Maybe
>>> grab a backtrace with GDB from it before anything else.
>>>
>>> My guess is 27324 is the autovac launcher and the others are autovac
>>> workers just as they die.
>>>
>> You are correct. 27324 is the launcher and the others are autovac
>> workers. Here's the strace of the stats collector process:
>>
>> getppid()                               = 27320
>> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
>> getppid()                               = 27320
>> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
>> getppid()                               = 27320
>> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
>> ....rinse...lather...repeat...ad nauseum...
> Weird ... even across more "pgstat wait timeout" messages?  It's like
> it's not getting the "inquiry" messages that would tell it to write the
> file ... something wrong with the UDP socket perhaps?
>
Bingo!

postgres  27325 postgres    8u     *IPv6*            5379428
0t0        UDP localhost:47204->localhost:47204

In working on diagnosing a network timeout issue over an IPv4 to IPv4
VPN I disabled IPv6 via sysctl on this machine and pretty much forgot
about it since we are still IPv4 internally. But PostgreSQL had already
established a (now non-functional) IPv6 local connection. Re-enabling
IPv6, as it was not related to the VPN timeouts, corrected the "pgstat
wait timeout" issue.

Cheers,
Steve