Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)
Date
Msg-id 547632CD.9080609@fuzzy.cz
Whole thread Raw
In response to Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)  (Maxim Boguk <maxim.boguk@gmail.com>)
List pgsql-bugs
On 26.11.2014 19:48, Maxim Boguk wrote:
>
>
> On Wed, Nov 26, 2014 at 9:08 PM, Maxim Boguk <maxim.boguk@gmail.com
> <mailto:maxim.boguk@gmail.com>> wrote:
>
>
> On one of my servers stat collector went crasy on idle system. It
> constantly write around 50MB/s on idle database (no activity at
> all).
>
> OK there are results of some low-level gdb kung-fu:
>
> problem point is:
>             /*
>              * Write the stats file if a new request has arrived that is not
>              * satisfied by existing file.
>              */
>             if (last_statwrite < last_statrequest)
>                 pgstat_write_statsfile(false);
>
> on some loop iteration difference between last_statrequest and
> last_statwrite is around 23000-29000 (e.g. 20-30ms).
>
> Hm may be there some clock drift between CPU cores?

Maybe, although I'd expect much smaller difference in such case. 20-30ms
is a bit too much, IMHO. Which clock source are you using?

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Do you have NTP on the machine? Is it working correctly?

There's code in pgstat.c that should handle such clock skew cases,
around line 3560:

https://github.com/postgres/postgres/blob/REL9_2_STABLE/src/backend/postmaster/pgstat.c#L3564

Do you get those messages in the log?

Can you perform more gdb-fu? Are the last_statwrite / last_statrequest
timestamps advance "equally"?

Do you know what processes are sending the requests to write the file?

The only source of 'write statfile' requests I can think of at the
moment is the autovacuum launcher (who needs stats for all databases).
Let's assume there really is ~30ms clock skew between the postmaster and
the launcher. If I read the code correctly, then:

 1) the launcher accepts only stats not older than 10ms

 2) the timestamp is read from the file, and is representing timestamp
    when writing the file started

 3) if the clock skew is >10ms, or if writing the file takes >10ms (or
    combination of those two times), a request is sent

 4) this is done repeatedly for 10 seconds (requests 10ms apart)

Maybe the stats file grew a bit, and now the difference got slightly
over 10ms?

The other thing that might happen is inability to write the new file.
But that's inconsistent with small file size (I've seen that e.g. for
files exceeding the tmpfs size limit, but if you only have ~5MB
pgstat.stat file that's unlikely).

Also, it would cause "could not write temporary statistics" in the log,
and the difference between the timestamps would grow (while you're
saying it's ~30ms).

regards
Tomas

pgsql-bugs by date:

Previous
From: Maxim Boguk
Date:
Subject: Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)
Next
From: Maxim Boguk
Date:
Subject: Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)