Re: Better way of dealing with pgstat wait timeout during buildfarm runs? - Mailing list pgsql-hackers

From Matt Kelly
Subject Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Msg-id CA+KcUki1DwxqbBPt8ELyVDtS4JrK=hmiEwOS5+6LqtU-MhrMdg@mail.gmail.com
In response to Re: Better way of dealing with pgstat wait timeout during buildfarm runs?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
List pgsql-hackers
> Sure, but nobody who is not a developer is going to care about that.
> A typical user who sees "pgstat wait timeout", or doesn't, isn't going
> to be able to make anything at all out of that.

As a user, I wholeheartedly disagree.

That warning helped me massively in diagnosing an unhealthy database server in the past at TripAdvisor (i.e. a high-end, server-class box, not a Raspberry Pi).  I have realtime monitoring that looks at pg_stat_database at regular intervals, particularly at the velocity of change of the xact_commit and xact_rollback columns, similar to how check_postgres does it: https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L4234
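
To make "velocity of change" concrete, here is a minimal sketch of the calculation, not my actual monitoring code and not what check_postgres does line for line.  The Sample type and the tps() helper are names made up for illustration; only xact_commit and xact_rollback come from pg_stat_database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One poll of pg_stat_database for a single database (hypothetical type)."""
    polled_at: float    # client-side clock at poll time, in seconds
    xact_commit: int    # value of pg_stat_database.xact_commit
    xact_rollback: int  # value of pg_stat_database.xact_rollback


def tps(prev: Sample, curr: Sample) -> float:
    """Transactions per second between two polls (commits plus rollbacks)."""
    elapsed = curr.polled_at - prev.polled_at
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = (curr.xact_commit - prev.xact_commit) + (
        curr.xact_rollback - prev.xact_rollback
    )
    if delta < 0:
        # Counters went backwards, e.g. after a statistics reset.
        # Assumption: report 0.0 for that interval rather than a negative rate.
        return 0.0
    return delta / elapsed
```

Note that the divisor is the time between polls on the client side, which is only right if the statistics were actually refreshed between the two polls.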

When one of those servers was unhealthy, it stopped reporting statistics for 30+ seconds at a time.  My dashboard, which polled far more frequently than that, indicated the server was normally processing 0 tps with intermittent spikes.  I went directly onto the server and sampled pg_stat_database.  That warning was the only thing that directly indicated that the statistics collector was not to be trusted.  The collector was obviously a victim of what was going on in the server, but it's pretty important to know when your methods for measuring server health are lying to you.  At first glance the spiky TPS looks like some sort of livelock, not just an overloaded server.
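
In hindsight, a dashboard could flag that pattern itself.  A rough sketch, assuming the input is a list of cumulative transaction counts (one per poll) and that a run of identical readings is suspicious; classify() and min_frozen_polls are hypothetical names, and a genuinely idle server would trip this check too, so it is only a hint:

```python
from typing import List


def classify(counters: List[int], min_frozen_polls: int = 3) -> List[str]:
    """Label each poll 'ok' or 'suspect'.

    A poll is 'suspect' once the counter has been identical to the previous
    reading for at least min_frozen_polls consecutive polls.
    """
    labels = []
    frozen = 0  # consecutive polls with no change so far
    for i, value in enumerate(counters):
        if i > 0 and value == counters[i - 1]:
            frozen += 1
        else:
            frozen = 0
        labels.append("suspect" if frozen >= min_frozen_polls else "ok")
    return labels
```

For example, classify([10, 20, 20, 20, 20, 500, 510]) marks only the fifth poll as suspect: the counter sat at 20 for three polls in a row and then jumped to 500, which is the "0 tps with intermittent spikes" shape described above.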

Now, I know: 0 change in stats = collector broken.  Rereading the docs:

 Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building the server).

Without context this merely reads: "We sleep for 500ms, plus the time to write the file, plus whenever the OS decides to enforce the timer interrupt... so like 550-650ms."  It doesn't read, "When the server is unhealthy, but _still_ serving queries, the stats collector might not be able to keep up and will just stop reporting stats altogether."

I think the warning is incredibly valuable.  Along those lines I'd also love to see a pg_stat_snapshot_timestamp() for monitoring code to use to determine if it's using a stale snapshot, as well as helping to smooth graphs of the statistics that are based on highly granular snapshotting.
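
To illustrate how monitoring code might use such a timestamp, here is a sketch under stated assumptions: each reading carries the time its statistics snapshot was written, as the proposed pg_stat_snapshot_timestamp() would report it.  That function is only a proposal here, and smoothed_tps() and the Snapshot tuple are hypothetical names.

```python
from typing import Optional, Tuple

# (snapshot timestamp in seconds, cumulative commits + rollbacks).
# The timestamp is assumed to come from the proposed function, not the client clock.
Snapshot = Tuple[float, int]


def smoothed_tps(prev: Snapshot, curr: Snapshot) -> Optional[float]:
    """Rate between two snapshots, or None if the snapshot has not advanced."""
    prev_ts, prev_total = prev
    curr_ts, curr_total = curr
    if curr_ts <= prev_ts:
        # Same snapshot as last time: the data is stale, so report
        # "unknown" instead of a misleading 0 tps.
        return None
    # Divide by the time between snapshots, not the time between polls,
    # so a late snapshot is spread over the interval it really covers.
    return (curr_total - prev_total) / (curr_ts - prev_ts)
```

With this, a collector that goes quiet for 30 seconds and then reports 3000 new transactions graphs as a gap followed by 100 tps, rather than as a flat line at zero followed by a spike.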

- Matt Kelly
