Re: Better way of dealing with pgstat wait timeout during buildfarm runs? - Mailing list pgsql-hackers

Andres Freund <andres@2ndquadrant.com> writes:
> So I think a better way to deal with that warning would be a good
> idea. Besides somehow making the mechanism there are two ways to attack
> this that I can think of, neither of them awe inspiring:

> 1) Make that WARNING a LOG message instead. Since those don't get sent
> to the client with default settings...
> 2) Increase PGSTAT_MAX_WAIT_TIME even further than what 99b545 increased
> it to.

Yeah, I've been getting more annoyed by that too lately.  I keep wondering
though whether there's an actual bug underneath that behavior that we're
failing to see.  PGSTAT_MAX_WAIT_TIME is already 10 seconds; it's hard to
credit that increasing it still further would be "fixing" anything.
The other change would also mainly just sweep the issue under the rug,
if there is any issue and it's not just that we're overloading
underpowered buildfarm machines.  (Maybe a better fix would be to reduce
MAX_CONNECTIONS for the tests on these machines?)

I wonder whether, when multiple processes are demanding statsfile updates,
there's some misbehavior that causes them to suck CPU away from the stats
collector and/or convince it that it doesn't need to write anything.
There are odd things in the logs in some of these events.  For example in
today's failure on hamster,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2014-12-25%2016%3A00%3A07
there are two client-visible wait-timeout warnings, one each in the
gist and spgist tests.  But in the postmaster log we find these in
fairly close succession:

[549c38ba.724d:2] WARNING:  pgstat wait timeout
[549c39b1.73e7:10] WARNING:  pgstat wait timeout
[549c38ba.724d:3] WARNING:  pgstat wait timeout

Correlating these with other log entries shows that the first and third
are from the autovacuum launcher while the second is from the gist test
session.  So the spgist timeout never got logged, and in any case the
big picture is that we had four timeout warnings (the three above plus
the unlogged spgist one) occurring in a pretty short span of time, in a
parallel test set that's not all that demanding (12 parallel tests, well
below our max).  Not sure what to make of that.
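
For anyone wanting to repeat that correlation, here is a minimal sketch
of decoding the bracketed prefixes.  It assumes the log prefix is
"[%c:%l]", which is a guess from how the lines look and not something
stated above; %c is the session ID (hex process start time, a dot, hex
PID) and %l is the per-session line number.  The helper names
decode_prefix and group_by_session are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed prefix layout "[%c:%l]": hex start time, ".", hex PID, ":", line number.
PREFIX_RE = re.compile(r"^\[([0-9a-f]+)\.([0-9a-f]+):(\d+)\]\s*(.*)$")

# The three postmaster log lines quoted in this message.
LOG_LINES = [
    "[549c38ba.724d:2] WARNING:  pgstat wait timeout",
    "[549c39b1.73e7:10] WARNING:  pgstat wait timeout",
    "[549c38ba.724d:3] WARNING:  pgstat wait timeout",
]


def decode_prefix(line):
    """Split one log line into session ID, start time, PID, line number, message."""
    m = PREFIX_RE.match(line)
    if m is None:
        return None  # line does not carry the assumed prefix
    start_hex, pid_hex, line_no, message = m.groups()
    return {
        "session": f"{start_hex}.{pid_hex}",
        "started": datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc),
        "pid": int(pid_hex, 16),
        "line_no": int(line_no),
        "message": message,
    }


def group_by_session(lines):
    """Collect decoded entries per session ID, preserving log order."""
    sessions = {}
    for line in lines:
        entry = decode_prefix(line)
        if entry is not None:
            sessions.setdefault(entry["session"], []).append(entry)
    return sessions


if __name__ == "__main__":
    for session, entries in group_by_session(LOG_LINES).items():
        first = entries[0]
        print(session, "pid", first["pid"],
              "started", first["started"].isoformat(),
              "lines", [e["line_no"] for e in entries])
```

Grouping by session ID is what shows the first and third warnings coming
from one process (549c38ba.724d) and the second from another; matching
those IDs against earlier lines in the same log is then what identifies
which process each one is.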

BTW, I notice that in the current state of pgstat.c, all the logic for
keeping track of request arrival times is dead code, because nothing is
actually looking at DBWriteRequest.request_time.  This makes me think that
somebody simplified away some logic we maybe should have kept.  But if
we're going to leave it like this, we could replace the DBWriteRequest
data structure with a simple OID list and save a fair amount of code.

        regards, tom lane


