Re: Better way of dealing with pgstat wait timeout during buildfarm runs? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Msg-id 15221.1419559142@sss.pgh.pa.us
In response to Re: Better way of dealing with pgstat wait timeout during buildfarm runs?  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
Tomas Vondra <tv@fuzzy.cz> writes:
> On 25.12.2014 22:40, Tom Lane wrote:
>> I think that hamster has basically got a tin can and string for an I/O
>> subsystem.  It's not real clear to me whether there's actually been an
>> increase in "wait timeout" failures recently; somebody would have to
>> go through and count them before I'd have much faith in that thesis.

> That's what I did. On hamster I see this (in the HEAD):

> 2014-12-25 16:00:07 yes
> 2014-12-24 16:00:07 yes
> 2014-12-23 16:00:07 yes
> 2014-12-22 16:00:07 yes
> 2014-12-19 16:00:07 yes
> 2014-12-15 16:00:11 no
> 2014-10-25 16:00:06 no
> 2014-10-24 16:00:06 no
> 2014-10-23 16:00:06 no
> 2014-10-22 16:00:06 no
> 2014-10-21 16:00:07 no
> 2014-10-19 16:00:06 no
> 2014-09-28 16:00:06 no
> 2014-09-26 16:00:07 no
> 2014-08-28 16:00:06 no
> 2014-08-12 16:00:06 no
> 2014-08-05 22:04:48 no
> 2014-07-19 01:53:30 no
> 2014-07-06 16:00:06 no
> 2014-07-04 16:00:06 no
> 2014-06-29 16:00:06 no
> 2014-05-09 16:00:04 no
> 2014-05-07 16:00:04 no
> 2014-05-04 16:00:04 no
> 2014-04-28 16:00:04 no
> 2014-04-18 16:00:04 no
> 2014-04-04 16:00:04 no

> (where "yes" means "pgstat wait timeout" is in the logs). On chipmunk,
> the trend is much less convincing (but there are far fewer failures, and
> only 3 of them failed because of the "pgstat wait timeout").

mereswine's history is also pretty interesting in this context.  That
series makes it look like the probability of "pgstat wait timeout" took
a big jump around the beginning of December, especially if you make the
unproven-but-not-unreasonable assumption that the two pg_upgradecheck
failures since then were also wait timeout failures.  That's close enough
after commit 88fc71926392115c (Nov 19) to make me suspect that that was
what put us over the edge: that added a bunch more I/O *and* a bunch more
statistics demands to this one block of parallel tests.
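
As a rough check of that timing against the hamster listing quoted above,
here is a minimal Python sketch that splits the listed runs at the commit
date and counts how many show "pgstat wait timeout" on each side.  The dates
and yes/no flags are copied from the listing; the helper names and the choice
of Nov 19 as the cut-off are assumptions made for illustration only.

```python
from datetime import date

# Date and "pgstat wait timeout seen?" flag, copied from the hamster listing
# quoted earlier in this message (run times dropped).
LISTING = """
2014-12-25 yes
2014-12-24 yes
2014-12-23 yes
2014-12-22 yes
2014-12-19 yes
2014-12-15 no
2014-10-25 no
2014-10-24 no
2014-10-23 no
2014-10-22 no
2014-10-21 no
2014-10-19 no
2014-09-28 no
2014-09-26 no
2014-08-28 no
2014-08-12 no
2014-08-05 no
2014-07-19 no
2014-07-06 no
2014-07-04 no
2014-06-29 no
2014-05-09 no
2014-05-07 no
2014-05-04 no
2014-04-28 no
2014-04-18 no
2014-04-04 no
"""

# Assumed cut-off: the Nov 19 date of commit 88fc71926392115c.
CUTOFF = date(2014, 11, 19)


def parse_listing(text):
    """Turn 'YYYY-MM-DD yes|no' lines into (date, bool) pairs."""
    runs = []
    for line in text.split("\n"):
        if line.strip():
            day, flag = line.split()
            runs.append((date.fromisoformat(day), flag == "yes"))
    return runs


def tally(runs, cutoff):
    """Return {'before': (timeouts, total), 'after': (timeouts, total)}."""
    counts = {"before": [0, 0], "after": [0, 0]}
    for day, timed_out in runs:
        key = "after" if day >= cutoff else "before"
        counts[key][0] += int(timed_out)
        counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}


if __name__ == "__main__":
    for period, (hits, total) in tally(parse_listing(LISTING), CUTOFF).items():
        print(f"{period} cut-off: {hits} of {total} listed runs had the timeout")
```

On that listing the split is 0 of 21 listed runs before the cut-off and 5 of
6 after it.  That covers only the runs in the listing and says nothing about
the cause.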

But even if we are vastly overstressing the I/O subsystem on these boxes,
why is it manifesting like this?  pgstat never fsyncs the stats temp file,
so it should not have to wait for physical I/O I'd think.  Or perhaps the
file rename() operations get fsync'd behind the scenes by the filesystem?
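
For anyone who wants to poke at that question on one of these boxes, here is
a minimal sketch of the write-to-temp-file-then-rename() pattern with no
fsync, timing the write and the rename separately.  It is Python, not the
actual pgstat C code; the function name, the ".tmp" suffix, the file name and
the payload size are all made up for illustration.  If the rename step is
what stalls on a slow filesystem, it should show up in the second number.

```python
import os
import tempfile
import time


def write_snapshot(path, payload, durable=False):
    """Write payload to a temp file, then rename it over path.

    Returns the seconds spent in the write step and in the rename step.
    With durable=False (the default) no fsync is issued, which mirrors the
    "never fsyncs the temp file" behaviour described above.
    """
    tmp_path = path + ".tmp"  # hypothetical naming, not pgstat's
    start = time.perf_counter()
    with open(tmp_path, "wb") as f:
        f.write(payload)
        if durable:
            # Optional comparison case: force the data to disk first.
            f.flush()
            os.fsync(f.fileno())
    written = time.perf_counter()
    os.replace(tmp_path, path)  # atomic rename over the old file
    renamed = time.perf_counter()
    return {"write": written - start, "rename": renamed - written}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "stats.snapshot")
        for _ in range(5):
            print(write_snapshot(target, b"x" * 1_000_000))
```

Running it in a directory on the suspect filesystem (rather than the default
temp directory used here) and comparing against durable=True would give a
rough idea of how much the filesystem itself adds to the rename.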
        regards, tom lane


