From: Tomas Vondra
Subject: Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Msg-id: 549C8B62.4030201@fuzzy.cz
In response to: Re: Better way of dealing with pgstat wait timeout during buildfarm runs? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Better way of dealing with pgstat wait timeout during buildfarm runs? (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 25.12.2014 22:40, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> The strange thing is that the split happened ~2 years ago, which is
>> inconsistent with the sudden increase of this kind of issues. So maybe
>> something changed on that particular animal (a failing SD card causing
>> I/O stalls, perhaps)?
> 
> I think that hamster has basically got a tin can and string for an I/O
> subsystem.  It's not real clear to me whether there's actually been an

Yes. It's called "SD card".

> increase in "wait timeout" failures recently; somebody would have to
> go through and count them before I'd have much faith in that thesis.

That's what I did. On hamster I see this (on HEAD):

2014-12-25 16:00:07 yes
2014-12-24 16:00:07 yes
2014-12-23 16:00:07 yes
2014-12-22 16:00:07 yes
2014-12-19 16:00:07 yes
2014-12-15 16:00:11 no
2014-10-25 16:00:06 no
2014-10-24 16:00:06 no
2014-10-23 16:00:06 no
2014-10-22 16:00:06 no
2014-10-21 16:00:07 no
2014-10-19 16:00:06 no
2014-09-28 16:00:06 no
2014-09-26 16:00:07 no
2014-08-28 16:00:06 no
2014-08-12 16:00:06 no
2014-08-05 22:04:48 no
2014-07-19 01:53:30 no
2014-07-06 16:00:06 no
2014-07-04 16:00:06 no
2014-06-29 16:00:06 no
2014-05-09 16:00:04 no
2014-05-07 16:00:04 no
2014-05-04 16:00:04 no
2014-04-28 16:00:04 no
2014-04-18 16:00:04 no
2014-04-04 16:00:04 no

(where "yes" means "pgstat wait timeout" is in the logs). On chipmunk,
the trend is much less convincing (but there's much less failures, and
only 3 of them failed because of the "pgstat wait timeout").

However, it's worth mentioning that all the pgstat failures happened at
"16:00:07", while most of the older failures have earlier timestamps. So it
may be that those older runs were failing for some other reason, and the
pgstat timeout simply never had a chance to happen. OTOH, there's a fair
number of successful runs too.
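
For the record, here's roughly how I produced that tally, just a rough
sketch assuming a hypothetical local mirror of the buildfarm logs (the
directory layout and file names below are made up, not the real buildfarm
client paths):

#!/usr/bin/env python3
# Tally buildfarm runs and flag which ones logged "pgstat wait timeout".
# Assumes a hypothetical local mirror where each run is a directory named
# by its timestamp and contains the captured postmaster/regression log.
import os
import re
import sys

log_dir = sys.argv[1] if len(sys.argv) > 1 else "hamster-logs"   # hypothetical
pattern = re.compile(r"pgstat wait timeout")

for run in sorted(os.listdir(log_dir), reverse=True):
    log_path = os.path.join(log_dir, run, "postmaster.log")      # hypothetical name
    if not os.path.isfile(log_path):
        continue
    with open(log_path, errors="replace") as f:
        hit = any(pattern.search(line) for line in f)
    print(run, "yes" if hit else "no")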


> However, I notice that at least the last few occurrences on "hamster"
> all seem to have been in this parallel block:
> 
> test: brin gin gist spgist privileges security_label collate matview
> lock replica_identity rowsecurity object_address
> 
> which as recently as 9.4 contained just these tests:
> 
> test: privileges security_label collate matview lock replica_identity
> 
> I'm fairly sure that the index-related tests in this batch are I/O 
> intensive, and since they were not there at all six months ago, it's
> not hard to believe that this block of tests has far greater I/O
> demands than used to exist. Draw your own conclusions ...

Yes, that might be the culprit here. It would be interesting to know what's
happening on the machine while the tests are running, to confirm this
hypothesis.
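
If it helps, here's a minimal sketch of the kind of monitoring I have in
mind: sample the per-device counters from /proc/diskstats every few seconds
while the regression tests run, so I/O stalls show up as intervals where
almost nothing completes. The device name and the sampling interval are just
placeholders, of course:

#!/usr/bin/env python3
# Sample /proc/diskstats while the regression tests run, to spot I/O stalls.
# The device name and the interval are placeholders; adjust for the animal.
import time

DEVICE = "mmcblk0"    # presumably the SD card on hamster
INTERVAL = 5          # seconds between samples

def read_counters(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # reads completed, writes completed, ms spent doing I/O
                return int(fields[3]), int(fields[7]), int(fields[12])
    raise SystemExit("device %s not found" % device)

prev = read_counters(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = read_counters(DEVICE)
    print("%s reads/s=%.1f writes/s=%.1f util%%=%.0f" % (
        time.strftime("%H:%M:%S"),
        (cur[0] - prev[0]) / float(INTERVAL),
        (cur[1] - prev[1]) / float(INTERVAL),
        (cur[2] - prev[2]) / (INTERVAL * 10.0)))
    prev = cur

Running that next to the problematic parallel group should make it pretty
obvious whether the card simply stops completing requests for seconds at a
time.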

regards
Tomas


