From: Tomas Vondra
Subject: Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Msg-id: 549C8B62.4030201@fuzzy.cz
In response to: Re: Better way of dealing with pgstat wait timeout during buildfarm runs? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Better way of dealing with pgstat wait timeout during buildfarm runs? (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 25.12.2014 22:40, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> The strange thing is that the split happened ~2 years ago, which is
>> inconsistent with the sudden increase of this kind of issues. So maybe
>> something changed on that particular animal (a failing SD card causing
>> I/O stalls, perhaps)?
> 
> I think that hamster has basically got a tin can and string for an I/O
> subsystem.  It's not real clear to me whether there's actually been an

Yes. It's called "SD card".

> increase in "wait timeout" failures recently; somebody would have to
> go through and count them before I'd have much faith in that thesis.

That's what I did. On hamster I see this (on HEAD):

2014-12-25 16:00:07 yes
2014-12-24 16:00:07 yes
2014-12-23 16:00:07 yes
2014-12-22 16:00:07 yes
2014-12-19 16:00:07 yes
2014-12-15 16:00:11 no
2014-10-25 16:00:06 no
2014-10-24 16:00:06 no
2014-10-23 16:00:06 no
2014-10-22 16:00:06 no
2014-10-21 16:00:07 no
2014-10-19 16:00:06 no
2014-09-28 16:00:06 no
2014-09-26 16:00:07 no
2014-08-28 16:00:06 no
2014-08-12 16:00:06 no
2014-08-05 22:04:48 no
2014-07-19 01:53:30 no
2014-07-06 16:00:06 no
2014-07-04 16:00:06 no
2014-06-29 16:00:06 no
2014-05-09 16:00:04 no
2014-05-07 16:00:04 no
2014-05-04 16:00:04 no
2014-04-28 16:00:04 no
2014-04-18 16:00:04 no
2014-04-04 16:00:04 no

(where "yes" means "pgstat wait timeout" is in the logs). On chipmunk,
the trend is much less convincing (but there's much less failures, and
only 3 of them failed because of the "pgstat wait timeout").

However, it's worth mentioning that all the pgstat failures happened at
"16:00:07", while most of the older failures have earlier timestamps. So it
may be that those older runs were failing for some other reason, and the
pgstat timeout simply never had a chance to happen. OTOH, there's a fair
number of successful runs too.
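
For the record, here's roughly how I produced that tally, just a rough
sketch assuming a hypothetical local mirror of the buildfarm logs (the
directory layout and file names below are made up, not the real buildfarm
client paths):

#!/usr/bin/env python3
# Tally buildfarm runs and flag which ones logged "pgstat wait timeout".
# Assumes a hypothetical local mirror where each run is a directory named
# by its timestamp and contains the captured postmaster/regression log.
import os
import re
import sys

log_dir = sys.argv[1] if len(sys.argv) > 1 else "hamster-logs"   # hypothetical
pattern = re.compile(r"pgstat wait timeout")

for run in sorted(os.listdir(log_dir), reverse=True):
    log_path = os.path.join(log_dir, run, "postmaster.log")      # hypothetical name
    if not os.path.isfile(log_path):
        continue
    with open(log_path, errors="replace") as f:
        hit = any(pattern.search(line) for line in f)
    print(run, "yes" if hit else "no")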


> However, I notice that at least the last few occurrences on "hamster"
> all seem to have been in this parallel block:
> 
> test: brin gin gist spgist privileges security_label collate matview
> lock replica_identity rowsecurity object_address
> 
> which as recently as 9.4 contained just these tests:
> 
> test: privileges security_label collate matview lock replica_identity
> 
> I'm fairly sure that the index-related tests in this batch are I/O 
> intensive, and since they were not there at all six months ago, it's
> not hard to believe that this block of tests has far greater I/O
> demands than used to exist. Draw your own conclusions ...

Yes, that might be the culprit here. It would be interesting to know what's
happening on the machine while the tests are running, to confirm this
hypothesis.
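
If it helps, here's a minimal sketch of the kind of monitoring I have in
mind: sample the per-device counters from /proc/diskstats every few seconds
while the regression tests run, so I/O stalls show up as intervals where
almost nothing completes. The device name and the sampling interval are just
placeholders, of course:

#!/usr/bin/env python3
# Sample /proc/diskstats while the regression tests run, to spot I/O stalls.
# The device name and the interval are placeholders; adjust for the animal.
import time

DEVICE = "mmcblk0"    # presumably the SD card on hamster
INTERVAL = 5          # seconds between samples

def read_counters(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # reads completed, writes completed, ms spent doing I/O
                return int(fields[3]), int(fields[7]), int(fields[12])
    raise SystemExit("device %s not found" % device)

prev = read_counters(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = read_counters(DEVICE)
    print("%s reads/s=%.1f writes/s=%.1f util%%=%.0f" % (
        time.strftime("%H:%M:%S"),
        (cur[0] - prev[0]) / float(INTERVAL),
        (cur[1] - prev[1]) / float(INTERVAL),
        (cur[2] - prev[2]) / (INTERVAL * 10.0)))
    prev = cur

Running that next to the problematic parallel group should make it pretty
obvious whether the card simply stops completing requests for seconds at a
time.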

regards
Tomas


