Re: Better way of dealing with pgstat wait timeout during buildfarm runs? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Msg-id 54BE7D70.7050606@2ndquadrant.com
In response to Re: Better way of dealing with pgstat wait timeout during buildfarm runs?  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
List pgsql-hackers
On 25.12.2014 22:28, Tomas Vondra wrote:
> On 25.12.2014 21:14, Andres Freund wrote:
>
>> That's indeed odd. Seems to have been lost when the statsfile was
>> split into multiple files. Alvaro, Tomas?
> 
> The goal was to keep the logic as close to the original as possible.
> IIRC there were "pgstat wait timeout" issues before, and in most cases
> the conclusion was that it's probably because of overloaded I/O.
> 
> But maybe there actually was another bug, and it's entirely possible
> that the split introduced a new one, and that's what we're seeing now.
> The strange thing is that the split happened ~2 years ago, which is
> inconsistent with the sudden increase of this kind of issues. So maybe
> something changed on that particular animal (a failing SD card causing
> I/O stalls, perhaps)?
> 
> Anyway, I happen to have a spare Raspberry PI, so I'll try to reproduce
> and analyze the issue locally. But that won't happen until January.

I've tried to reproduce this on my Raspberry Pi, and the failure is not
difficult to trigger: about 7 out of 10 'make check' runs fail with
'pgstat wait timeout'.

All the occurrences I've seen were right after some sort of VACUUM
(sometimes plain, sometimes ANALYZE or FREEZE), and the I/O at the time
looked something like this:

Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz     await  r_await   w_await   svctm   %util
mmcblk0    0.00   75.00  0.00  8.00   0.00  36.00      9.00      5.73  15633.75     0.00  15633.75  125.00  100.00

So pretty terrible (this is a Class 4 SD card, supposedly able to handle
4 MB/s). If hamster had a faulty SD card, it might have been much worse,
I guess.
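
To put the iostat sample above into perspective, here is a rough Python
sketch of the arithmetic. The figures are copied from the mmcblk0 row; the
variable names and the summarize() helper are only for illustration, and
reading "4 MB/s" as 4096 kB/s is my assumption.

```python
# Figures copied from the iostat sample for device mmcblk0.
WRITES_PER_S = 8.00          # w/s
WRITE_KB_PER_S = 36.00       # wkB/s
AVG_REQUEST_SECTORS = 9.00   # avgrq-sz (iostat reports this in sectors)
AWAIT_MS = 15633.75          # await

SECTOR_BYTES = 512           # sector size used by iostat for avgrq-sz
NOMINAL_KB_PER_S = 4 * 1024  # assumption: "4 MB/s" taken as 4096 kB/s


def summarize():
    """Derive a few easier-to-read numbers from the raw iostat columns."""
    return {
        # Average write size, computed two independent ways as a sanity check.
        "avg_write_kb": WRITE_KB_PER_S / WRITES_PER_S,
        "avg_request_kb": AVG_REQUEST_SECTORS * SECTOR_BYTES / 1024,
        # Observed write throughput as a fraction of the nominal card speed.
        "share_of_nominal": WRITE_KB_PER_S / NOMINAL_KB_PER_S,
        # Average time a request spent queued plus being serviced, in seconds.
        "await_s": AWAIT_MS / 1000,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

Both ways of computing the request size give 4.5 kB, the card is delivering
under 1% of its nominal write throughput, and an average request waits more
than 15 seconds.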

This of course does not prove the absence of a bug - I plan to dig into
this a bit more. Feel free to point out some suspicious scenarios that
might be worth reproducing and analyzing.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


