Thread: [HACKERS] Continuous buildfarm failures on hamster with bin-check

[HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
Hi all,

Some of you may have noticed that hamster is heavily red on the
buildfarm. I have done a bit of investigation, and I am able to
reproduce the failure manually. Looking at the logs, the cause of the
error shows up clearly:
2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
accept connections
2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
instead of current ones because stats collector is not responding
2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
terminating walsender process due to replication timeout
2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
instead of current ones because stats collector is not responding

Stale statistics mean that the system is so constrained that things are
timing out.

In order to avoid such failures with the normal regression tests, I have
set up extra_config so that stats_temp_directory points to a tmpfs,
which avoids the stale statistics, but it is not possible to do the same
for the TAP tests. I could always disable --enable-tap-tests on this
machine, but I don't think that is the correct answer. Something that
could be done is to use an environment variable to set a base directory
for all the nodes created by PostgresNode.pm, and use a custom folder in
it for the temporary statistics. But that does not scale if we want to
enforce more parameters.
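
For reference, the extra_config bit mentioned above looks roughly like
this in the animal's build-farm.conf; the tmpfs mount point is only an
example, not the actual path used on hamster:

    extra_config => {
        DEFAULT => [
            # Illustrative mount point; any tmpfs directory will do.
            q(stats_temp_directory = '/mnt/ramdisk/pgstat'),
        ],
    },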

Thoughts?
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Andres Freund
Date:
Hi,

On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. But actually after looking at the logs
> the error has obviously showed up:
> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 
> Stale regressions means that the system is just constrained so much
> that things are timing out.
> 
> In order to avoid such failures with normal regression tests, I have
> set up extra_config so as stats_temp_directory goes to a tmpfs to
> avoid stale statistics

How high do you need to make the hardcoded limit for this to succeed
without a tmpfs?  If hamster fails this regularly I think we have to do
something about it, rather than paper over it.  What's the storage
situation currently like?

Greetings,

Andres Freund



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
On Tue, Apr 18, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
>> Some of you may have noticed that hamster is heavily red on the
>> buildfarm. I have done a bit of investigation, and I am able to
>> reproduce the failure manually. But actually after looking at the logs
>> the error has obviously showed up:
>> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
>> accept connections
>> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
>> instead of current ones because stats collector is not responding
>> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
>> terminating walsender process due to replication timeout
>> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
>> instead of current ones because stats collector is not responding
>>
>> Stale regressions means that the system is just constrained so much
>> that things are timing out.
>>
>> In order to avoid such failures with normal regression tests, I have
>> set up extra_config so as stats_temp_directory goes to a tmpfs to
>> avoid stale statistics
>
> How high do you need to make the hardcoded limit for this to succeed
> without a tmpfs?

Increasing wal_sender_timeout visibly reduces the failure rate: out of
10 attempts I saw at least 3 failures before the change, and none after
it.

> If hamster fails this regularly I think we have to do
> something about it, rather than paper over it.  What's the storage
> situation currently like?

The SD card of this RPI is half-empty.
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Andrew Dunstan
Date:

On 04/18/2017 03:07 AM, Michael Paquier wrote:
> Hi all,
>
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. But actually after looking at the logs
> the error has obviously showed up:
> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
>
> Stale regressions means that the system is just constrained so much
> that things are timing out.
>
> In order to avoid such failures with normal regression tests, I have
> set up extra_config so as stats_temp_directory goes to a tmpfs to
> avoid stale statistics,


Yeah, but the way you have done it could also lead to errors unless
you're very careful, as I found on axolotl (which died recently,
unfortunately). There I had to set stats_temp_directory to a
branch-specific name so that a crash on branch A didn't leave stats
behind to be picked up by a run on branch B. I had a script, run before
every buildfarm run, that made sure the branch-specific directories
existed on the tmpfs.
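
Something along these lines, with the mount point and the branch list
invented purely for illustration:

    #!/usr/bin/perl
    # Pre-run hook: make sure each branch has its own stats_temp_directory
    # on the tmpfs, so a crash on one branch can't leave stats behind that
    # another branch would pick up.
    use strict;
    use warnings;
    use File::Path qw(make_path);

    my $tmpfs = '/mnt/ramdisk';
    foreach my $branch (qw(HEAD REL9_6_STABLE REL9_5_STABLE))
    {
        make_path("$tmpfs/pgstat-$branch");
    }

Each branch's extra_config then points stats_temp_directory at its own
pgstat-<branch> directory.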



>  but it is not possible to set up that with the
> TAP tests. I could always disable --enable-tap-tests on this machine
> but I don't think that this is a correct answer. Something that could
> be done is to use an environment variable to set up a base directory
> for all the nodes created in PostgresNode.pm, and use that for
> temporary statistics with a custom folder name. But that's not
> scalable if we want to enforce more parameter.
>
>

What more parameters do you want?

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> Yeah, but the way you have done it could also to lead to errors unless
> you're very careful, as I found on axolotl (which died recently,
> unfortunately). There I had to set the stats_temp directory to a
> branch-specific name so a crash on branch A didn't leave stats picked up
> by a run on branch B. I had a script that ran before every buildfarm run
> that made sure the branch-specific directories existed on the tmpfs.

I hadn't considered that. Still, hamster is set up to run only master,
so that's not a problem here. Running more branches would be too much
for this little dear, as the runs include the TAP tests.

>>  but it is not possible to set up that with the
>> TAP tests. I could always disable --enable-tap-tests on this machine
>> but I don't think that this is a correct answer. Something that could
>> be done is to use an environment variable to set up a base directory
>> for all the nodes created in PostgresNode.pm, and use that for
>> temporary statistics with a custom folder name. But that's not
>> scalable if we want to enforce more parameter.
>
> What more parameters do you want?

Increasing wal_sender_timeout and wal_receiver_timeout can help reduce
the failures seen.
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Andrew Dunstan
Date:

On 04/18/2017 08:23 AM, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, but the way you have done it could also to lead to errors unless
>> you're very careful, as I found on axolotl (which died recently,
>> unfortunately). There I had to set the stats_temp directory to a
>> branch-specific name so a crash on branch A didn't leave stats picked up
>> by a run on branch B. I had a script that ran before every buildfarm run
>> that made sure the branch-specific directories existed on the tmpfs.
> I didn't consider that. Still hamster is only set to run master so
> that's not a problem. Running more branches would be too much for this
> little dear as it includes TAP tests.
>
>>>  but it is not possible to set up that with the
>>> TAP tests. I could always disable --enable-tap-tests on this machine
>>> but I don't think that this is a correct answer. Something that could
>>> be done is to use an environment variable to set up a base directory
>>> for all the nodes created in PostgresNode.pm, and use that for
>>> temporary statistics with a custom folder name. But that's not
>>> scalable if we want to enforce more parameter.
>> What more parameters do you want?
> Increasing wal_sender_timeout and wal_receiver_timeout can help in
> reducing the failures seen.


OK, but you're only talking about a handful of these, right?

Let's say we have a bunch of possible environment settings with names
that all begin with "PG_TAP_". PostgresNode.pm could check for the
existence of these and take action accordingly, and you could set them
for a buildfarm animal in its config file, or in your .profile for
interactive use.
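
As a sketch only (the names and the mapping are invented here, nothing
of this exists), PostgresNode.pm could do something like this when a
node is initialized:

    # Hypothetical: map PG_TAP_* environment variables onto GUCs appended
    # to the node's postgresql.conf, e.g. PG_TAP_WAL_SENDER_TIMEOUT=180s
    # becomes "wal_sender_timeout = '180s'".
    sub _apply_tap_env_settings
    {
        my ($self) = @_;
        my $conf = $self->data_dir . '/postgresql.conf';

        open my $fh, '>>', $conf or die "could not open $conf: $!";
        foreach my $var (grep { /^PG_TAP_/ } keys %ENV)
        {
            (my $guc = lc $var) =~ s/^pg_tap_//;
            print $fh "$guc = '$ENV{$var}'\n";
        }
        close $fh;
    }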

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
>
> On 04/18/2017 08:23 AM, Michael Paquier wrote:
>> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
>> <andrew.dunstan@2ndquadrant.com> wrote:
>>> Yeah, but the way you have done it could also to lead to errors unless
>>> you're very careful, as I found on axolotl (which died recently,
>>> unfortunately). There I had to set the stats_temp directory to a
>>> branch-specific name so a crash on branch A didn't leave stats picked up
>>> by a run on branch B. I had a script that ran before every buildfarm run
>>> that made sure the branch-specific directories existed on the tmpfs.
>> I didn't consider that. Still hamster is only set to run master so
>> that's not a problem. Running more branches would be too much for this
>> little dear as it includes TAP tests.
>>
>>>>  but it is not possible to set up that with the
>>>> TAP tests. I could always disable --enable-tap-tests on this machine
>>>> but I don't think that this is a correct answer. Something that could
>>>> be done is to use an environment variable to set up a base directory
>>>> for all the nodes created in PostgresNode.pm, and use that for
>>>> temporary statistics with a custom folder name. But that's not
>>>> scalable if we want to enforce more parameter.
>>> What more parameters do you want?
>> Increasing wal_sender_timeout and wal_receiver_timeout can help in
>> reducing the failures seen.
>
> OK, but you're only talking about a handful of these, right?

Yup, that would be one solution, but it does not attack the problem at
its root.

> Lets's say we have a bunch of possible environment settings with names
> that all begin with "PG_TAP_" PostgresNode.pm could check for the
> existence of these and take action accordingly, and you could set them
> on a buildfarm animal in the config file, or for interactive use in your
> .profile.

That's the point I am trying to make upthread: slow buildfarm animals
should have minimal impact on core code. We could, for example, have one
environment variable that lists all the parameters to modify as a single
string and append them at the end of postgresql.conf. But honestly I
don't think that is necessary if there is a single variable defining a
base directory for temporary statistics, as that is where the real
bottleneck is, at least in the case of hamster. When initializing a node
via PostgresNode.pm, we would just check for this variable, and the
init() routine would create a temporary folder inside it and set
stats_temp_directory accordingly in postgresql.conf.
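
To make it concrete, here is roughly what I have in mind;
PG_STATS_TEMP_BASEDIR is a hypothetical variable name, and none of this
exists yet:

    # Sketch for PostgresNode::init(), once postgresql.conf has been
    # written.  PG_STATS_TEMP_BASEDIR would typically point at a tmpfs.
    # (Assumes File::Temp is available in the module.)
    if (defined $ENV{PG_STATS_TEMP_BASEDIR})
    {
        my $stats_dir = File::Temp::tempdir(
            $self->name . '_stats_XXXX',
            DIR     => $ENV{PG_STATS_TEMP_BASEDIR},
            CLEANUP => 1);
        $self->append_conf('postgresql.conf',
            "stats_temp_directory = '$stats_dir'");
    }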
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Andrew Dunstan
Date:

On 04/18/2017 08:59 AM, Michael Paquier wrote:
>> Lets's say we have a bunch of possible environment settings with names
>> that all begin with "PG_TAP_" PostgresNode.pm could check for the
>> existence of these and take action accordingly, and you could set them
>> on a buildfarm animal in the config file, or for interactive use in your
>> .profile.
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine just
> creates a temporary folder in it, setting up temp_stats_path in
> postgresql.conf.



How is that going to deal with your wal_*_timeout etc settings?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster.

FWIW, I'm a bit suspicious of relocating the temp stats directory as
being a reliable fix for this.  It looks to me like hamster isn't that
much slower than gaur/pademelon, ie the same machine that was my primary
development machine for well over a decade, and on which I have NEVER
seen a "stats collector not responding" failure.  Plus, if hamster's
main storage is SSD, that seems unlikely to be slower than the spinning
rust in gaur/pademelon.  So I can't escape the suspicion that something
else is going on there.  Seemingly-unexplainable stats collector problems
have been a bugaboo for a long time ...
        regards, tom lane



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Alvaro Herrera
Date:
Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
> > That's the point I am trying to make upthread: slow buildfarm animals
> > should have minimal impact on core code modifications. We could for
> > example have one environment variable that lists all the parameters to
> > modify in a single string and appends them at the end of
> > postgresql.conf. But honestly I don't think that this is necessary if
> > there is only one variable able to define a base directory for
> > temporary statistics as the real bottleneck comes from there at least
> > in the case of hamster.
> 
> FWIW, I'm a bit suspicious of relocating the temp stats directory as
> being a reliable fix for this.  It looks to me like hamster isn't that
> much slower than gaur/pademelon, ie the same machine that was my primary
> development machine for well over a decade, and on which I have NEVER
> seen a "stats collector not responding" failure.  Plus, if hamster's
> main storage is SSD, that seems unlikely to be slower than the spinning
> rust in gaur/pademelon.  So I can't escape the suspicion that something
> else is going on there.  Seemingly-unexplainable stats collector problems
> have been a bugaboo for a long time ...

It's an SD card (the kind typically used in cameras and phones), not SSD.
Saying it's slow is an understatement.  It's *excruciatingly* slow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>> being a reliable fix for this.

> It's an SD card (the kind typically used in cameras and phones), not SSD.
> Saying it's slow is an understatement.  It's *excruciatingly* slow.

Oh, I misread it ... but still, the modern definition of "excruciatingly
slow" doesn't seem all that far off what 90s-era hard drives could do.
It is clear from googling though that there's an enormous performance
range in SD cards' random write performance, eg wikipedia's entry has
a link to

http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/

Seems like it's hard to judge this without knowing exactly which
SD card Michael has got in that thing.
        regards, tom lane



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>> being a reliable fix for this.
>
>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>> Saying it's slow is an understatement.  It's *excruciatingly* slow.
>
> Oh, I misread it ... but still, the modern definition of "excruciatingly
> slow" doesn't seem all that far off what 90s-era hard drives could do.
> It is clear from googling though that there's an enormous performance
> range in SD cards' random write performance, eg wikipedia's entry has
> a link to
>
> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>
> Seems like it's hard to judge this without knowing exactly which
> SD card Michael has got in that thing.

It's a Professional microSDHC 8GB card, class 10, which is utterly slow
in everything I have tested up to now, and of course I cannot find its
read/write specifications. I had a SanDisk class 10 card in it that died
some months ago, and that was way faster. RIP.
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Michael Paquier
Date:
On Wed, Apr 19, 2017 at 7:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Tom Lane wrote:
>>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>>> being a reliable fix for this.
>>
>>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>>> Saying it's slow is an understatement.  It's *excruciatingly* slow.
>>
>> Oh, I misread it ... but still, the modern definition of "excruciatingly
>> slow" doesn't seem all that far off what 90s-era hard drives could do.
>> It is clear from googling though that there's an enormous performance
>> range in SD cards' random write performance, eg wikipedia's entry has
>> a link to
>>
>> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>>
>> Seems like it's hard to judge this without knowing exactly which
>> SD card Michael has got in that thing.
>
> Professional micro SD HC 8GB, with class 10, which is utterly slow in
> the things I have tested up to now and of course I cannot find its
> read/write properties. I had a SanDisk class 10 in it that died some
> months ago, and that was way faster. RIP.

Until something happens here, I have removed --enable-tap-tests from
this buildfarm machine's configuration. There is no point in leaving it
red without a clear resolution plan on the table first.
-- 
Michael



Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

From: Noah Misch
Date:
On Tue, Apr 18, 2017 at 09:59:26PM +0900, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > On 04/18/2017 08:23 AM, Michael Paquier wrote:
> >> Increasing wal_sender_timeout and wal_receiver_timeout can help in
> >> reducing the failures seen.
> >
> > OK, but you're only talking about a handful of these, right?
> 
> Yup, that would be one solution but that's not attacking the problem
> at its root.
> 
> > Lets's say we have a bunch of possible environment settings with names
> > that all begin with "PG_TAP_" PostgresNode.pm could check for the
> > existence of these and take action accordingly, and you could set them
> > on a buildfarm animal in the config file, or for interactive use in your
> > .profile.
> 
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine just
> creates a temporary folder in it, setting up temp_stats_path in
> postgresql.conf.

Each of the above approaches has fairly low impact on the code, so we should
use other criteria to choose.  I'd welcome a feature for augmenting every
postgresql.conf of every test suite (a generalization of "pg_regress
--temp-config", which has proven its value).  I can envision using it with
force_parallel_mode, default_transaction_isolation, log_*, wal_*_timeout,
autovacuum_naptime, and others.
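
For illustration only, if such a knob took the form of an environment
variable pointing at a file of extra GUCs, an animal's build-farm.conf
could carry something like the following; the TEMP_CONFIG name just
echoes the --temp-config idea above and is not an existing TAP-test
hook, and the path and file contents are made up:

    build_env => {
        # Hypothetical: every test suite would append this file to its
        # postgresql.conf, TAP tests included.
        TEMP_CONFIG => '/home/buildfarm/extra.conf',
    },
    # where extra.conf might contain, for instance:
    #   wal_sender_timeout = '180s'
    #   wal_receiver_timeout = '180s'
    #   autovacuum_naptime = '10min'
    #   default_transaction_isolation = 'serializable'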

Even for hamster, I'm skeptical that changing stats_temp_directory would
suffice.  Every hamster BinInstallCheck failure since 2017-02-13 had a "LOG:
terminating walsender process due to replication timeout".  Most, but not all,
of those replication timeouts followed a "LOG:  using stale statistics instead
of current ones because stats collector is not responding".  For the remaining
minority, I expect we will eventually need a larger wal_sender_timeout.  Example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2017-02-24%2016%3A00%3A06