Thread: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Hi all,

Some of you may have noticed that hamster is heavily red on the
buildfarm. I have done a bit of investigation and can reproduce the
failure manually, but the cause is already obvious from the logs:

2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to accept connections
2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics instead of current ones because stats collector is not responding
2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG: terminating walsender process due to replication timeout
2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics instead of current ones because stats collector is not responding

The stale statistics mean that the system is so constrained that things
are timing out.

To avoid such failures in the normal regression tests, I have set up
extra_config so that stats_temp_directory points to a tmpfs, which
avoids the stale statistics, but it is not possible to do that with the
TAP tests. I could always disable --enable-tap-tests on this machine,
but I don't think that is the right answer. One thing that could be
done is to use an environment variable defining a base directory for
all the nodes created in PostgresNode.pm, and use a custom folder under
it for temporary statistics. But that does not scale if we want to
enforce more parameters.

Thoughts?
--
Michael
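For illustration, a minimal sketch of that idea, assuming a hypothetical
PG_TEST_STATS_TMPDIR environment variable (the variable name and its
exact placement inside init() are assumptions, not an existing knob):

    # In PostgresNode.pm's init(), a node could relocate its stats temp
    # directory when the (hypothetical) base-directory variable is set.
    use File::Temp ();

    if (defined $ENV{PG_TEST_STATS_TMPDIR})
    {
        # One private subdirectory per node, created under the
        # caller-provided base directory (typically a tmpfs mount).
        my $statsdir = File::Temp::tempdir(
            DIR     => $ENV{PG_TEST_STATS_TMPDIR},
            CLEANUP => 1);
        $self->append_conf('postgresql.conf',
            "stats_temp_directory = '$statsdir'");
    }

A buildfarm animal would then just export that variable, pointing at a
tmpfs mount, before running the TAP tests.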
Hi,

On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation and can reproduce the
> failure manually, but the cause is already obvious from the logs:
>
> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
> instead of current ones because stats collector is not responding
>
> The stale statistics mean that the system is so constrained that
> things are timing out.
>
> To avoid such failures in the normal regression tests, I have set up
> extra_config so that stats_temp_directory points to a tmpfs, which
> avoids the stale statistics

How high do you need to make the hardcoded limit for this to succeed
without a tmpfs?

If hamster fails this regularly I think we have to do something about
it, rather than paper over it. What's the storage situation currently
like?

Greetings,

Andres Freund
On Tue, Apr 18, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
>> Some of you may have noticed that hamster is heavily red on the
>> buildfarm. I have done a bit of investigation and can reproduce the
>> failure manually, but the cause is already obvious from the logs:
>>
>> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
>> accept connections
>> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
>> instead of current ones because stats collector is not responding
>> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
>> terminating walsender process due to replication timeout
>> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
>> instead of current ones because stats collector is not responding
>>
>> The stale statistics mean that the system is so constrained that
>> things are timing out.
>>
>> To avoid such failures in the normal regression tests, I have set up
>> extra_config so that stats_temp_directory points to a tmpfs, which
>> avoids the stale statistics
>
> How high do you need to make the hardcoded limit for this to succeed
> without a tmpfs?

Increasing wal_sender_timeout visibly reduces the failure rate: out of
10 attempts I saw at least 3 failures before the change, and none
after.

> If hamster fails this regularly I think we have to do something about
> it, rather than paper over it. What's the storage situation currently
> like?

The SD card of this RPi is half-empty.
--
Michael
On 04/18/2017 03:07 AM, Michael Paquier wrote:
> Hi all,
>
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation and can reproduce the
> failure manually, but the cause is already obvious from the logs:
>
> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
> instead of current ones because stats collector is not responding
>
> The stale statistics mean that the system is so constrained that
> things are timing out.
>
> To avoid such failures in the normal regression tests, I have set up
> extra_config so that stats_temp_directory points to a tmpfs, which
> avoids the stale statistics,

Yeah, but the way you have done it could also lead to errors unless
you're very careful, as I found on axolotl (which died recently,
unfortunately). There I had to set the stats_temp directory to a
branch-specific name so that a crash on branch A didn't leave stats
picked up by a run on branch B. I had a script that ran before every
buildfarm run to make sure the branch-specific directories existed on
the tmpfs.

> but it is not possible to do that with the TAP tests. I could always
> disable --enable-tap-tests on this machine, but I don't think that is
> the right answer. One thing that could be done is to use an
> environment variable defining a base directory for all the nodes
> created in PostgresNode.pm, and use a custom folder under it for
> temporary statistics. But that does not scale if we want to enforce
> more parameters.

What more parameters do you want?

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
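For context, a minimal sketch of the kind of pre-run script described
above (the tmpfs path and branch list are assumptions for illustration,
not axolotl's actual setup):

    # Ensure a branch-specific stats_temp_directory exists on the tmpfs
    # before each buildfarm run, so branches never share stats files.
    use strict;
    use warnings;
    use File::Path qw(make_path);

    my $tmpfs = '/mnt/pgstat-tmpfs';
    foreach my $branch (qw(HEAD REL9_6_STABLE REL9_5_STABLE))
    {
        my $dir = "$tmpfs/$branch";
        make_path($dir) unless -d $dir;
    }

The animal's extra_config would then point stats_temp_directory at the
matching branch directory.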
On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> Yeah, but the way you have done it could also lead to errors unless
> you're very careful, as I found on axolotl (which died recently,
> unfortunately). There I had to set the stats_temp directory to a
> branch-specific name so that a crash on branch A didn't leave stats
> picked up by a run on branch B. I had a script that ran before every
> buildfarm run to make sure the branch-specific directories existed on
> the tmpfs.

I didn't consider that. Still, hamster is only set to run master, so
that's not a problem. Running more branches would be too much for this
little dear, as the runs include the TAP tests.

>> but it is not possible to do that with the TAP tests. I could always
>> disable --enable-tap-tests on this machine, but I don't think that is
>> the right answer. One thing that could be done is to use an
>> environment variable defining a base directory for all the nodes
>> created in PostgresNode.pm, and use a custom folder under it for
>> temporary statistics. But that does not scale if we want to enforce
>> more parameters.
>
> What more parameters do you want?

Increasing wal_sender_timeout and wal_receiver_timeout can help reduce
the failures seen.
--
Michael
On 04/18/2017 08:23 AM, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, but the way you have done it could also lead to errors unless
>> you're very careful, as I found on axolotl (which died recently,
>> unfortunately). There I had to set the stats_temp directory to a
>> branch-specific name so that a crash on branch A didn't leave stats
>> picked up by a run on branch B. I had a script that ran before every
>> buildfarm run to make sure the branch-specific directories existed on
>> the tmpfs.
> I didn't consider that. Still, hamster is only set to run master, so
> that's not a problem. Running more branches would be too much for this
> little dear, as the runs include the TAP tests.
>
>>> but it is not possible to do that with the TAP tests. I could always
>>> disable --enable-tap-tests on this machine, but I don't think that is
>>> the right answer. One thing that could be done is to use an
>>> environment variable defining a base directory for all the nodes
>>> created in PostgresNode.pm, and use a custom folder under it for
>>> temporary statistics. But that does not scale if we want to enforce
>>> more parameters.
>> What more parameters do you want?
> Increasing wal_sender_timeout and wal_receiver_timeout can help reduce
> the failures seen.

OK, but you're only talking about a handful of these, right?

Let's say we have a bunch of possible environment settings with names
that all begin with "PG_TAP_". PostgresNode.pm could check for the
existence of these and take action accordingly, and you could set them
on a buildfarm animal in its config file, or in your .profile for
interactive use.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
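A rough sketch of that idea, assuming a simple PG_TAP_<guc-name> naming
convention (the convention and the helper below are illustrative, not
an agreed design):

    # In PostgresNode.pm: scan the environment for PG_TAP_* settings and
    # append the corresponding GUCs to a freshly initialized node's
    # postgresql.conf.
    sub _apply_tap_env_settings
    {
        my ($self) = @_;
        foreach my $var (grep { /^PG_TAP_/ } keys %ENV)
        {
            (my $guc = lc $var) =~ s/^pg_tap_//;
            $self->append_conf('postgresql.conf', "$guc = '$ENV{$var}'");
        }
    }

A buildfarm animal's config file or a developer's .profile could then
carry, for example, PG_TAP_WAL_SENDER_TIMEOUT=600s or
PG_TAP_STATS_TEMP_DIRECTORY=/mnt/tmpfs/pgstat.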
On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
> On 04/18/2017 08:23 AM, Michael Paquier wrote:
>> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
>> <andrew.dunstan@2ndquadrant.com> wrote:
>>> Yeah, but the way you have done it could also lead to errors unless
>>> you're very careful, as I found on axolotl (which died recently,
>>> unfortunately). There I had to set the stats_temp directory to a
>>> branch-specific name so that a crash on branch A didn't leave stats
>>> picked up by a run on branch B. I had a script that ran before every
>>> buildfarm run to make sure the branch-specific directories existed on
>>> the tmpfs.
>> I didn't consider that. Still, hamster is only set to run master, so
>> that's not a problem. Running more branches would be too much for this
>> little dear, as the runs include the TAP tests.
>>
>>>> but it is not possible to do that with the TAP tests. I could always
>>>> disable --enable-tap-tests on this machine, but I don't think that is
>>>> the right answer. One thing that could be done is to use an
>>>> environment variable defining a base directory for all the nodes
>>>> created in PostgresNode.pm, and use a custom folder under it for
>>>> temporary statistics. But that does not scale if we want to enforce
>>>> more parameters.
>>> What more parameters do you want?
>> Increasing wal_sender_timeout and wal_receiver_timeout can help reduce
>> the failures seen.
>
> OK, but you're only talking about a handful of these, right?

Yup, that would be one solution, but it's not attacking the problem at
its root.

> Let's say we have a bunch of possible environment settings with names
> that all begin with "PG_TAP_". PostgresNode.pm could check for the
> existence of these and take action accordingly, and you could set them
> on a buildfarm animal in its config file, or in your .profile for
> interactive use.

That's the point I am trying to make upthread: slow buildfarm animals
should have minimal impact on core code modifications. We could, for
example, have one environment variable that lists all the parameters to
modify in a single string, and append them at the end of
postgresql.conf. But honestly I don't think that is necessary if there
is a single variable able to define a base directory for temporary
statistics, as at least in hamster's case the real bottleneck comes
from there. When initializing a node via PostgresNode.pm, we would just
check for this variable, and the init() routine would create a
temporary folder in it and set stats_temp_directory accordingly in
postgresql.conf.
--
Michael
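For comparison, the single-variable variant mentioned above could look
roughly like this, assuming a hypothetical PG_TAP_EXTRA_CONFIG variable
holding newline-separated settings (name and format are assumptions):

    # In PostgresNode.pm's init(): append caller-supplied extra settings,
    # if any, to the node's postgresql.conf.
    if (defined $ENV{PG_TAP_EXTRA_CONFIG})
    {
        $self->append_conf('postgresql.conf', $ENV{PG_TAP_EXTRA_CONFIG});
    }

A slow animal could then set that variable to something like
"wal_sender_timeout = 600s" followed by "wal_receiver_timeout = 600s",
without any further changes to the test code.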
On 04/18/2017 08:59 AM, Michael Paquier wrote:
>> Let's say we have a bunch of possible environment settings with names
>> that all begin with "PG_TAP_". PostgresNode.pm could check for the
>> existence of these and take action accordingly, and you could set them
>> on a buildfarm animal in its config file, or in your .profile for
>> interactive use.
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could, for
> example, have one environment variable that lists all the parameters to
> modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that is necessary if there
> is a single variable able to define a base directory for temporary
> statistics, as at least in hamster's case the real bottleneck comes
> from there. When initializing a node via PostgresNode.pm, we would just
> check for this variable, and the init() routine would create a
> temporary folder in it and set stats_temp_directory accordingly in
> postgresql.conf.

How is that going to deal with your wal_*_timeout etc. settings?

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes:
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could, for
> example, have one environment variable that lists all the parameters to
> modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that is necessary if there
> is a single variable able to define a base directory for temporary
> statistics, as at least in hamster's case the real bottleneck comes
> from there.

FWIW, I'm a bit suspicious of relocating the temp stats directory as
being a reliable fix for this. It looks to me like hamster isn't that
much slower than gaur/pademelon, i.e. the same machine that was my
primary development machine for well over a decade, and on which I have
NEVER seen a "stats collector not responding" failure. Plus, if
hamster's main storage is SSD, that seems unlikely to be slower than
the spinning rust in gaur/pademelon. So I can't escape the suspicion
that something else is going on there. Seemingly-unexplainable stats
collector problems have been a bugaboo for a long time ...

			regards, tom lane
Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
> > That's the point I am trying to make upthread: slow buildfarm animals
> > should have minimal impact on core code modifications. We could, for
> > example, have one environment variable that lists all the parameters
> > to modify in a single string, and append them at the end of
> > postgresql.conf. But honestly I don't think that is necessary if
> > there is a single variable able to define a base directory for
> > temporary statistics, as at least in hamster's case the real
> > bottleneck comes from there.
>
> FWIW, I'm a bit suspicious of relocating the temp stats directory as
> being a reliable fix for this. It looks to me like hamster isn't that
> much slower than gaur/pademelon, i.e. the same machine that was my
> primary development machine for well over a decade, and on which I
> have NEVER seen a "stats collector not responding" failure. Plus, if
> hamster's main storage is SSD, that seems unlikely to be slower than
> the spinning rust in gaur/pademelon. So I can't escape the suspicion
> that something else is going on there. Seemingly-unexplainable stats
> collector problems have been a bugaboo for a long time ...

It's an SD card (the kind typically used in cameras and phones), not an
SSD. Saying it's slow is an understatement. It's *excruciatingly*
slow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>> being a reliable fix for this.

> It's an SD card (the kind typically used in cameras and phones), not an
> SSD. Saying it's slow is an understatement. It's *excruciatingly* slow.

Oh, I misread it ... but still, the modern definition of
"excruciatingly slow" doesn't seem all that far off what 90s-era hard
drives could do. It is clear from googling, though, that there's an
enormous range in SD cards' random write performance; e.g. Wikipedia's
entry has a link to

http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/

Seems like it's hard to judge this without knowing exactly which SD
card Michael has got in that thing.

			regards, tom lane
On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>> being a reliable fix for this.
>
>> It's an SD card (the kind typically used in cameras and phones), not an
>> SSD. Saying it's slow is an understatement. It's *excruciatingly* slow.
>
> Oh, I misread it ... but still, the modern definition of
> "excruciatingly slow" doesn't seem all that far off what 90s-era hard
> drives could do. It is clear from googling, though, that there's an
> enormous range in SD cards' random write performance; e.g. Wikipedia's
> entry has a link to
>
> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>
> Seems like it's hard to judge this without knowing exactly which SD
> card Michael has got in that thing.

It's a "Professional" micro SDHC 8GB card, class 10, which has been
utterly slow in everything I have tested so far, and of course I cannot
find its read/write specifications. I had a SanDisk class 10 in it that
died some months ago, and that one was way faster. RIP.
--
Michael
On Wed, Apr 19, 2017 at 7:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Tom Lane wrote:
>>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>>> being a reliable fix for this.
>>
>>> It's an SD card (the kind typically used in cameras and phones), not
>>> an SSD. Saying it's slow is an understatement. It's *excruciatingly*
>>> slow.
>>
>> Oh, I misread it ... but still, the modern definition of
>> "excruciatingly slow" doesn't seem all that far off what 90s-era hard
>> drives could do. It is clear from googling, though, that there's an
>> enormous range in SD cards' random write performance; e.g. Wikipedia's
>> entry has a link to
>>
>> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>>
>> Seems like it's hard to judge this without knowing exactly which SD
>> card Michael has got in that thing.
>
> It's a "Professional" micro SDHC 8GB card, class 10, which has been
> utterly slow in everything I have tested so far, and of course I cannot
> find its read/write specifications. I had a SanDisk class 10 in it that
> died some months ago, and that one was way faster. RIP.

Until something happens, I have removed --enable-tap-tests from this
buildfarm machine's script. There is no point in leaving it red without
a clear resolution plan on the table.
--
Michael
On Tue, Apr 18, 2017 at 09:59:26PM +0900, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > On 04/18/2017 08:23 AM, Michael Paquier wrote:
> >> Increasing wal_sender_timeout and wal_receiver_timeout can help
> >> reduce the failures seen.
> >
> > OK, but you're only talking about a handful of these, right?
>
> Yup, that would be one solution, but it's not attacking the problem at
> its root.
>
> > Let's say we have a bunch of possible environment settings with names
> > that all begin with "PG_TAP_". PostgresNode.pm could check for the
> > existence of these and take action accordingly, and you could set
> > them on a buildfarm animal in its config file, or in your .profile
> > for interactive use.
>
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could, for
> example, have one environment variable that lists all the parameters
> to modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that is necessary if there
> is a single variable able to define a base directory for temporary
> statistics, as at least in hamster's case the real bottleneck comes
> from there. When initializing a node via PostgresNode.pm, we would
> just check for this variable, and the init() routine would create a
> temporary folder in it and set stats_temp_directory accordingly in
> postgresql.conf.

Each of the above approaches has fairly low impact on the code, so we
should use other criteria to choose. I'd welcome a feature for
augmenting every postgresql.conf of every test suite (a generalization
of "pg_regress --temp-config", which has proven its value). I can
envision using it with force_parallel_mode,
default_transaction_isolation, log_*, wal_*_timeout, autovacuum_naptime,
and others.

Even for hamster, I'm skeptical that changing stats_temp_directory
would suffice. Every hamster BinInstallCheck failure since 2017-02-13
had a "LOG: terminating walsender process due to replication timeout".
Most, but not all, of those replication timeouts followed a "LOG: using
stale statistics instead of current ones because stats collector is not
responding". For the remaining minority, I expect wal_sender_timeout to
eventually be needed. Example:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2017-02-24%2016%3A00%3A06
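For reference, the existing pg_regress mechanism referred to above works
by appending a caller-supplied file to the temporary installation's
postgresql.conf; a sketch of how a slow animal might use it (the file
name and settings are chosen purely for illustration):

    # contents of a hypothetical /tmp/slow-animal.conf
    wal_sender_timeout = 600s
    wal_receiver_timeout = 600s
    autovacuum_naptime = 10min

    # hand it to pg_regress for a "make check" style run
    make check EXTRA_REGRESS_OPTS="--temp-config=/tmp/slow-animal.conf"

The feature wished for above would extend this kind of knob to every
test suite, including the TAP tests, rather than only pg_regress-driven
ones.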