Thread: test/modules/test_oat_hooks vs. debug_discard_caches=1
We realized today [1] that it's been some time since the buildfarm had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage. Sure enough, as soon as Tomas turned that back on, kaboom [2]. The test_oat_hooks test is failing --- it's not crashing, but it's emitting more NOTICE lines than the expected output includes, evidently as a result of the hooks getting invoked extra times during cache reloads. I can reproduce that here. Maybe it was a poor design that these hooks were placed someplace that's sensitive to that. I dunno. The only short-term solution I can think of is to force debug_discard_caches to 0 within that test script, which is annoying but feasible (since that module only exists in v15+). Thoughts, other proposals? regards, tom lane [1] https://www.postgresql.org/message-id/6b52e783-1b32-e723-4311-0e433a5a5a75%40enterprisedb.com [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2022-11-18%2016%3A01%3A43
Hi, On 2022-11-18 15:55:34 -0500, Tom Lane wrote: > We realized today [1] that it's been some time since the buildfarm > had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage. Do we know when it was covered last? I assume it's before the addition of test_oat_hooks in 90efa2f5565? > Sure enough, as soon as Tomas turned that back on, kaboom [2]. > The test_oat_hooks test is failing --- it's not crashing, but > it's emitting more NOTICE lines than the expected output includes, > evidently as a result of the hooks getting invoked extra times > during cache reloads. I can reproduce that here. Did you already look into where those additional namespace searches are coming from? There are cases in which it is not unproblematic to have repeated namespace searches due to the potential for races it opens up... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-11-18 15:55:34 -0500, Tom Lane wrote: >> The test_oat_hooks test is failing --- it's not crashing, but >> it's emitting more NOTICE lines than the expected output includes, >> evidently as a result of the hooks getting invoked extra times >> during cache reloads. I can reproduce that here. > Did you already look into where those additional namespace searches are coming > from? There are cases in which it is not unproblematic to have repeated > namespace searches due to the potential for races it opens up... I'm not sufficiently interested in that API to dig hard for details, but in a first look it seemed like the extra reports were coming from repeated executions of recomputeNamespacePath, which are forced after a cache invalidation by NamespaceCallback. regards, tom lane
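[Editor's illustration] The mechanism Tom describes above (a hook that fires whenever the search path is recomputed, with recomputation forced after every invalidation) can be mimicked with a toy model. This is only a sketch and not PostgreSQL code: the class NamespaceCache, the function count_notices, and the hard-coded path are hypothetical names invented for illustration, and the discard_always flag merely stands in for the effect of debug_discard_caches=1.

```python
class NamespaceCache:
    """Toy stand-in for a cached search path (hypothetical, not PostgreSQL code)."""

    def __init__(self, hook, discard_always=False):
        self.hook = hook                      # called on every recomputation
        self.discard_always = discard_always  # mimics debug_discard_caches=1
        self.valid = False
        self.path = None

    def invalidate(self):
        # Plays the role of an invalidation callback: just mark the cache stale.
        self.valid = False

    def lookup(self):
        if self.discard_always:
            self.invalidate()
        if not self.valid:
            # Recomputation is where the hook fires, so each forced
            # invalidation produces one extra report.
            self.hook("namespace search")
            self.path = ["pg_catalog", "public"]
            self.valid = True
        return self.path


def count_notices(discard_always, lookups=3):
    """Return how many times the hook fired across `lookups` lookups."""
    notices = []
    cache = NamespaceCache(notices.append, discard_always)
    for _ in range(lookups):
        cache.lookup()
    return len(notices)
```

In this model, three lookups produce one report with a normal cache and three when the cache is discarded before every lookup, which is the kind of mismatch against a fixed expected-output file that the test is showing.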
Andres Freund <andres@anarazel.de> writes: > On 2022-11-18 15:55:34 -0500, Tom Lane wrote: >> We realized today [1] that it's been some time since the buildfarm >> had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage. > Do we know when it was covered last? I assume it's before the addition of > test_oat_hooks in 90efa2f5565? As far as that goes: some digging in the buildfarm DB says that avocet last did a CCA run on 2021-10-22 and trilobite on 2021-10-24. They were then offline completely until 2022-02-10, and when they restarted the runtimes were way too short to be CCA tests. Seems like maybe we need a little more redundancy in this bunch of buildfarm animals. regards, tom lane
On 11/19/22 04:10, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2022-11-18 15:55:34 -0500, Tom Lane wrote: >>> We realized today [1] that it's been some time since the buildfarm >>> had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage. > >> Do we know when it was covered last? I assume it's before the addition of >> test_oat_hooks in 90efa2f5565? > > As far as that goes: some digging in the buildfarm DB says that avocet > last did a CCA run on 2021-10-22 and trilobite on 2021-10-24. They > were then offline completely until 2022-02-10, and when they restarted > the runtimes were way too short to be CCA tests. > Yeah. I'll try setting up a better monitoring / alerting to notice issues like this more promptly ... it's a bit tough, because IIRC the gap 2021-10-22 - 2022-02-10 was due to the tests running, but getting stuck for some reason. So it's not like the machine was off. I wonder if it'd make sense to have some simple & optional alerting based on how long ago the machine reported the last result. Send e-mail if there was no report for a month or so would be enough. > Seems like maybe we need a little more redundancy in this bunch of > buildfarm animals. > It's actually a bit worse than that, because both animals are on the same machine. So avocet gets "stuck" -> trilobite is stuck too. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
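[Editor's illustration] The alerting idea Tomas floats above (flag an animal when its last report is older than some threshold) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not how the buildfarm server implements its alerts: the function stale_animals, the shape of the last_reports mapping, and the 30-day default are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_animals(last_reports, now, max_silence=timedelta(days=30)):
    """Return the sorted names of animals whose last report is older than max_silence.

    last_reports maps an animal name to the datetime of its last report
    (a hypothetical data shape chosen for this sketch).
    """
    return sorted(
        name for name, last in last_reports.items()
        if now - last > max_silence
    )


# Dates taken from the thread; "now" is the day the animals came back.
reports = {
    "avocet": datetime(2021, 10, 22),
    "trilobite": datetime(2021, 10, 24),
}
overdue = stale_animals(reports, now=datetime(2022, 2, 10))
```

Because the check depends only on when the last result arrived, it would also cover a run that is stuck rather than a machine that is off, which is the case described above.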
On 2022-11-19 Sa 05:34, Tomas Vondra wrote: > > I wonder if it'd make sense to have some simple & optional alerting > based on how long ago the machine reported the last result. Send e-mail > if there was no report for a month or so would be enough. This has been part of the buildfarm for a very long time. See the alerts section of the config file. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 11/19/22 14:51, Andrew Dunstan wrote: > > On 2022-11-19 Sa 05:34, Tomas Vondra wrote: >> >> I wonder if it'd make sense to have some simple & optional alerting >> based on how long ago the machine reported the last result. Send e-mail >> if there was no report for a month or so would be enough. > > > This has been part of the buildfarm for a very long time. See the alerts > section of the config file. > I'm aware of that, but those alerts are not quite what I was asking about. Imagine the run gets stuck for whatever reason (like infinite loop somewhere), or maybe the VM fails / gets inaccessible for whatever reason, perhaps because of some sort of human error so that the cron does not get run ... I don't think alerting from the client would catch those cases, but maybe it's a rare issue and I'm overthinking it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 11/19/22 14:51, Andrew Dunstan wrote: >> On 2022-11-19 Sa 05:34, Tomas Vondra wrote: >>> I wonder if it'd make sense to have some simple & optional alerting >>> based on how long ago the machine reported the last result. Send e-mail >>> if there was no report for a month or so would be enough. >> This has been part of the buildfarm for a very long time. See the alerts >> section of the config file. > I don't think alerting from the client would catch those cases, but > maybe it's a rare issue and I'm overthinking it. Those alerts are sent by the buildfarm server, not the client. That has a failure mode of its own: if an animal goes down hard, the server is left with its last-seen alert setup. The only way to not get nagged permanently is to ask Andrew to intervene manually. (Ask me how I know.) regards, tom lane
On 2022-11-19 Sa 09:33, Tom Lane wrote: > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >> On 11/19/22 14:51, Andrew Dunstan wrote: >>> On 2022-11-19 Sa 05:34, Tomas Vondra wrote: >>>> I wonder if it'd make sense to have some simple & optional alerting >>>> based on how long ago the machine reported the last result. Send e-mail >>>> if there was no report for a month or so would be enough. >>> This has been part of the buildfarm for a very long time. See the alerts >>> section of the config file. >> I don't think alerting from the client would catch those cases, but >> maybe it's a rare issue and I'm overthinking it. > Those alerts are sent by the buildfarm server, not the client. > > That has a failure mode of its own: if an animal goes down hard, > the server is left with its last-seen alert setup. The only > way to not get nagged permanently is to ask Andrew to intervene > manually. (Ask me how I know.) > > True for now. The next release will have a utility command to disable/enable alerts. The required server changes have already been made. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com