Thread: test/modules/test_oat_hooks vs. debug_discard_caches=1

test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tom Lane
Date:
We realized today [1] that it's been some time since the buildfarm
had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage.
Sure enough, as soon as Tomas turned that back on, kaboom [2].
The test_oat_hooks test is failing --- it's not crashing, but
it's emitting more NOTICE lines than the expected output includes,
evidently as a result of the hooks getting invoked extra times
during cache reloads.  I can reproduce that here.

Maybe it was a poor design that these hooks were placed someplace
that's sensitive to that.  I dunno.  The only short-term solution
I can think of is to force debug_discard_caches to 0 within that
test script, which is annoying but feasible (since that module
only exists in v15+).

Thoughts, other proposals?

            regards, tom lane

[1] https://www.postgresql.org/message-id/6b52e783-1b32-e723-4311-0e433a5a5a75%40enterprisedb.com
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2022-11-18%2016%3A01%3A43



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Andres Freund
Date:
Hi,

On 2022-11-18 15:55:34 -0500, Tom Lane wrote:
> We realized today [1] that it's been some time since the buildfarm
> had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage.

Do we know when it was covered last? I assume it's before the addition of
test_oat_hooks in 90efa2f5565?


> Sure enough, as soon as Tomas turned that back on, kaboom [2].
> The test_oat_hooks test is failing --- it's not crashing, but
> it's emitting more NOTICE lines than the expected output includes,
> evidently as a result of the hooks getting invoked extra times
> during cache reloads.  I can reproduce that here.

Did you already look into where those additional namespace searches are coming
from? There are cases in which repeated namespace searches are problematic,
because of the potential for races they open up...

Greetings,

Andres Freund



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-11-18 15:55:34 -0500, Tom Lane wrote:
>> The test_oat_hooks test is failing --- it's not crashing, but
>> it's emitting more NOTICE lines than the expected output includes,
>> evidently as a result of the hooks getting invoked extra times
>> during cache reloads.  I can reproduce that here.

> Did you already look into where those additional namespace searches are coming
> from? There are cases in which repeated namespace searches are problematic,
> because of the potential for races they open up...

I'm not sufficiently interested in that API to dig hard for details,
but in a first look it seemed like the extra reports were coming
from repeated executions of recomputeNamespacePath, which are
forced after a cache invalidation by NamespaceCallback.
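
To illustrate the mechanism, here is a toy Python model (not PostgreSQL
code). It assumes that each search-path recomputation fires a reporting
hook once per schema, and that debug_discard_caches=1 effectively
invalidates the cached path before every lookup. The class and method
names are invented for the sketch; they are only loose stand-ins for
recomputeNamespacePath, NamespaceCallback and the test module's hook.

```python
# Toy model only: none of this is PostgreSQL code or its real API.

class ToyBackend:
    def __init__(self, discard_caches):
        self.discard_caches = discard_caches  # stand-in for debug_discard_caches
        self.path_valid = False               # is the cached search path usable?
        self.cached_path = None
        self.notices = []                     # stand-in for emitted NOTICE lines

    def namespace_search_hook(self, schema):
        # Stand-in for the test module's hook, which reports every call.
        self.notices.append(f"NOTICE: namespace search for {schema}")

    def invalidate(self):
        # Stand-in for the invalidation callback: mark the path stale.
        self.path_valid = False

    def recompute_path(self):
        # Stand-in for the path recomputation; only does work when stale.
        if self.path_valid:
            return self.cached_path
        for schema in ("pg_catalog", "public"):
            self.namespace_search_hook(schema)
        self.cached_path = ["pg_catalog", "public"]
        self.path_valid = True
        return self.cached_path

    def run_statement(self):
        # Assume each statement consults the search path three times.
        for _ in range(3):
            if self.discard_caches:
                self.invalidate()  # caches are discarded at every opportunity
            self.recompute_path()


def notice_count(discard_caches, statements=2):
    """Count hook reports produced by running a few statements."""
    backend = ToyBackend(discard_caches)
    for _ in range(statements):
        backend.run_statement()
    return len(backend.notices)


if __name__ == "__main__":
    print(notice_count(0), notice_count(1))  # prints "2 12"
```

With the cache left alone the hook output appears once; with discarding
on, the same statements produce many more lines, which is the kind of
mismatch against a fixed expected-output file that the test is hitting.
Forcing the setting to 0 inside the test script makes the count
deterministic again.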

            regards, tom lane



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-11-18 15:55:34 -0500, Tom Lane wrote:
>> We realized today [1] that it's been some time since the buildfarm
>> had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage.

> Do we know when it was covered last? I assume it's before the addition of
> test_oat_hooks in 90efa2f5565?

As far as that goes: some digging in the buildfarm DB says that avocet
last did a CCA run on 2021-10-22 and trilobite on 2021-10-24.  They
were then offline completely until 2022-02-10, and when they restarted
the runtimes were way too short to be CCA tests.
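
The "too short to be CCA" inference can be sketched as a simple duration
check. This Python fragment assumes a cache-clobbering run takes at
least some multiple of a normal run; the factor of 10, the function name
and all durations below are made-up placeholders, not figures from the
buildfarm DB.

```python
from datetime import timedelta
from statistics import median


def looks_like_cca_run(duration, normal_durations, min_slowdown=10.0):
    """Guess whether a run was slow enough to have had cache discarding on.

    duration and normal_durations are timedeltas; min_slowdown is an
    assumed lower bound on how much slower such a run is than a normal one.
    """
    baseline = median(d.total_seconds() for d in normal_durations)
    return duration.total_seconds() >= baseline * min_slowdown


# Hypothetical run times for ordinary (non-CCA) builds of one animal.
normal = [timedelta(minutes=20), timedelta(minutes=25), timedelta(minutes=22)]

suspect = timedelta(minutes=30)   # close to a normal run: not plausible as CCA
plausible = timedelta(days=3)     # far above the baseline: plausible as CCA
```

A check like this would only flag that coverage has silently lapsed; it
says nothing about why the animal stopped running in that mode.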

Seems like maybe we need a little more redundancy in this bunch of
buildfarm animals.

            regards, tom lane



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tomas Vondra
Date:

On 11/19/22 04:10, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2022-11-18 15:55:34 -0500, Tom Lane wrote:
>>> We realized today [1] that it's been some time since the buildfarm
>>> had any debug_discard_caches (nee CLOBBER_CACHE_ALWAYS) coverage.
> 
>> Do we know when it was covered last? I assume it's before the addition of
>> test_oat_hooks in 90efa2f5565?
> 
> As far as that goes: some digging in the buildfarm DB says that avocet
> last did a CCA run on 2021-10-22 and trilobite on 2021-10-24.  They
> were then offline completely until 2022-02-10, and when they restarted
> the runtimes were way too short to be CCA tests.
> 

Yeah. I'll try setting up better monitoring / alerting to notice
issues like this more promptly ... it's a bit tough, because IIRC the
gap 2021-10-22 - 2022-02-10 was due to the tests running, but getting
stuck for some reason. So it's not like the machine was off.

I wonder if it'd make sense to have some simple & optional alerting
based on how long ago the machine reported the last result. Sending an
e-mail if there was no report for a month or so would be enough.
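
A minimal sketch of that kind of check, in Python, assuming we have a
mapping from animal name to the time of its latest report. The function
name, the 30-day threshold and the third animal are hypothetical; the
two dates are borrowed from this thread purely as sample data.

```python
from datetime import datetime, timedelta


def stale_animals(last_report, now, max_silence=timedelta(days=30)):
    """Return the names of animals whose latest report is too old."""
    return sorted(
        name for name, seen in last_report.items()
        if now - seen > max_silence
    )


# Sample data: latest report time per animal.
last_report = {
    "avocet": datetime(2021, 10, 22),
    "trilobite": datetime(2021, 10, 24),
    "some_healthy_animal": datetime(2021, 12, 30),  # invented name
}

if __name__ == "__main__":
    print(stale_animals(last_report, now=datetime(2022, 1, 1)))
    # prints ['avocet', 'trilobite']
```

Because the check only looks at when results last arrived, it does not
matter whether the animal is powered off or running but stuck: both show
up as silence.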

> Seems like maybe we need a little more redundancy in this bunch of
> buildfarm animals.
> 

It's actually a bit worse than that, because both animals are on the
same machine. So avocet gets "stuck" -> trilobite is stuck too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Andrew Dunstan
Date:
On 2022-11-19 Sa 05:34, Tomas Vondra wrote:
>
> I wonder if it'd make sense to have some simple & optional alerting
> based on how long ago the machine reported the last result. Send e-mail
> if there was no report for a month or so would be enough.


This has been part of the buildfarm for a very long time. See the alerts
section of the config file.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tomas Vondra
Date:
On 11/19/22 14:51, Andrew Dunstan wrote:
> 
> On 2022-11-19 Sa 05:34, Tomas Vondra wrote:
>>
>> I wonder if it'd make sense to have some simple & optional alerting
>> based on how long ago the machine reported the last result. Send e-mail
>> if there was no report for a month or so would be enough.
> 
> 
> This has been part of the buildfarm for a very long time. See the alerts
> section of the config file.
> 

I'm aware of that, but those alerts are not quite what I was asking
about. Imagine the run gets stuck for whatever reason (like an infinite
loop somewhere), or maybe the VM fails / becomes inaccessible, perhaps
because of some sort of human error, so that the cron job does not
run ...

I don't think alerting from the client would catch those cases, but
maybe it's a rare issue and I'm overthinking it.

regards




Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 11/19/22 14:51, Andrew Dunstan wrote:
>> On 2022-11-19 Sa 05:34, Tomas Vondra wrote:
>>> I wonder if it'd make sense to have some simple & optional alerting
>>> based on how long ago the machine reported the last result. Send e-mail
>>> if there was no report for a month or so would be enough.

>> This has been part of the buildfarm for a very long time. See the alerts
>> section of the config file.

> I don't think alerting from the client would catch those cases, but
> maybe it's a rare issue and I'm overthinking it.

Those alerts are sent by the buildfarm server, not the client.

That has a failure mode of its own: if an animal goes down hard,
the server is left with its last-seen alert setup.  The only
way to not get nagged permanently is to ask Andrew to intervene
manually.  (Ask me how I know.)

            regards, tom lane



Re: test/modules/test_oat_hooks vs. debug_discard_caches=1

From
Andrew Dunstan
Date:
On 2022-11-19 Sa 09:33, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 11/19/22 14:51, Andrew Dunstan wrote:
>>> On 2022-11-19 Sa 05:34, Tomas Vondra wrote:
>>>> I wonder if it'd make sense to have some simple & optional alerting
>>>> based on how long ago the machine reported the last result. Send e-mail
>>>> if there was no report for a month or so would be enough.
>>> This has been part of the buildfarm for a very long time. See the alerts
>>> section of the config file.
>> I don't think alerting from the client would catch those cases, but
>> maybe it's a rare issue and I'm overthinking it.
> Those alerts are sent by the buildfarm server, not the client.
>
> That has a failure mode of its own: if an animal goes down hard,
> the server is left with its last-seen alert setup.  The only
> way to not get nagged permanently is to ask Andrew to intervene
> manually.  (Ask me how I know.)


True for now. The next release will have a utility command to
disable/enable alerts. The required server changes have already been made.


cheers


andrew

