Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS
From: Michael Paquier
Subject: Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS
Date:
Msg-id: YKSAFhtltIr9GUEi@paquier.xyz
In response to: Subscription tests fail under CLOBBER_CACHE_ALWAYS (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS
List: pgsql-hackers
On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
> I count three distinct bugs that were exposed by this attempt:
>
> 1. In the part of 013_partition.pl that tests firing AFTER
> triggers on partitioned tables, we have a case of continuing
> to access a relcache entry that's already been closed.
> (I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
> hasn't exposed this.)  It looks to me like instead we had
> a relcache reference leak before f3b141c48, but now, the
> only relcache reference count on a partition child table
> is dropped by ExecCleanupTupleRouting, which logical/worker.c
> invokes before it fires triggers on that table.  Kaboom.
> This might go away if worker.c weren't so creatively different
> from the other code paths concerned with executor shutdown.

Tuple routing has made the whole worker logic much messier than when
this code was only able to apply DML changes on the partition leaves.
I know that it is not great to be even more creative here, but we
need to make sure that AfterTriggerEndQuery() is called before
ExecCleanupTupleRouting().  We could either keep the
ExecCleanupTupleRouting() calls as they are now and call
AfterTriggerEndQuery() in more code paths, or we could create one
PartitionTupleRouting and one ModifyTableState beforehand in
create_estate_for_relation() when applying the change to a
partitioned table, but that means manipulating more structures across
more layers of this code.  Something like the attached fixes the
problem for me, but honestly it does not do much to clarify this code
(a simplified sketch of the reordering follows at the end of this
message).

I was not patient enough to wait for CLOBBER_CACHE_ALWAYS builds to
initialize the nodes in the TAP tests, so I have initialized the
setup with a non-clobber build, then reproduced the problem with
CLOBBER_CACHE_ALWAYS builds running on those same nodes.

You are right that this is not a problem confined to 14~.  I can
reproduce it on 13 as well, and we have no coverage for tuple routing
with triggers on that branch, so this would never have been stressed
in the buildfarm.  There is a good argument for cherry-picking
2ecfeda3 to REL_13_STABLE.

> 2. Said bug causes a segfault in the apply worker process.
> This causes the parent postmaster to give up and die.
> I don't understand why we don't treat that like a crash
> in a regular backend, considering that an apply worker
> is running largely user-defined code.

CleanupBackgroundWorker() and CleanupBackend() have a lot in common.
Are you referring to an inconsistent behavior with
restart_after_crash, which gets ignored for bgworkers?  We disable it
by default in the TAP tests.

> 3. Once the subscriber1 postmaster has exited, the TAP
> test will eventually time out, and then this happens:
>
> [.. logs ..]
>
> That is, because we failed to shut down subscriber1, the
> test script neglects to shut down subscriber2, and now
> things just sit indefinitely.  So that's a robustness
> problem in the TAP infrastructure, rather than a bug in
> PG proper; but I still say it's a bug that needs fixing.

This one comes down to teardown_node() using system_or_bail(), which
bails out and leaves things unfinished on failure.  I guess that we
could be more aggressive in the END block of PostgresNode.pm,
ignoring a shutdown failure when the exit code is non-zero and not
all the tests have passed.
--
Michael
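For readers without the attachment at hand, here is a minimal,
hypothetical sketch of the reordering discussed above, written in the
style of worker.c's executor-shutdown code.  It is not the attached
patch: finish_apply_exec() and the exact set of cleanup calls are
illustrative assumptions; only the relative order of
AfterTriggerEndQuery() and ExecCleanupTupleRouting() is the point.

#include "postgres.h"

#include "commands/trigger.h"
#include "executor/execPartition.h"
#include "executor/executor.h"

/*
 * Hypothetical shutdown helper for the apply worker.  AFTER triggers
 * must be fired while the tuple-routing state still pins the relcache
 * entries of the partition children; only then is it safe to let
 * ExecCleanupTupleRouting() drop those references.
 */
static void
finish_apply_exec(EState *estate, ModifyTableState *mtstate,
				  PartitionTupleRouting *proute)
{
	/* Fire queued AFTER triggers while relcache references are held. */
	AfterTriggerEndQuery(estate);

	/* Now dropping the partition children's relcache pins is safe. */
	if (proute != NULL)
		ExecCleanupTupleRouting(mtstate, proute);

	ExecResetTupleTable(estate->es_tupleTable, false);
	FreeExecutorState(estate);
}

Whatever shape the final patch takes, the design point is that trigger
firing and routing cleanup should be owned by a single shutdown path,
so that their relative order cannot drift apart again.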
Attachment