deadlock-soft isolation test is failing on newest Fedora - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | deadlock-soft isolation test is failing on newest Fedora |
Date | |
Msg-id | b25b15cd-a1b5-c0f1-3cc9-514acc3c23a6@gmail.com Whole thread Raw |
List | pgsql-hackers |
Hello hackers, I'd like to bring your attention to an interesting failure produced by caiman twice: [1] and [2]. As far as I can see, this animal runs on Fedora and gets OS updates on a daily basis. At the moment of the first failure it had kernel: 6.12.0-0.rc6.20241108git906bd684e4b1.55.fc42.x86_64, and at the second: 6.12.0-0.rc7.59.fc42.x86_64. I've created a VirtualBox (7.1.2) VM with Fedora 42 from Fedora-Server-dvd-x86_64-Rawhide-20241114.n.0.iso and reproduced the issue locally. On a fully updated OS, with kernel 6.12.0-0.rc7.59.fc42.x86_64 (I also tried 6.11.0-63.fc42.x86_64 and got the same failures), I run `make check -C src/test/isolation` in a loop with 100 deadlock-soft tests in isolation_schedule and see, for instance: ITERATION 81 ... ok 47 - deadlock-soft 279 ms not ok 48 - deadlock-soft 728711 ms ok 49 - deadlock-soft 1775 ms regression.diffs is the same as in the caiman logs. (Unfortunately, the buildfarm client doesn't enable log_lock_waits (and log_autovacuum_min_duration) for installcheck stages, unlike pg_regress.c, so those installcheck logs (inst/logfile) don't contain helpful information for this case.) But locally I can see the corresponding lines in postmaster.log: ... 2024-11-17 10:47:47.558 GMT [17448:12] isolation/deadlock-soft/d2 LOG: process 17448 avoided deadlock for AccessShareLock on relation 16661 of database 16384 by rearranging queue order after 20.549 ms 2024-11-17 10:47:58.019 GMT [17457:11] isolation/deadlock-soft/e1 LOG: process 17457 avoided deadlock for AccessExclusiveLock on relation 16667 of database 16384 by rearranging queue order after 10297.867 ms 2024-11-17 10:59:58.112 GMT [17502:12] isolation/deadlock-soft/d2 LOG: process 17502 avoided deadlock for AccessShareLock on relation 16673 of database 16384 by rearranging queue order after 1597.951 ms .. (That is, the test failed when not d2, but e1 avoided deadlock (I also saw such failures with d1).) And another case (produced on Fedora 41, with 6.11.7-300.fc41.x86_64): 2024-11-17 15:05:55.825 GMT [6281:8] isolation/deadlock-soft/e1 LOG: process 6281 avoided deadlock for AccessExclusiveLock on relation 16943 of database 16384 by rearranging queue order after 13092.228 ms 2024-11-17 15:05:55.825 GMT [6281:9] isolation/deadlock-soft/e1 DETAIL: Processes holding the lock: 6279, 6280. Wait queue: 6281. 2024-11-17 15:05:55.825 GMT [6281:10] isolation/deadlock-soft/e1 STATEMENT: LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; 2024-11-17 15:05:55.825 GMT [6281:11] isolation/deadlock-soft/e1 LOG: process 6281 still waiting for AccessExclusiveLock on relation 16943 of database 16384 after 13092.228 ms 2024-11-17 15:05:55.825 GMT [6281:12] isolation/deadlock-soft/e1 DETAIL: Processes holding the lock: 6279, 6280. Wait queue: 6281. 2024-11-17 15:05:55.825 GMT [6281:13] isolation/deadlock-soft/e1 STATEMENT: LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; 2024-11-17 15:06:22.932 GMT [6282:8] isolation/deadlock-soft/e2 LOG: process 6282 still waiting for AccessExclusiveLock on relation 16946 of database 16384 after 32320.210 ms 2024-11-17 15:06:22.932 GMT [6282:9] isolation/deadlock-soft/e2 DETAIL: Process holding the lock: 6280. Wait queue: 6282, 6279. 2024-11-17 15:06:22.932 GMT [6282:10] isolation/deadlock-soft/e2 STATEMENT: LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; 2024-11-17 15:07:06.540 GMT [6279:9] isolation/deadlock-soft/d1 LOG: process 6279 still waiting for AccessShareLock on relation 16946 of database 16384 after 75890.853 ms 2024-11-17 15:07:06.540 GMT [6279:10] isolation/deadlock-soft/d1 DETAIL: Process holding the lock: 6280. Wait queue: 6282, 6279. 2024-11-17 15:07:06.540 GMT [6279:11] isolation/deadlock-soft/d1 STATEMENT: LOCK TABLE a2 IN ACCESS SHARE MODE; 2024-11-17 15:10:01.258 GMT [5524:1] LOG: checkpoint starting: time 2024-11-17 15:11:50.775 GMT [6295:1] [unknown] LOG: connection received: host=[local] 2024-11-17 15:11:50.775 GMT [6279:12] isolation/deadlock-soft/d1 ERROR: canceling statement due to user request 2024-11-17 15:11:50.775 GMT [6279:13] isolation/deadlock-soft/d1 STATEMENT: LOCK TABLE a2 IN ACCESS SHARE MODE; ... This failure was accompanied by the following report in the system journal: Nov 17 15:07:01 localhost.localdomain kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: Nov 17 15:07:01 localhost.localdomain kernel: rcu: 0-...!: (2 ticks this GP) idle=5f58/0/0x0 softirq=7659/7660 fqs=1 (false positive?) Nov 17 15:07:01 localhost.localdomain kernel: rcu: 2-...!: (1 GPs behind) idle=caa0/0/0x0 softirq=6821/6822 fqs=1 (false positive?) Nov 17 15:07:01 localhost.localdomain kernel: rcu: (detected by 8, t=60030 jiffies, g=26137, q=80 ncpus=12) Nov 17 15:07:01 localhost.localdomain kernel: Sending NMI from CPU 8 to CPUs 0: ... I saw such messages on Fedora 42 too, and it looks like there is no 1:1 correlation between such CPU stall reports and the test failures. Most likely, a stall reported only when it lasts more than 60 seconds, but the test can fail even with a shorter stall duration; on the other hand, even with a long stall the test can pass, depending on which session resolves a soft deadlock. So this looks like a kernel/OS issue, but I wonder, are these alternate soft deadlock resolutions legal in principle? And if so, maybe the test should expect them... [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-11-11%2016%3A02%3A04 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-11-16%2004%3A24%3A24 Best regards, Alexander
pgsql-hackers by date: