Thread: BUG #17695: Failed Assert in logical replication snapbuild.
The following bug has been logged on the website:

Bug reference: 17695
Logged by: 施博文
Email address: zxwsbg@qq.com
PostgreSQL version: 14.6
Operating system: centos
Description:

In PG14 and later, I noticed that SnapBuildRestore doesn't set builder->next_phase_at = InvalidTransactionId.

But the SnapBuildSerialize function asserts this condition:

Assert(builder->next_phase_at == InvalidTransactionId);

This causes the assertion to fail. I have reproduced it with a Perl test case, which I will post later. The failure is as follows:

TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1604, PID: 29974)
postgres: master: walsender postgres [local] START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
postgres: master: walsender postgres [local] START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
postgres: master: walsender postgres [local] START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
postgres: master: walsender postgres [local] START_REPLICATION(exec_replication_command+0x42c)[0x914593]
postgres: master: walsender postgres [local] START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
postgres: master: walsender postgres [local] START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
I wrote this Perl test case to reproduce the problem; it needs superuser privileges and a database named 'postgres'.
I also wrote a patch to fix the problem.
Thanks.
------------------ Original Message ------------------
From: "zxwsbg" <noreply@postgresql.org>;
Sent: Thursday, November 24, 2022, 6:15 PM
To: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Cc: "施博文"<zxwsbg@qq.com>;
Subject: BUG #17695: Failed Assert in logical replication snapbuild.
Bug reference: 17695
Logged by: 施博文
Email address: zxwsbg@qq.com
PostgreSQL version: 14.6
Operating system: centos
Description:
In PG14 and later, I noticed that SnapBuildRestore doesn't set
builder->next_phase_at = InvalidTransactionId.
But the SnapBuildSerialize function asserts this condition:
Assert(builder->next_phase_at == InvalidTransactionId);
This causes the assertion to fail. I have reproduced it with a Perl test case,
which I will post later. The failure is as follows:
TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId",
File: "snapbuild.c", Line: 1604, PID: 29974)
postgres: master: walsender postgres [local]
START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
postgres: master: walsender postgres [local]
START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
postgres: master: walsender postgres [local]
START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
postgres: master: walsender postgres [local]
START_REPLICATION(exec_replication_command+0x42c)[0x914593]
postgres: master: walsender postgres [local]
START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
postgres: master: walsender postgres [local]
START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
Attachment
On Thu, 24 Nov 2022 at 15:58, 施博文 <zxwsbg@qq.com> wrote:
>
> I wrote this Perl test case to reproduce the problem; it needs superuser privileges and a database named 'postgres'.
>
> I also wrote a patch to fix the problem.

Thanks for sharing the script. I was able to reproduce the problem in PG 14.6 with your scripts. I was not able to reproduce the problem in HEAD after making the required changes in the script for HEAD. I was not sure if this issue is only in 14.6 or if the issue is present in HEAD too. Since the Assert is there and the code change you suggested is not in HEAD, I thought this issue might be present in HEAD too. Does it work for you on HEAD?

Regards,
Vignesh
My test branch is REL_14_STABLE at its head; the commit is c93254424f288557eeef13343be8f72536cb9ffe
I think the problem still exists, since the source code of the SnapBuildRestore function has not changed. The problem is hard to reproduce; maybe you can try a few more times.
------------------ Original Message ------------------
From: "vignesh C" <vignesh21@gmail.com>;
Sent: Thursday, November 24, 2022, 9:09 PM
To: "施博文"<zxwsbg@qq.com>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: BUG #17695: Failed Assert in logical replication snapbuild.
>
> I wrote this Perl test case to reproduce the problem; it needs superuser privileges and a database named 'postgres'.
>
> I also wrote a patch to fix the problem.
Thanks for sharing the script. I was able to reproduce the problem in
PG 14.6 with your scripts. I was not able to reproduce the problem in
HEAD after making the required changes in the script for HEAD. I was
not sure if this issue is only in 14.6 or if the issue is present in
HEAD too. Since the Assert is there and the code change you suggested
is not in HEAD, I thought this issue might be present in HEAD too.
Does it work for you on HEAD?
Regards,
Vignesh
On Thu, Nov 24, 2022 at 7:28 PM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 17695
> Logged by: 施博文
> Email address: zxwsbg@qq.com
> PostgreSQL version: 14.6
> Operating system: centos
> Description:
>
> In PG14 and later, I noticed that SnapBuildRestore doesn't set
> builder->next_phase_at = InvalidTransactionId.
>
> But the SnapBuildSerialize function asserts this condition:
>
> Assert(builder->next_phase_at == InvalidTransactionId);
>
> This causes the assertion to fail. I have reproduced it with a Perl test case,
> which I will post later. The failure is as follows:
>
> TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1604, PID: 29974)
> postgres: master: walsender postgres [local] START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
> postgres: master: walsender postgres [local] START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
> postgres: master: walsender postgres [local] START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
> postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
> postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
> postgres: master: walsender postgres [local] START_REPLICATION(exec_replication_command+0x42c)[0x914593]
> postgres: master: walsender postgres [local] START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
> postgres: master: walsender postgres [local] START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
> postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
> postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
>

Thank you for reporting the issue!

I could reproduce this issue with a small change and the following steps:

1. Add a sleep after setting SNAPBUILD_BUILDING_SNAPSHOT.

@@ -1409,6 +1409,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
                                 running->xcnt, running->nextXid)));

         SnapBuildWaitSnapshot(running, running->nextXid);
+
+        elog(LOG, "sleep 10s");
+        pg_usleep(10 * 1000000);
     }

2. Create and start a database instance, then create a table and the replication slot 's1'.

create table test (i int);
select pg_create_logical_replication_slot('s1', 'test_decoding');

3. Create another replication slot 's2' while a transaction is running.

tx-1: begin; insert into test values (1);
tx-2: select pg_create_logical_replication_slot('s2', 'test_decoding');
checkpoint;
commit;

Note that the replication slot creation will wait for 10 sec after tx-1 commits. Please go to the next step once the slot has been created.

4. Start decoding on slot 's2'.

select pg_logical_slot_get_changes('s2', null, null);

The logical decoding will sleep. Please go to the next step after seeing the log message "sleep 10s".

5. While the logical decoding started at step 4 is sleeping, start decoding on slot 's1'.

select pg_logical_slot_get_changes('s1', null, null);

6. After the decoding on slot 's1' wakes up, it fails due to the assertion check.

This scenario simulates the case where logical decoding (re)starts at a RUNNING_XACTS record that has running transactions, and then restores a serialized snapshot when decoding the next RUNNING_XACTS record. Since we don't reset builder->next_phase_at when restoring a serialized snapshot, the assertion check in SnapBuildSerialize fails.
Regarding the proposed patch, I've confirmed it fixes this issue. But I think it's better to reset builder->next_phase_at right after the following assertion check:

    /* consistent snapshots have no next phase */
    Assert(ondisk.builder.next_phase_at == InvalidTransactionId);

I could not reproduce this issue with your script in my environment. I think it's better to include the reproducible test case in the patch but I'm not sure how to do that without adding sleep/gdb attach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
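
The failure mode described in the message above can be modelled in a few lines. What follows is a toy Python sketch, not PostgreSQL code: `ToyBuilder`, `find_snapshot`, `restore`, `serialize`, `run`, the state strings and the xid value 742 are all hypothetical stand-ins chosen for illustration, and only the `next_phase_at` bookkeeping is modelled. It assumes the sequence given in the scenario: a first RUNNING_XACTS record with running transactions sets `next_phase_at`, the next one restores a consistent snapshot serialized by another decoding session, and a later serialization then checks the field.

```python
from dataclasses import dataclass, replace

INVALID_XID = 0  # stands in for InvalidTransactionId


@dataclass
class ToyBuilder:
    """Toy stand-in for the SnapBuild struct; only the fields needed here."""
    state: str = "START"
    next_phase_at: int = INVALID_XID


def find_snapshot(builder: ToyBuilder, next_xid: int) -> None:
    """Running transactions seen: start building and record the next phase."""
    builder.state = "BUILDING_SNAPSHOT"
    builder.next_phase_at = next_xid


def restore(builder: ToyBuilder, ondisk: ToyBuilder, reset_next_phase: bool) -> None:
    """Adopt a consistent snapshot that another session wrote to disk."""
    # consistent snapshots have no next phase
    assert ondisk.next_phase_at == INVALID_XID
    builder.state = ondisk.state
    if reset_next_phase:
        # the reset discussed in this thread (toy version)
        builder.next_phase_at = INVALID_XID


def serialize(builder: ToyBuilder) -> ToyBuilder:
    """Toy version of the check that fails in the report."""
    if builder.next_phase_at != INVALID_XID:
        raise AssertionError("next_phase_at still set")
    return replace(builder)  # return a copy, as if written to disk


def run(reset_next_phase: bool) -> str:
    on_disk = serialize(ToyBuilder(state="CONSISTENT"))
    builder = ToyBuilder()
    find_snapshot(builder, next_xid=742)          # first RUNNING_XACTS record
    restore(builder, on_disk, reset_next_phase)   # next RUNNING_XACTS record
    try:
        serialize(builder)                        # later serialization attempt
    except AssertionError as exc:
        return f"assertion failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(run(reset_next_phase=False))
    print(run(reset_next_phase=True))
```

Without the reset, the stale value left by `find_snapshot` survives the restore and trips the check; with the reset placed after the on-disk assertion, the same sequence passes.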
RE: BUG #17695: Failed Assert in logical replication snapbuild.
From: "Hayato Kuroda (Fujitsu)"
Dear Bowenshi,

> Thanks for your advice! I updated my test script; now it reproduces the problem 100% of the time on my computer.

Thanks for the update. Note that even with the updated script in my environment (HEAD, 2cf41cd30), I could not reproduce the issue reliably. The publisher crashed only rarely and produced a core file. The failed assertion was the same as the one you first reported [1], not the newer one.

[1]: https://www.postgresql.org/message-id/17695-6be9277c9295985f%40postgresql.org

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Nov 25, 2022 at 4:44 PM bowenshi <zxwsbg@qq.com> wrote:
>
> Thanks for your advice! I updated my test script; now it reproduces the problem 100% of the time on my computer.

Hmm, I could not reproduce the issue even with the updated patch.

> The new script and the fix are both in the patch. However, I hit a new problem after adding the fix. It fails with a new trap, as follows:
>
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 593, PID: 1576)
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(ExceptionalCondition+0xb9)[0xb1ca28]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(SnapBuildInitialSnapshot+0x1b3)[0x8f3c79]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x9136b9]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(exec_replication_command+0x398)[0x91456a]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostgresMain+0x7be)[0x984e4a]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c0d5b]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c06cd]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8bc9de]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostmasterMain+0x117a)[0x8bc2b5]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x7bdb13]
>

Testing this script in some environments, I sometimes get this assertion failure. I think this is the same issue that has been reported in this thread[1]. I'll investigate this issue as well and share my findings.

[1] https://www.postgresql.org/message-id/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le%3DZUWYAYmdfw%40mail.gmail.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
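
For readers unfamiliar with the check named in the second trap: transaction IDs are 32-bit counters that wrap around, so "precedes or equals" is not a plain integer comparison. The Python sketch below models the comparison as I understand it, namely a signed 32-bit difference for normal xids and a plain comparison when either side is a special xid below 3. The function names are hypothetical and this is an illustration of the idea only; it says nothing about why `safeXid` and `snap->xmin` ended up in the wrong order in the report.

```python
FIRST_NORMAL_XID = 3  # assumption: xids below this are special (invalid, bootstrap, frozen)


def _to_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def xid_precedes_or_equals(id1: int, id2: int) -> bool:
    """Toy model of a wraparound-aware 'id1 <= id2' on 32-bit xids."""
    if id1 < FIRST_NORMAL_XID or id2 < FIRST_NORMAL_XID:
        # special xids are compared as plain unsigned integers
        return id1 <= id2
    # normal xids: the sign of the 32-bit difference decides the order
    return _to_int32(id1 - id2) <= 0


if __name__ == "__main__":
    print(xid_precedes_or_equals(100, 200))
    print(xid_precedes_or_equals(0xFFFFFFF0, 10))
```

In this model an xid just below the wraparound point counts as preceding a small xid just after it, which is why the second call in the demo is treated as "precedes".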
This thread has been idle for quite some time, and the test isn't running at all, whereas there is more activity on the linked thread. AFAICT from the thread this issue is reproducible in HEAD so the test should be for master right? Should this be closed in favor of the other patch?

A quick glance at the failing test, assuming we want it for master:

+use TestLib;
+use PostgresNode;

This needs updating after b3b4d8e68ae which moved the test modules to a proper namespace.

+use DBI;

Not used in the test, and I can't see why it should be?

+use Test::More tests => 1;

We've moved away from explicit plans in favor of done_testing().

--
Daniel Gustafsson
On Mon, Mar 27, 2023 at 8:49 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> This thread has been idle for quite some time, and the test isn't running at
> all, whereas there is more activity on the linked thread. AFAICT from the
> thread this issue is reproducible in HEAD so the test should be for master
> right? Should this be closed in favor of the other patch?
>

Reviewing the thread, one assertion failure was reported[1] and then another assertion failure was reported[2]. But the latter was already under discussion on another thread[3]. So I think we should tackle the former (i.e. the originally reported issue) on this thread.

When it comes to the original issue, I already shared the reproducible steps[4] and I've confirmed again with the steps that the issue still happens on 14 or later and the patch . However, I haven't found a way to reproduce it without sleep/gdb attach. The fix is straightforward and it seems OK to me but the regression test part might need more discussion.

Regards,

[1] https://www.postgresql.org/message-id/17695-6be9277c9295985f%40postgresql.org
[2] https://www.postgresql.org/message-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406%40qq.com
[3] https://www.postgresql.org/message-id/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
[4] https://www.postgresql.org/message-id/CAD21AoCsxvV3Mpzv8Q3hG7CdQZ7vSBWyEFLgY3QZARRDcgzVwA%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
Hello Sawada-san,

17.05.2023 08:34, Masahiko Sawada wrote:
>
> When it comes to the original issue, I already shared the reproducible
> steps[4] and I've confirmed again with the steps that the issue still
> happens on 14 or later and the patch . However I don't find a way to
> reproduce it without sleep/gdb attach.

I can easily (without gdb and sleep()) reproduce the issue on master with the following script:

numclients=10
rm -rf contrib/test_decoding_*
for ((c=1;c<=numclients;c++)); do
  cp -r contrib/test_decoding contrib/test_decoding_$c
done

for ((c=1;c<=numclients;c++)); do
  EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
done
wait

It leads to:

TRAP: failed Assert("builder->next_phase_at == InvalidTransactionId"), File: "snapbuild.c", Line: 1628, PID: 907918
...
2023-05-18 16:23:33.290 MSK [907502] LOG: server process (PID 907918) was terminated by signal 6: Aborted
2023-05-18 16:23:33.290 MSK [907502] DETAIL: Failed process was running: SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats1', NULL, NULL, 'skip-empty-xacts', '1');
...
Core was generated by `postgres: postgres regress_10 [local] SELECT '.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/907918' in core file too small.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140405033059264, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fb29a0cc476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fb29a0b27f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000557371bd57bb in ExceptionalCondition (conditionName=conditionName@entry=0x557371d56860 "builder->next_phase_at == InvalidTransactionId", fileName=fileName@entry=0x557371d572e7 "snapbuild.c", lineNumber=lineNumber@entry=1628) at assert.c:66
#6 0x0000557371a28a29 in SnapBuildSerialize (builder=builder@entry=0x557372879158, lsn=lsn@entry=312723008) at snapbuild.c:1628
#7 0x0000557371a2a657 in SnapBuildProcessRunningXacts (builder=builder@entry=0x557372879158, lsn=312723008, running=running@entry=0x557373095190) at snapbuild.c:1230
...

If it would be helpful, I can reduce it to concrete SQL queries.

Best regards,
Alexander
On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Sawada-san,
>
> 17.05.2023 08:34, Masahiko Sawada wrote:
> >
> > When it comes to the original issue, I already shared the reproducible
> > steps[4] and I've confirmed again with the steps that the issue still
> > happens on 14 or later and the patch . However I don't find a way to
> > reproduce it without sleep/gdb attach.
>
> I can easily (without gdb and sleep()) reproduce the issue on master with
> the following script:
> numclients=10
> rm -rf contrib/test_decoding_*
> for ((c=1;c<=numclients;c++)); do
>   cp -r contrib/test_decoding contrib/test_decoding_$c
> done
>
> for ((c=1;c<=numclients;c++)); do
>   EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
> done
> wait

Thank you for sharing the script. But it seems not stable as I could not reproduce the issue in my environment. I think we need a stable reproducer so that we can include it in core regression tests. Or it may be okay not to include it if we could not find a convenient way and the fix is trivial.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
22.05.2023 03:56, Masahiko Sawada wrote:
> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
>> I can easily (without gdb and sleep()) reproduce the issue on master with
>> the following script:
>> ...
> Thank you for sharing the script. But it seems not stable as I could
> not reproduce the issue in my environment. I think we need a stable
> reproducer so that we can include it in core regression tests. Or it
> may be okay not to include it if we could not find a convenient way
> and the fix is trivial.

I've come up with a minimal reproducer:

numclients=40
for ((c=1;c<=numclients;c++)); do
  createdb regress_$c
done

for ((c=1;c<=numclients;c++)); do
  ( echo "
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_$c', 'test_decoding');
SELECT data FROM pg_logical_slot_get_changes('regression_slot_$c', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
" | psql -d regress_$c >psql-$c.log ) &
done
wait
grep TRAP server.log

(I've set
fsync = off
wal_level = logical
in postgresql.conf)

When using a build made with ASAN (and gcc-12), I get several asserts at once:

grep TRAP server.log | wc -l
12

Without ASAN, I get no failures with numclients=40, but still get a series of them with numclients=80...

It's hardly suitable for a regression test, but it clearly demonstrates the issue without using gdb. With the fix from [1] applied, I've got no failures, even with numclients=100, for 10 runs.

I also think that the fix is simple enough to be committed without a complicated/resource-intensive regression test.

[1] https://www.postgresql.org/message-id/CAD21AoDNv09ZMr-E%2BfNzhduvkE6eK2fjCRA7wJHOhF8APH5JdQ%40mail.gmail.com

Best regards,
Alexander
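
Since two different assertions show up in this thread, it can help to group the TRAP lines by assertion rather than only counting them with `grep TRAP server.log | wc -l`. The sketch below is one hedged way to do that in Python. The regular expression is written against the two TRAP line formats quoted in this thread (`FailedAssertion(...)` and `failed Assert(...)`) and may not match other server versions; `TRAP_RE`, `summarize_traps`, `SAMPLE` and the second PID in the sample are hypothetical, introduced only for this example.

```python
import re
from collections import Counter

# Matches both formats quoted in this thread:
#   TRAP: FailedAssertion("cond", File: "f.c", Line: N, PID: P)
#   TRAP: failed Assert("cond"), File: "f.c", Line: N, PID: P
TRAP_RE = re.compile(
    r'TRAP: (?:FailedAssertion|failed Assert)\("(?P<cond>.+?)"\)?, '
    r'File: "(?P<file>[^"]+)", Line: (?P<line>\d+)'
)


def summarize_traps(log_text: str) -> Counter:
    """Count TRAP lines per (condition, file, line) in a server log."""
    counts = Counter()
    for match in TRAP_RE.finditer(log_text):
        key = (match["cond"], match["file"], int(match["line"]))
        counts[key] += 1
    return counts


# Sample input built from lines quoted in this thread (second PID is made up).
SAMPLE = (
    'TRAP: failed Assert("builder->next_phase_at == InvalidTransactionId"), '
    'File: "snapbuild.c", Line: 1628, PID: 907918\n'
    'LOG: server process (PID 907918) was terminated by signal 6: Aborted\n'
    'TRAP: failed Assert("builder->next_phase_at == InvalidTransactionId"), '
    'File: "snapbuild.c", Line: 1628, PID: 907919\n'
)

if __name__ == "__main__":
    for (cond, file, line), n in summarize_traps(SAMPLE).most_common():
        print(f"{n:4d}  {file}:{line}  {cond}")
```

To run it against a real log, read the file into a string and pass it to `summarize_traps`; a run with the fix applied should then produce an empty counter, matching the "no failures" result reported above.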
> On 23 May 2023, at 13:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 22.05.2023 03:56, Masahiko Sawada wrote:
>> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>>> I can easily (without gdb and sleep()) reproduce the issue on master with
>>> the following script:
>>> ...
>> Thank you for sharing the script. But it seems not stable as I could
>> not reproduce the issue in my environment. I think we need a stable
>> reproducer so that we can include it in core regression tests. Or it
>> may be okay not to include it if we could not find a convenient way
>> and the fix is trivial.
>
> I've come up with a minimal reproducer:

Thanks for the reproducer, I was able to reproduce this in HEAD and v16.

> It's hardly suitable for a regression test, but it clearly demonstrates the
> issue without using gdb. With the fix from [1] applied, I've got no failures,
> even with numclients=100, for 10 runs.
>
> I also think that the fix is simple enough to be committed without a
> complicated/resource-intensive regression test.

I'm not convinced we need a regression test for this as it would be very expensive and potentially brittle for older/slower buildfarm members while giving few gains.

I've applied this to HEAD and backpatched it to v16.

--
Daniel Gustafsson
On Wed, Jul 5, 2023 at 12:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 23 May 2023, at 13:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > 22.05.2023 03:56, Masahiko Sawada wrote:
> >> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> >>
> >>> I can easily (without gdb and sleep()) reproduce the issue on master with
> >>> the following script:
> >>> ...
> >> Thank you for sharing the script. But it seems not stable as I could
> >> not reproduce the issue in my environment. I think we need a stable
> >> reproducer so that we can include it in core regression tests. Or it
> >> may be okay not to include it if we could not find a convenient way
> >> and the fix is trivial.
> >
> > I've come up with a minimal reproducer:
>
> Thanks for the reproducer, I was able to reproduce this in HEAD and v16.
>
> > It's hardly suitable for a regression test, but it clearly demonstrates the
> > issue without using gdb. With the fix from [1] applied, I've got no failures,
> > even with numclients=100, for 10 runs.
> >
> > I also think that the fix is simple enough to be committed without a
> > complicated/resource-intensive regression test.
>
> I'm not convinced we need a regression test for this as it would be very
> expensive and potentially brittle for older/slower buildfarm members while
> giving few gains.
>
> I've applied this to HEAD and backpatched it to v16.

Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Daniel Gustafsson wrote on 2023-07-04 18:43:
>> On 23 May 2023, at 13:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>> 22.05.2023 03:56, Masahiko Sawada wrote:
>>> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>>
>>>> I can easily (without gdb and sleep()) reproduce the issue on master with
>>>> the following script:
>>>> ...
>>> Thank you for sharing the script. But it seems not stable as I could
>>> not reproduce the issue in my environment. I think we need a stable
>>> reproducer so that we can include it in core regression tests. Or it
>>> may be okay not to include it if we could not find a convenient way
>>> and the fix is trivial.
>>
>> I've come up with a minimal reproducer:
>
> Thanks for the reproducer, I was able to reproduce this in HEAD and v16.
>
>> It's hardly suitable for a regression test, but it clearly demonstrates the
>> issue without using gdb. With the fix from [1] applied, I've got no failures,
>> even with numclients=100, for 10 runs.
>>
>> I also think that the fix is simple enough to be committed without a
>> complicated/resource-intensive regression test.
>
> I'm not convinced we need a regression test for this as it would be very
> expensive and potentially brittle for older/slower buildfarm members while
> giving few gains.
>
> I've applied this to HEAD and backpatched it to v16.
>

Hi.

It seems we've hit this issue on 14.8. Is there any reason why the fix wasn't applied to earlier versions?

--
Best regards,
Alexander Pyhalov,
Postgres Professional