Thread: BUG #17695: Failed Assert in logical replication snapbuild.

BUG #17695: Failed Assert in logical replication snapbuild.

From

PG Bug reporting form

Date:

24 November 2022, 10:15:55

The following bug has been logged on the website:

Bug reference:      17695
Logged by:          施博文
Email address:      zxwsbg@qq.com
PostgreSQL version: 14.6
Operating system:   centos
Description:

In PG14 and later, I noticed that SnapBuildRestore doesn't set
builder->next_phase_at = InvalidTransactionId.

But the SnapBuildSerialize function asserts this condition:

Assert(builder->next_phase_at == InvalidTransactionId);

This can cause an assertion failure. I have reproduced it with a Perl test case,
which I will send later. The failure looks like this:

TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1604, PID: 29974)
postgres: master: walsender postgres [local] START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
postgres: master: walsender postgres [local] START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
postgres: master: walsender postgres [local] START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
postgres: master: walsender postgres [local] START_REPLICATION(exec_replication_command+0x42c)[0x914593]
postgres: master: walsender postgres [local] START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
postgres: master: walsender postgres [local] START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

"施博文"

Date:

24 November 2022, 10:19:17

I wrote this Perl test case to reproduce the problem; it needs both a superuser and a database named 'postgres'.

I also wrote a patch to fix the problem.

Thanks.

Attachment

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

vignesh C

Date:

24 November 2022, 12:39:15

On Thu, 24 Nov 2022 at 15:58, 施博文 <zxwsbg@qq.com> wrote:
>
> I wrote this Perl test case to reproduce the problem; it needs both a superuser and a database named 'postgres'.
>
> I also wrote a patch to fix the problem.

Thanks for sharing the script. I was able to reproduce the problem in
PG 14.6 with your scripts. I was not able to reproduce the problem in
HEAD after making the required changes in the script for HEAD. I was
not sure if this issue is only in 14.6 or if the issue is present in
HEAD too. Since the Assert is there and the code changes you suggested
are not in HEAD, I thought this issue might be present in HEAD too.
Does it work for you on HEAD?

Regards,
Vignesh

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

"施博文"

Date:

24 November 2022, 13:09:56

My test branch is the head of REL_14_STABLE, at commit c93254424f288557eeef13343be8f72536cb9ffe.

I think the problem still exists, since the source code of the SnapBuildRestore function has not changed. The problem is hard to reproduce; maybe you can try it a few more times.

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

25 November 2022, 05:58:40

On Thu, Nov 24, 2022 at 7:28 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17695
> Logged by:          施博文
> Email address:      zxwsbg@qq.com
> PostgreSQL version: 14.6
> Operating system:   centos
> Description:
>
> In PG14 and later, I noticed that SnapBuildRestore doesn't set
> builder->next_phase_at = InvalidTransactionId.
>
> But the SnapBuildSerialize function asserts this condition:
>
> Assert(builder->next_phase_at == InvalidTransactionId);
>
> This can cause an assertion failure. I have reproduced it with a Perl test case,
> which I will send later. The failure looks like this:
>
> TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1604, PID: 29974)
> postgres: master: walsender postgres [local] START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
> postgres: master: walsender postgres [local] START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
> postgres: master: walsender postgres [local] START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
> postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
> postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
> postgres: master: walsender postgres [local] START_REPLICATION(exec_replication_command+0x42c)[0x914593]
> postgres: master: walsender postgres [local] START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
> postgres: master: walsender postgres [local] START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
> postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
> postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
>

Thank you for reporting the issue!

I could reproduce this issue with a small change and the following steps:

1. Add sleep after setting SNAPBUILD_BUILDING_SNAPSHOT.

@@ -1409,6 +1409,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
                           running->xcnt, running->nextXid)));

        SnapBuildWaitSnapshot(running, running->nextXid);
+
+       elog(LOG, "sleep 10s");
+       pg_usleep(10 * 1000000);
    }

2. Create and start a database instance, then create a table and the
replication slot 's1'.
create table test (i int);
select pg_create_logical_replication_slot('s1', 'test_decoding');

3. Create another replication slot 's2' while a transaction is running.

tx-1:
begin;
insert into test values (1);

    tx-2:
    select pg_create_logical_replication_slot('s2', 'test_decoding');

checkpoint;
commit;

Note that the replication slot creation will wait for 10 sec after
committing tx-1. Please go to the next step once the slot has been
created.

4. Start decoding on slot 's2'.
select pg_logical_slot_get_changes('s2', null, null);

The logical decoding will sleep. Please go to the next step after
seeing the log "sleep 10s".

5. While the logical decoding started at step 4 is sleeping, start
decoding on slot 's1'.
select pg_logical_slot_get_changes('s1', null, null);

6. After the decoding on slot 's2' wakes up, it fails due to the
assertion check.

This scenario simulates the case where logical decoding (re)starts at
a RUNNING_XACTS record that has running transactions, and then
restores the serialized snapshot when decoding the next RUNNING_XACTS
record. Since we don't reset builder->next_phase_at when restoring a
serialized snapshot, the assertion check in SnapBuildSerialize fails.

Regarding the proposed patch, I've confirmed it fixes this issue. But
I think it's better to reset builder->next_phase_at right after the
following assertion check:

   /* consistent snapshots have no next phase */
   Assert(ondisk.builder.next_phase_at == InvalidTransactionId);
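
To illustrate why the reset matters, here is a minimal, self-contained C
sketch of the state involved. It is a toy model under stated assumptions,
not the PostgreSQL source: ToyState, ToyBuilder, toy_restore and
toy_can_serialize are hypothetical names made up for this example, and
only next_phase_at, InvalidTransactionId and the two assertions are taken
from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumptions). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical, reduced set of builder states. */
typedef enum ToyState
{
    TOY_BUILDING_SNAPSHOT,
    TOY_CONSISTENT
} ToyState;

/* Hypothetical miniature of the snapshot builder: only the two fields
 * that matter for this assertion. */
typedef struct ToyBuilder
{
    ToyState      state;
    TransactionId next_phase_at;
} ToyBuilder;

/*
 * Model of restoring a serialized (consistent) snapshot into a builder.
 * With reset_next_phase == false the stale next_phase_at survives, which
 * is the behavior described in this thread; with true it is cleared right
 * after the check of the on-disk state.
 */
void
toy_restore(ToyBuilder *builder, const ToyBuilder *ondisk, bool reset_next_phase)
{
    /* consistent snapshots have no next phase */
    assert(ondisk->next_phase_at == InvalidTransactionId);
    if (reset_next_phase)
        builder->next_phase_at = InvalidTransactionId;

    builder->state = ondisk->state;
}

/* Returns true when the condition asserted before serializing holds. */
bool
toy_can_serialize(const ToyBuilder *builder)
{
    return builder->state == TOY_CONSISTENT &&
        builder->next_phase_at == InvalidTransactionId;
}
```

Starting from a builder in TOY_BUILDING_SNAPSHOT with next_phase_at set to
some arbitrary xid (say 742), toy_restore(..., false) leaves
toy_can_serialize() false, which corresponds to the failing assertion,
while toy_restore(..., true) makes it true.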

I could not reproduce this issue with your script in my environment. I
think it's better to include the reproducible test case in the patch
but I'm not sure how to do that without adding sleep/gdb attach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

RE: BUG #17695: Failed Assert in logical replication snapbuild.

From

"Hayato Kuroda (Fujitsu)"

Date:

25 November 2022, 09:07:56

Dear Bowenshi,

> Thanks for your advice! I updated my test script; it now reproduces the problem 100% of the time on my computer.

Thanks for the update. Note that even with the updated script in my environment (HEAD, 2cf41cd30),
I could not reproduce the issue reliably. The publisher crashed and produced a core file only rarely.

The failed assertion was the same as the one you first reported[1], not the newer one.

[1]: https://www.postgresql.org/message-id/17695-6be9277c9295985f%40postgresql.org 

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

28 November 2022, 05:07:17

On Fri, Nov 25, 2022 at 4:44 PM bowenshi <zxwsbg@qq.com> wrote:
>
> Thanks for your advice! I updated my test script; it now reproduces the problem 100% of the time on my computer.

Hmm, I could not reproduce the issue even with the updated patch.

> The new script and fix code are both in the patch. However, I met a new problem after adding the fix code. It fails with a new trap, as follows:
>
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 593, PID: 1576)
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(ExceptionalCondition+0xb9)[0xb1ca28]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(SnapBuildInitialSnapshot+0x1b3)[0x8f3c79]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x9136b9]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(exec_replication_command+0x398)[0x91456a]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostgresMain+0x7be)[0x984e4a]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c0d5b]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c06cd]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8bc9de]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostmasterMain+0x117a)[0x8bc2b5]
> postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x7bdb13]
>

Testing this script in some environments, I sometimes get this
assertion failure. I think this is the same issue that has been
reported in this thread[1]. I'll investigate this issue as well and
share my findings.

[1] https://www.postgresql.org/message-id/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le%3DZUWYAYmdfw%40mail.gmail.com

Regards,


-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Daniel Gustafsson

Date:

27 March 2023, 11:49:01

This thread has been idle for quite some time, and the test isn't running at
all, whereas there is more activity on the linked thread.  AFAICT from the
thread this issue is reproducible in HEAD so the test should be for master
right?  Should this be closed in favor of the other patch?

A quick glance at the failing test, assuming we want it for master:

+use TestLib;
+use PostgresNode;
This needs updating after b3b4d8e68ae which moved the test modules to a proper namespace.

+use DBI;
Not used in the test, and I can't see why it should be?

+use Test::More tests => 1;
We've moved away from explicit plans in favor of done_testing().

--
Daniel Gustafsson

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

17 May 2023, 05:34:35

On Mon, Mar 27, 2023 at 8:49 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> This thread has been idle for quite some time, and the test isn't running at
> all, whereas there is more activity on the linked thread.  AFAICT from the
> thread this issue is reproducible in HEAD so the test should be for master
> right?  Should this be closed in favor of the other patch?
>

Reviewing the thread, one assertion failure was reported[1] and then
another assertion failure was reported[2]. But the latter one was
already under discussion on another thread[3]. So I think we should
tackle the former one (i.e. the originally reported issue) on this thread.

When it comes to the original issue, I already shared the reproducible
steps[4] and I've confirmed again with the steps that the issue still
happens on 14 or later and the patch . However I don't find a way to
reproduce it without sleep/gdb attach.

The fix is straightforward and it seems OK to me but the regression
test part might need more discussion.

Regards,

[1] https://www.postgresql.org/message-id/17695-6be9277c9295985f%40postgresql.org
[2] https://www.postgresql.org/message-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406%40qq.com
[3] https://www.postgresql.org/message-id/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
[4] https://www.postgresql.org/message-id/CAD21AoCsxvV3Mpzv8Q3hG7CdQZ7vSBWyEFLgY3QZARRDcgzVwA%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

reset_next_phase_at.patch

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Alexander Lakhin

Date:

18 May 2023, 14:00:00

Hello Sawada-san,

17.05.2023 08:34, Masahiko Sawada wrote:
>
> When it comes to the original issue, I already shared the reproducible
> steps[4] and I've confirmed again with the steps that the issue still
> happens on 14 or later and the patch . However I don't find a way to
> reproduce it without sleep/gdb attach.

I can easily (without gdb and sleep()) reproduce the issue on master with
the following script:
numclients=10
rm -rf contrib/test_decoding_*
for ((c=1;c<=numclients;c++)); do
   cp -r contrib/test_decoding contrib/test_decoding_$c
done

for ((c=1;c<=numclients;c++)); do
   EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
done
wait

It leads to:
TRAP: failed Assert("builder->next_phase_at == InvalidTransactionId"), File: "snapbuild.c", Line: 1628, PID: 907918
...
2023-05-18 16:23:33.290 MSK [907502] LOG:  server process (PID 907918) was terminated by signal 6: Aborted
2023-05-18 16:23:33.290 MSK [907502] DETAIL:  Failed process was running: SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats1', NULL, NULL, 'skip-empty-xacts', '1');

...
Core was generated by `postgres: postgres regress_10 [local] SELECT                  '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/907918' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140405033059264) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140405033059264, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fb29a0cc476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fb29a0b27f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000557371bd57bb in ExceptionalCondition (
     conditionName=conditionName@entry=0x557371d56860 "builder->next_phase_at == InvalidTransactionId",
     fileName=fileName@entry=0x557371d572e7 "snapbuild.c", lineNumber=lineNumber@entry=1628) at assert.c:66
#6  0x0000557371a28a29 in SnapBuildSerialize (builder=builder@entry=0x557372879158, lsn=lsn@entry=312723008)
     at snapbuild.c:1628
#7  0x0000557371a2a657 in SnapBuildProcessRunningXacts (builder=builder@entry=0x557372879158, lsn=312723008,
     running=running@entry=0x557373095190) at snapbuild.c:1230
...

If it would be helpful, I can reduce it to concrete sql queries.

Best regards,
Alexander

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

22 May 2023, 00:56:27

On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Sawada-san,
>
> 17.05.2023 08:34, Masahiko Sawada wrote:
> >
> > When it comes to the original issue, I already shared the reproducible
> > steps[4] and I've confirmed again with the steps that the issue still
> > happens on 14 or later and the patch . However I don't find a way to
> > reproduce it without sleep/gdb attach.
>
> I can easily (without gdb and sleep()) reproduce the issue on master with
> the following script:
> numclients=10
> rm -rf contrib/test_decoding_*
> for ((c=1;c<=numclients;c++)); do
>    cp -r contrib/test_decoding contrib/test_decoding_$c
> done
>
> for ((c=1;c<=numclients;c++)); do
>    EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
> done
> wait

Thank you for sharing the script. But it seems not stable as I could
not reproduce the issue in my environment. I think we need a stable
reproducer so that we can include it in core regression tests. Or it
may be okay not to include it if we could not find a convenient way
and the fix is trivial.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Alexander Lakhin

Date:

23 May 2023, 11:00:00

22.05.2023 03:56, Masahiko Sawada wrote:
> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
>> I can easily (without gdb and sleep()) reproduce the issue on master with
>> the following script:
>> ...
> Thank you for sharing the script. But it seems not stable as I could
> not reproduce the issue in my environment. I think we need a stable
> reproducer so that we can include it in core regression tests. Or it
> may be okay not to include it if we could not find a convenient way
> and the fix is trivial.

I've come up with a minimal reproducer:
numclients=40
for ((c=1;c<=numclients;c++)); do
createdb regress_$c
done

for ((c=1;c<=numclients;c++)); do
(
echo "
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_$c', 'test_decoding');
SELECT data FROM pg_logical_slot_get_changes('regression_slot_$c', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
" | psql -d regress_$c >psql-$c.log
) &
done
wait
grep TRAP server.log

(I've set
fsync = off
wal_level = logical
in postgresql.conf)

When using a build made with ASAN (and gcc-12), I get several asserts at once:
grep TRAP server.log  | wc -l
12
Without ASAN, I get no failures with numclients=40, but I still get a series of
them with numclients=80...

It's hardly suitable for the regression test, but it clearly demonstrates the
issue without using gdb. With the fix from [1] applied, I've got no failures,
even with numclients=100, for 10 runs.

I also think that the fix is simple enough to be committed without a
complicated/resource-intensive regression test.

[1] https://www.postgresql.org/message-id/CAD21AoDNv09ZMr-E%2BfNzhduvkE6eK2fjCRA7wJHOhF8APH5JdQ%40mail.gmail.com

Best regards,
Alexander

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Daniel Gustafsson

Date:

04 July 2023, 15:43:17

> On 23 May 2023, at 13:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 22.05.2023 03:56, Masahiko Sawada wrote:
>> On Thu, May 18, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>>> I can easily (without gdb and sleep()) reproduce the issue on master with
>>> the following script:
>>> ...
>> Thank you for sharing the script. But it seems not stable as I could
>> not reproduce the issue in my environment. I think we need a stable
>> reproducer so that we can include it in core regression tests. Or it
>> may be okay not to include it if we could not find a convenient way
>> and the fix is trivial.
>
> I've come up with a minimal reproducer:

Thanks for the reproducer, I was able to reproduce this in HEAD and v16.

> It's hardly suitable for the regression test, but it clearly demonstrates the
> issue without using gdb. With the fix from [1] applied, I've got no failures,
> even with numclients=100, for 10 runs.
>
> I also think that the fix is simple enough to be committed without a
> complicated/resource-intensive regression test.

I'm not convinced we need a regression test for this as it would be very
expensive and potentially brittle for older/slower buildfarm members while
giving few gains.

I've applied this to HEAD and backpatched it to v16.

--
Daniel Gustafsson

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

05 July 2023, 01:23:22

On Wed, Jul 5, 2023 at 12:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> I've applied this to HEAD and backpatched it to v16.

Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Alexander Pyhalov

Date:

19 September 2023, 08:15:29

Daniel Gustafsson wrote on 2023-07-04 18:43:
> I've applied this to HEAD and backpatched it to v16.
> 

Hi.
It seems we've managed to get this issue on 14.8. Is there any reason 
why it wasn't applied to earlier versions?
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Daniel Gustafsson

Date:

30 April 2025, 21:19:21

> On 19 Sep 2023, at 10:15, Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote:
> Daniel Gustafsson wrote on 2023-07-04 18:43:

>> I've applied this to HEAD and backpatched it to v16.
>
> It seems we've managed to get this issue on 14.8. Is there any reason why it wasn't applied to earlier versions?

(Revisiting a very old thread) This report was missed at the time, and it has
been off-list reported to me to have been identified in 14 and 15 as well, so I
plan on backpatching this fix to 14 and 15.

--
Daniel Gustafsson

Re: BUG #17695: Failed Assert in logical replication snapbuild.

From

Masahiko Sawada

Date:

30 April 2025, 22:55:35

On Wed, Apr 30, 2025 at 11:19 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 19 Sep 2023, at 10:15, Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote:
> > Daniel Gustafsson wrote on 2023-07-04 18:43:
>
> >> I've applied this to HEAD and backpatched it to v16.
> >
> > It seems we've managed to get this issue on 14.8. Is there any reason why it wasn't applied to earlier versions?
>
> (Revisiting a very old thread) This report was missed at the time,

Yeah, that's bad.

> and it has
> been off-list reported to me to have been identified in 14 and 15 as well, so I
> plan on backpatching this fix to 14 and 15.

+1. I'd confirmed this issue existed on v14 or later[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDNv09ZMr-E%2BfNzhduvkE6eK2fjCRA7wJHOhF8APH5JdQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com