From: "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>
Subject: RE: long-standing data loss bug in initial sync of logical replication
Msg-id: OSCPR01MB14966EB5F3B416E4689FB5A67F5D32@OSCPR01MB14966.jpnprd01.prod.outlook.com
In response to: RE: long-standing data loss bug in initial sync of logical replication ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List: pgsql-hackers
Hi hackers,

> Our team (mainly Shlok) did performance testing with several workloads.
> Let me share them on -hackers. We did it for the master and REL_17
> branches, and in this post master's results will be discussed.

I posted benchmark results for master [1]. This post contains the results for a
back branch, specifically REL_17_STABLE.

The observed trend is the same as on master:
Frequent DDL on published tables can cause a huge regression, but this is expected.
In the other cases the regression is small or absent.

Used source
===========
The base code was the HEAD of REL_17_STABLE, and the compared patch was v16.
The largest difference is that master tries to preserve relsync cache entries as
much as possible, whereas REL_17_STABLE discards them more aggressively.
Please refer to the recent commits 3abe9d and 588acf6.

The executed workloads were mostly the same as in the master case.
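
For context, each measurement had roughly the following shape. This is a minimal
sketch from my side, not the exact scripts: the table, publication, and slot
names are placeholders, and the SQL decoding interface stands in for whatever
receiver the scripts actually use.

-- One-time setup (names are placeholders)
CREATE TABLE tbl_pub (a int);
CREATE PUBLICATION pub FOR TABLE tbl_pub;
SELECT pg_create_logical_replication_slot('slot', 'pgoutput');

-- Open N concurrent transactions (N = 50 .. 2000), one per session:
BEGIN;
INSERT INTO tbl_pub VALUES (1);
-- ... the workload's DDL (if any) runs here, then all sessions COMMIT.

-- Time how long decoding the accumulated WAL takes:
\timing on
SELECT count(*) FROM pg_logical_slot_get_binary_changes('slot', NULL, NULL,
    'proto_version', '4', 'publication_names', 'pub');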

-----

Workload A: No DDL operation done in concurrent session
======================================
No regression was observed in this workload.

Concurrent txn     | Head (sec)   | Patch (sec)  | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50                 | 0.013706     | 0.013398     | -2.2496
100                | 0.014811     | 0.014821     | 0.0698
500                | 0.018288     | 0.018318     | 0.1640
1000               | 0.022613     | 0.022622     | 0.0413
2000               | 0.031812     | 0.031891     | 0.2504


-----

Workload B: DDL is happening but is unrelated to publication
========================================
A small regression was observed when the concurrency was high, because the DDL
transaction sends invalidation messages to all the concurrent transactions; a
sketch of this workload follows the table.

Concurrent txn     | Head (sec)   | Patch (sec)  | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50                 | 0.013159     | 0.013305     | 1.1120
100                | 0.014718     | 0.014725     | 0.0476
500                | 0.018134     | 0.019578     | 7.9628
1000               | 0.022762     | 0.025228     | 10.8324
2000               | 0.032326     | 0.035638     | 10.2467
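
As a sketch of what "unrelated" means here (hypothetical object names, reusing
the setup sketch above): the DDL touches a table that no publication covers,
yet its invalidations still reach every in-progress transaction.

-- N sessions each hold an open transaction on the published table:
BEGIN;
INSERT INTO tbl_pub VALUES (1);

-- Meanwhile another session runs DDL on a table outside any publication:
CREATE TABLE tbl_unpub (a int);
ALTER TABLE tbl_unpub ADD COLUMN b int;

-- All sessions then COMMIT and decoding is timed as above. Distributing
-- these invalidations to all N concurrent transactions is where the
-- overhead at high concurrency comes from.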


-----

Workload C. DDL is happening on publication but on unrelated table
============================================
We did not run this workload because we expected it to produce the same results
as D; commit 588acf6 is needed to optimize this workload.

-----

Workload D. DDL is happening on the related published table,
            and one insert is done per invalidation
=========================================
This workload showed a huge regression, the same as on the master branch. This
is expected because the distributed invalidation messages require all concurrent
transactions to rebuild their relsync caches; a sketch follows the table.

Concurrent txn     | Head (sec)   | Patch (sec)  | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50                 | 0.013496     | 0.015588     | 15.5034
100                | 0.015112     | 0.018868     | 24.8517
500                | 0.018483     | 0.038714     | 109.4536
1000               | 0.023402     | 0.063735     | 172.3524
2000               | 0.031596     | 0.110860     | 250.8720
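
To make the pattern concrete, a hypothetical driver loop (assuming ALTER
PUBLICATION as the invalidating DDL; the actual script may differ): it
alternates one invalidation with one insert, so every decoded row pays for a
relsync cache rebuild in the concurrent transactions.

-- Run concurrently with the N open transactions from the setup sketch:
DO $$
BEGIN
  FOR i IN 1..100 LOOP
    ALTER PUBLICATION pub DROP TABLE tbl_pub;
    ALTER PUBLICATION pub ADD TABLE tbl_pub;   -- one invalidation ...
    INSERT INTO tbl_pub VALUES (1);            -- ... one insert per cycle
  END LOOP;
END $$;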


-----

Workload E. DDL is happening on the related published table,
            and 1000 inserts are done per invalidation
============================================
The regression seen in D is not observed here. This matches master's case and is
expected, because decoding the 1000 tuples takes far longer than the cache
rebuilds; the only difference from D's sketch is shown after the table.

Concurrent txn     | Head (sec)   | Patch (sec)  | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50                 | 0.093019     | 0.108820     | 16.9869
100                | 0.188367     | 0.199621     | 5.9741
500                | 0.967896     | 0.970674     | 0.2870
1000               | 1.658552     | 1.803991     | 8.7691
2000               | 3.482935     | 3.682771     | 5.7376
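
The sketch only differs from Workload D's in the row count per invalidation;
the 1000-row inserts amortize each rebuild:

-- Same hypothetical loop as in Workload D, but with 1000 rows per
-- invalidation, so decoding is dominated by the tuples themselves:
DO $$
BEGIN
  FOR i IN 1..100 LOOP
    ALTER PUBLICATION pub DROP TABLE tbl_pub;
    ALTER PUBLICATION pub ADD TABLE tbl_pub;
    INSERT INTO tbl_pub SELECT generate_series(1, 1000);
  END LOOP;
END $$;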

[1]:
https://www.postgresql.org/message-id/OSCPR01MB149661EA973D65EBEC2B60D98F5D32%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

