RE: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: long-standing data loss bug in initial sync of logical replication
Msg-id OS0PR01MB571616A2C303FCED3CF3D67C94C92@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: long-standing data loss bug in initial sync of logical replication  (Benoit Lobréau <benoit.lobreau@dalibo.com>)
Responses Re: long-standing data loss bug in initial sync of logical replication
List pgsql-hackers
On Friday, February 28, 2025 4:28 PM Benoit Lobréau <benoit.lobreau@dalibo.com> wrote:
> 
> It took me a while but I ran the test on my laptop with 20 runs per test. I asked
> for a dedicated server and will re-run the tests if/when I have it.
> 
> count of partitions |   Head (sec) |    Fix (sec) |    Degradation (%)
> ----------------------------------------------------------------------
> 1000                |       0,0265 |       0,028  |  5,66037735849054
> 5000                |       0,091  |       0,0945 |  3,84615384615385
> 10000               |       0,1795 |       0,1815 |  1,11420612813371
> 
>   Concurrent Txn |    Head (sec)    |    Patch (sec) | Degradation in %
>   ---------------------------------------------------------------------
>   50             |   0,1797647      |   0,1920949    |  6,85907744957
>   100            |   0,3693029      |   0,3823425    |  3,53086856344
>   500            |   1,62265755     |   1,91427485   | 17,97158617972
>   1000           |   3,01388635     |   3,57678295   | 18,67676928162
>   2000           |   7,0171877      |   6,4713304    |  8,43500897435
> 
> I'll try to run test2.pl later (right now it fails).
> 
> hope this helps.

Thank you for testing and sharing the data!

One nitpick about the data for the Concurrent Txn (2000) case: the results show
HEAD performing worse than the patch, which seems unusual. However, I confirmed
that the details in the attachment are as expected, so this seems to be a typo.
(I assume you intended to use a decimal point instead of a comma in the data,
e.g. 8,43500...)
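
As a quick cross-check of the quoted table, here is a minimal Python sketch that
recomputes the degradation column. It assumes the column is defined as
(patch - head) / head * 100, which reproduces the reported values for the other
rows; the function names are illustrative and do not come from the test scripts.

```python
def parse_decimal_comma(text):
    """Convert a number written with a decimal comma (e.g. '0,1797647') to float."""
    return float(text.strip().replace(",", "."))


def degradation_pct(head_sec, patch_sec):
    """Percent slowdown of the patched build relative to HEAD (assumed formula)."""
    return (patch_sec - head_sec) / head_sec * 100.0


# Concurrent Txn -> (HEAD sec, patch sec), copied from the quoted table.
rows = {
    50: ("0,1797647", "0,1920949"),
    100: ("0,3693029", "0,3823425"),
    500: ("1,62265755", "1,91427485"),
    1000: ("3,01388635", "3,57678295"),
    2000: ("7,0171877", "6,4713304"),
}

# Recompute the degradation for every row as it appears in the table.
recomputed = {
    txns: degradation_pct(parse_decimal_comma(head), parse_decimal_comma(patch))
    for txns, (head, patch) in rows.items()
}

# Same 2000-transaction row with the two timing columns exchanged.
swapped_2000 = degradation_pct(
    parse_decimal_comma(rows[2000][1]), parse_decimal_comma(rows[2000][0])
)

for txns, pct in recomputed.items():
    print(f"{txns:>5} txns: {pct:+.4f} %")
print(f" 2000 txns (columns swapped): {swapped_2000:+.4f} %")
```

With the values as quoted, the 2000 row comes out negative (the patch is
faster). Exchanging the two timings for that row gives about 8.435, the figure
reported in the table, which is consistent with the two columns having been
transposed in that row.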

The data suggests some regression, slightly more than Shlok’s findings, but it
is still within an acceptable range for me. Since the test script builds a real
subscription for testing, the results might be affected by network and
replication factors, as Amit pointed out. We will soon share a new test script
that uses the SQL API xxx_get_changes() instead. It would be great if you could
verify the performance using the updated script as well.

Best Regards,
Hou zj
