
From Hayato Kuroda (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id OSCPR01MB1496663AED8EEC566074DFBC9F54CA@OSCPR01MB14966.jpnprd01.prod.outlook.com
In response to Re: Conflict detection for update_deleted in logical replication  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
Dear hackers,

To confirm the behavior, I did performance testing with the four workloads we
had used before.

Highlights
==========
The retests on the latest patch set v46 show results consistent with previous
observations:
 - There is no performance impact on the publisher side.
 - There is no performance impact on the subscriber side if the workload runs
   only on the subscriber.
 - Performance is reduced on the subscriber side (~50% TPS reduction, [Test-03])
   when retain_conflict_info=on and pgbench runs on both sides. This is due to
   dead tuple retention for conflict detection: when the workload on the
   publisher is high, the apply workers must wait for the transactions with
   earlier timestamps to be applied and flushed before advancing the
   non-removable XID that allows dead tuples to be removed.
 - Subscriber-side TPS improves when the workload on the publisher is reduced.
 - Performance on the subscriber can also be improved by tuning the
   max_conflict_retention_duration GUC properly (see the setup sketch below).
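
As a reference, here is a minimal sketch of the kind of setup being measured,
assuming retain_conflict_info is a subscription option and
max_conflict_retention_duration is a subscriber-side GUC as in the v46 patch
set; the connection string, database names, and subscription name are made up
for illustration:

    # On the subscriber: create a subscription that retains the information
    # needed to detect update_deleted conflicts (option name per the patch set).
    psql -d subdb -c "CREATE SUBSCRIPTION sub
        CONNECTION 'host=pub_host dbname=pubdb'
        PUBLICATION pub
        WITH (retain_conflict_info = on);"

    # Optionally bound how long conflict information (dead tuples) may be
    # retained before the conflict slot is invalidated; Test 04 below uses
    # the values 60 and 120.
    psql -d subdb -c "ALTER SYSTEM SET max_conflict_retention_duration = 60;"
    psql -d subdb -c "SELECT pg_reload_conf();"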

Used source
===========
pgHead commit fd7d7b7191 + v46 patchset

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :88 cores, - 503 GiB RAM

01. pgbench on publisher
========================
The workload is mostly the same as [1].

Workload:
 - Ran pgbench with 40 clients for the publisher.
 - The duration was 120s, and the measurement was repeated 10 times.

(pubtest.tar.gz can run the same workload)
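
Roughly, each measurement looks like the following (a sketch only; the port,
database, and scale factor are assumptions, and the actual scripts are in
pubtest.tar.gz):

    #!/bin/bash
    # Repeat a 120s pgbench run with 40 clients against the publisher ten
    # times and collect the reported TPS of each run.
    PUBPORT=5432                                # publisher port (assumption)
    pgbench -i -s 100 -p "$PUBPORT" postgres    # scale factor is an assumption

    for run in $(seq 1 10); do
        tps=$(pgbench -c 40 -j 40 -T 120 -p "$PUBPORT" postgres |
              awk '/^tps/ {print $3}')
        echo "run=$run tps=$tps" >> publisher_tps.txt
    done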

Test Scenarios & Results:
 - pgHead : Median TPS = 39809.84925
 - pgHead + patch : Median TPS = 40102.88108

Observation:
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:
  - Each cell shows the TPS measured in each run.
  - patch(ON) means the patch is applied and retain_conflict_info=ON is set.

run#    pgHEAD         pgHead+patch(ON) 
1    40106.88834        40356.60039
2    39854.17244        40087.18077
3    39516.26983        40063.34688
4    39746.45715        40389.40549
5    40014.83857        40537.24
6    39819.26374        40016.78705
7    39800.43476        38774.9827
8    39884.2691        40163.35257
9    39753.11246        39902.02755
10    39427.2353        40118.58138
median    39809.84925        40102.88108

02. pgbench on subscriber
========================
The workload is mostly the same as [2].

Workload:
 - Ran pgbench with 40 clients for the *subscriber*.
 - The duration was 120s, and the measurement was repeated 10 times.

(subtest.tar.gz can run the same workload)

Test Scenarios & Results:
 - pgHead : Median TPS = 41564.64591
 - pgHead + patch : Median TPS = 41083.09555

Observation:
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:

run#    pgHEAD         pgHead+patch(ON)
1    41605.88999        41106.93126
2    41555.76448        40975.9575
3    41505.76161        41223.92841
4    41722.50373        41049.52787
5    41400.48427        41262.15085
6    41386.47969        41059.25985
7    41679.7485        40916.93053
8    41563.60036        41178.82461
9    41565.69145        41672.41773
10    41765.11049        40958.73512
median    41564.64591        41083.09555

03. pgbench on both sides
========================
The workload is mostly the same as [3].

Workload:
 - Ran pgbench with 40 clients on *both sides*.
 - The duration was 120s, and the measurement was repeated 10 times.

(bothtest.tar.gz can run the same workload)
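
The only difference from the earlier workloads is that the two pgbench
instances run at the same time, roughly like this (ports and database names
are assumptions; the real scripts are in bothtest.tar.gz):

    #!/bin/bash
    # Drive the publisher and the subscriber concurrently so that the
    # subscriber both applies remote changes and runs its own writes.
    PUBPORT=5432   # publisher port (assumption)
    SUBPORT=5433   # subscriber port (assumption)

    for run in $(seq 1 10); do
        pgbench -c 40 -j 40 -T 120 -p "$PUBPORT" postgres > pub_run_${run}.log &
        pgbench -c 40 -j 40 -T 120 -p "$SUBPORT" postgres > sub_run_${run}.log &
        wait   # let both 120s runs finish before starting the next one
    done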

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 16799.67659
 - pgHead + patch : Median TPS = 17338.38423
Subscriber:
 - pgHead : Median TPS = 16552.60515
 - pgHead + patch : Median TPS = 8367.133693

Observation:
 - No performance regression observed on the publisher with the patch applied.
 - Performance is reduced on the subscriber side (~50% TPS reduction) due to
   dead tuple retention for conflict detection.

Detailed Results Table:

On publisher:
run#    pgHEAD         pgHead+patch(ON) 
1    16735.53391        17369.89325
2    16957.01458        17077.96864
3    16838.07008        17480.08206
4    16743.67772        17531.00493
5    16776.74723        17511.4314
6    16784.73354        17235.76573
7    16871.63841        17255.04538
8    16814.61964        17460.33946
9    16903.14424        17024.77703
10    16556.05636        17306.87522
median    16799.67659        17338.38423

On subscriber:
run#    pgHEAD     pgHead+patch(ON) 
1    16505.27302    8381.200661
2    16765.38292    8353.310973
3    16899.41055    8396.901652
4    16305.05353    8413.058805
5    16722.90536    8320.833085
6    16587.64864    8327.217432
7    16508.45076    8369.205438
8    16357.05337    8394.34603
9    16724.90296    8351.718212
10    16517.56167    8365.061948
median    16552.60515    8367.133693

04. pgbench on both sides, with max_conflict_retention_duration tuned
========================================================================
The workload is mostly the same as [4].

Workload:
- Initially ran pgbench with 40 clients on *both sides*.
- Set max_conflict_retention_duration = {60, 120}.
- When the conflict slot was invalidated on the subscriber side, the benchmark
  was stopped and we waited until the subscriber had caught up; then the number
  of clients on the publisher was halved.
  In this test the conflict slot was invalidated as expected while the workload
  on the publisher was high, and it was no longer invalidated after the
  workload had been reduced. This shows that even if the slot has been
  invalidated once, users can continue to detect the update_deleted conflict by
  reducing the workload on the publisher.
- The total period of the test was 900s in each case.

(max_conflixt.tar.gz can run the same workload)
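
The point at which the benchmark is stopped can be detected by polling the
subscriber for invalidation of the conflict slot; a sketch follows, where the
slot name (pg_conflict_detection) and the use of
pg_replication_slots.invalidation_reason are assumptions based on the patch
set:

    # Poll the subscriber until the conflict-detection slot is reported as
    # invalidated, then stop the benchmark and halve the publisher's clients
    # (slot name and column are assumptions).
    SUBPORT=5433   # subscriber port (assumption)
    while true; do
        invalidated=$(psql -At -p "$SUBPORT" -d postgres -c \
            "SELECT invalidation_reason IS NOT NULL
               FROM pg_replication_slots
              WHERE slot_name = 'pg_conflict_detection';")
        [ "$invalidated" = "t" ] && break
        sleep 10
    done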

Observation:
 - The parallelism on the publisher side was reduced from 15 to 7 to 3 clients,
   after which the conflict slot was no longer invalidated.
 - TPS on the subscriber side improved as the concurrency was reduced, because
   dead tuple accumulation on the subscriber decreases with a reduced workload
   on the publisher.
 - With Nclients=3 on the publisher, there is no regression in the subscriber's
   TPS.

Detailed Results Table:
    For max_conflict_retention_duration = 60s
    On publisher:
        Nclients    duration [s]    TPS
        15          72              14079.1
        7           82              9307
        3           446             4133.2

    On subscriber:
        Nclients    duration [s]    TPS
        15          72              6827
        15          81              7200
        15          446             19129.4

    For max_conflict_retention_duration = 120s
    On publisher:
        Nclients    duration [s]    TPS
        15          162             17835.3
        7           152             9503.8
        3           283             4243.9

    On subscriber:
        Nclients    duration [s]    TPS
        15          162             4571.8
        15          152             4707
        15          283             19568.4

Thanks to Nisha-san and Hou-san for helping with this work.

[1]: https://www.postgresql.org/message-id/CABdArM5SpMyGvQTsX0-d%3Db%2BJAh0VQjuoyf9jFqcrQ3JLws5eOw%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/TYAPR01MB5692B0182356F041DC9DE3B5F53E2%40TYAPR01MB5692.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
[4]: https://www.postgresql.org/message-id/OSCPR01MB14966F39BE1732B9E433023BFF5E72%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

