Re: Perform streaming logical transactions by background workers and parallel apply - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Perform streaming logical transactions by background workers and parallel apply
Msg-id CAHut+PsUCEbu5dgHarMAPZu5rCs62racVXs=CCbkt2q-eXMRYA@mail.gmail.com
In response to RE: Perform streaming logical transactions by background workers and parallel apply  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses Re: Perform streaming logical transactions by background workers and parallel apply
List pgsql-hackers
Hi, I have done some testing for this patch. This post describes my
tests so far and the results observed.

Background - Testing multiple PA workers:
---------------------------------------

The "parallel apply" feature allocates a PA worker (if it can) upon
receiving a STREAM_START replication protocol message. This means that
if there are replication messages for overlapping streaming
transactions you should see multiple PA workers processing them
(assuming the PA pool size is configured appropriately).

But AFAIK the only way to make replication protocol messages arrive
and be applied in a particular order is manual testing (e.g. use 2x
psql sessions and manually arrange overlapping transactions for the
published table). I have tried to make this kind of (regression)
testing easier: to test many overlapping combinations in a repeatable
and semi-automated way, I have posted a small enhancement to the
isolationtester spec grammar [1]. Using this, we can now just press a
button to test lots of different streaming transaction combinations
and then observe the parallel apply message dispatching in action...

Test message combinations (from specs/pub-sub.spec):
----------------------------------------------------

# single tx
permutation ps1_begin ps1_ins ps1_commit ps1_sel ps2_sel sub_sleep sub_sel
permutation ps2_begin ps2_ins ps2_commit ps1_sel ps2_sel sub_sleep sub_sel

# rollback
permutation ps1_begin ps1_ins ps1_rollback ps1_sel sub_sleep sub_sel

# overlapping tx rollback and commit
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_rollback ps2_commit sub_sleep sub_sel
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_commit ps2_rollback sub_sleep sub_sel

# overlapping tx commits
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps2_commit ps1_commit sub_sleep sub_sel
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_commit ps2_commit sub_sleep sub_sel

permutation ps1_begin ps2_begin ps1_ins ps2_ins ps2_commit ps1_commit sub_sleep sub_sel
permutation ps1_begin ps2_begin ps1_ins ps2_ins ps1_commit ps2_commit sub_sleep sub_sel

permutation ps1_begin ps2_begin ps2_ins ps1_ins ps2_commit ps1_commit sub_sleep sub_sel
permutation ps1_begin ps2_begin ps2_ins ps1_ins ps1_commit ps2_commit sub_sleep sub_sel

Test setup:
-----------

1. Setup publisher and subscriber servers

1a. Publisher server is configured to use new GUC 'force_stream_mode =
true' [2]. This means even single-row inserts cause replication
STREAM_START messages which will trigger the PA workers.

1b. Subscriber server is configured to use new GUC
'max_parallel_apply_workers_per_subscription'. Set this value to
change how many PA workers can be allocated.

2. isolation/specs/pub-sub.spec (defines the publisher sessions being tested)
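
For anyone reproducing steps 1a and 1b, the settings could be appended
to each server's postgresql.conf with something like the sketch below.
The helper name append_test_gucs and its arguments are hypothetical
(not part of any patch), and 'force_stream_mode' only exists on a
server built with the patch from [2].

```shell
# Hypothetical helper: append the GUCs used in this test to a postgresql.conf.
#   $1 = path to postgresql.conf
#   $2 = "publisher" or "subscriber"
#   $3 = PA pool size (subscriber only; defaults to 2)
append_test_gucs() {
    conf=$1
    role=$2
    pool=${3:-2}
    case "$role" in
        publisher)
            # GUC from the patch in [2]; an unpatched server will reject it
            echo "force_stream_mode = true" >> "$conf"
            ;;
        subscriber)
            echo "max_parallel_apply_workers_per_subscription = $pool" >> "$conf"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}
```

e.g. append_test_gucs "$PGDATA/postgresql.conf" subscriber 99, followed
by a server restart so the new values are picked up.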


How verified:
-------------

1. Running the isolationtester pub-sub.spec test gives the expected
table results (so data was replicated OK)
- any new permutations can be added as required.
- more overlapping sessions (e.g. 3 or 4...) can be added as required.

2. Changing the publisher GUC 'force_stream_mode' to be true/false
- we can see whether PA workers are being used or not (ps -eaf |
grep 'logical replication')

3. Changing the subscriber GUC 'max_parallel_apply_workers_per_subscription'
- set to a high or low value so we can see the PA worker pool being
used or filling to capacity

4. I have also patched some temporary logging into the code for both
"LA" and "PA" workers
- now the subscriber logfile leaves a trail of evidence about which
worker did what (for apply_dispatch and for locking calls)
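
The process check in step 2 can be scripted. A minimal sketch, assuming
the process titles look like the ps output shown in this post; the
helper name count_pa_workers is hypothetical:

```shell
# Hypothetical helper: read ps output on stdin and print the number of
# parallel apply (PA) worker processes found in it.
count_pa_workers() {
    # The pattern does not match the leader ("logical replication apply
    # worker") or the launcher. grep -c exits non-zero when the count is 0,
    # so '|| true' keeps the function's exit status clean.
    grep -c 'logical replication parallel apply worker' || true
}

# Usage: ps -eaf | count_pa_workers
```

A count of 0 with streaming enabled would suggest that the leader apply
worker handled the streamed transactions itself.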

Observed Results:
-----------------

1. From the user's POV everything is normal - data gets replicated as
expected regardless of GUC settings (force_stream_mode /
max_parallel_apply_workers_per_subscription).

[postgres@CentOS7-x64 isolation]$ make check-pub-sub
...
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 61696 with PID 11822
============== creating database "isolation_regression" ==============
CREATE DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
============== running regression test queries        ==============
test pub-sub                      ... ok        33424 ms
============== shutting down postmaster               ==============
============== removing temporary instance            ==============

=====================
 All 1 tests passed.
=====================


2. Confirmation that multiple PA workers were used
(force_stream_mode=true /
max_parallel_apply_workers_per_subscription=99)

[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres  5298  5293  0 Dec19 ?        00:00:00 postgres: logical replication launcher
postgres  5306  5301  0 Dec19 ?        00:00:00 postgres: logical replication launcher
postgres 17301  5301  0 10:31 ?        00:00:00 postgres: logical replication parallel apply worker for subscription 16387
postgres 17524  5301  0 10:31 ?        00:00:00 postgres: logical replication parallel apply worker for subscription 16387
postgres 21134  5301  0 08:08 ?        00:00:01 postgres: logical replication apply worker for subscription 16387
postgres 22377 13260  0 10:34 pts/0    00:00:00 grep --color=auto logical replication

3. Confirmation that no PA workers were used when not streaming
(force_stream_mode=false /
max_parallel_apply_workers_per_subscription=99)

[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres 26857 26846  0 10:37 ?        00:00:00 postgres: logical replication launcher
postgres 26875 26864  0 10:37 ?        00:00:00 postgres: logical replication launcher
postgres 26889 26864  0 10:37 ?        00:00:00 postgres: logical replication apply worker for subscription 16387
postgres 29901 13260  0 10:39 pts/0    00:00:00 grep --color=auto logical replication

4. Confirmation that only one PA worker gets used when the pool is limited
(force_stream_mode=true / max_parallel_apply_workers_per_subscription=1)

4a. (processes)
[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres  2484 13260  0 10:42 pts/0    00:00:00 grep --color=auto logical replication
postgres 32500 32495  0 10:40 ?        00:00:00 postgres: logical replication launcher
postgres 32508 32503  0 10:40 ?        00:00:00 postgres: logical replication launcher
postgres 32514 32503  0 10:41 ?        00:00:00 postgres: logical replication apply worker for subscription 16387

4b. (logs)
2022-12-20 10:41:43.551 AEDT [32514] LOG:  out of parallel apply workers
2022-12-20 10:41:43.551 AEDT [32514] HINT:  You might need to increase max_parallel_apply_workers_per_subscription.
2022-12-20 10:41:43.551 AEDT [32514] CONTEXT:  processing remote data for replication origin "pg_16387" during message type "STREAM START" in transaction 756

5. Confirmation that no PA workers get used when there are none available
(force_stream_mode=true / max_parallel_apply_workers_per_subscription=0)

5a. (processes)
[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres 10026 10021  0 10:47 ?        00:00:00 postgres: logical replication launcher
postgres 10034 10029  0 10:47 ?        00:00:00 postgres: logical replication launcher
postgres 10041 10029  0 10:47 ?        00:00:00 postgres: logical replication apply worker for subscription 16387
postgres 13068 13260  0 10:48 pts/0    00:00:00 grep --color=auto logical replication

5b. (logs)
2022-12-20 10:47:50.216 AEDT [10041] LOG:  out of parallel apply workers
2022-12-20 10:47:50.216 AEDT [10041] HINT:  You might need to increase max_parallel_apply_workers_per_subscription.
..
Also, there are no "PA" log messages present.
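
The log checks in 4b and 5b could be automated along these lines. This
is only a sketch: the helper name count_pool_exhausted is hypothetical,
and it assumes the subscriber log contains the "out of parallel apply
workers" message exactly as quoted above.

```shell
# Hypothetical helper: print how many times the subscriber log file ($1)
# reports that no PA worker could be allocated.
count_pool_exhausted() {
    # grep -c exits non-zero when nothing matches, so '|| true' keeps the
    # function's exit status clean when the count is 0.
    grep -c 'out of parallel apply workers' "$1" || true
}

# Usage: count_pool_exhausted SUB.log
```

A non-zero count means at least one STREAM_START could not be handed
to a PA worker.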


Summary
-------

In summary, everything I have tested so far appeared to be working
properly. In other words, for overlapping streamed transactions of
different kinds, and regardless of whether zero/some/all of those
transactions are getting processed by a PA worker, the resulting
replicated data looked consistently OK.


PSA some files:
- test_init.sh - sample script to set up the publisher/subscriber
required by the spec test.
- spec/pub-sub.spec - spec combinations for causing overlapping
streaming transactions
- pub-sub.out - output from a successful isolationtester (make check-pub-sub) run
- SUB.log - subscriber logs augmented with my extra "LA" and "PA"
logging for showing locking/dispatching.

(I can also post my logging patch if anyone is interested in trying it
to see output like that in SUB.log).

NOTE - all testing described in this post used v58-0001 only. However,
the point of implementing these as a .spec test was to be able to
repeat the same regression tests on newer versions with minimal manual
steps. Later I plan to fetch/apply the most recent patch version and
repeat these tests.

------
[1] My isolationtester conninfo enhancement v2 -
https://www.postgresql.org/message-id/CAHut%2BPv_1Mev0709uj_OjyNCzfBjENE3RD9%3Dd9RZYfcqUKfG%3DA%40mail.gmail.com
[2] Shi-san's GUC 'force_stream_mode' -
https://www.postgresql.org/message-id/flat/OSZPR01MB63104E7449DBE41932DB19F1FD1B9%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia
