Thread: Perform streaming logical transactions by background workers and parallel apply

In this email, I would like to discuss allowing streaming logical
transactions (large in-progress transactions) by background workers
and parallel apply in general. The goal of this work is to improve the
performance of the apply work in logical replication.

Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber side, the apply
worker writes the changes into temporary files and, once it receives
the commit, reads from the file and applies the entire transaction. To
improve the performance of such transactions, we can instead allow
them to be applied via background workers. There could be multiple
ways to achieve this:

Approach-1: Assign a new bgworker (if available) as soon as the xact's
first stream arrives, and have the main apply worker send changes to
this new worker via shared memory. We keep this worker assigned until
the transaction's commit arrives and also wait for the worker to finish
at commit. This preserves commit ordering and avoids writing to and
reading from a file in most cases. We still need to spill if there is
no worker available. We also need to wait at stream_stop for the
background worker to finish applying the current stream, to avoid
deadlocks, because T-1's current stream of changes can update rows in a
conflicting order with T-2's next stream of changes.
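
As a rough illustration of this flow from the main apply worker's side,
here is a minimal standalone C sketch (all the bgworker_* and spill
functions below are made-up placeholders, not PostgreSQL APIs; a real
implementation would use a dynamic shared memory segment plus a shm_mq
to talk to the worker):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle for one background apply worker. */
typedef struct ApplyBgworker ApplyBgworker;

/* Assumed primitives (placeholders only). */
extern ApplyBgworker *bgworker_acquire(uint32_t xid);    /* NULL if none free */
extern void bgworker_send(ApplyBgworker *w, const void *data, size_t len);
extern void bgworker_wait_stream_done(ApplyBgworker *w); /* wait at stream_stop */
extern void bgworker_wait_commit_done(ApplyBgworker *w); /* wait at commit */
extern void spill_chunk_to_file(uint32_t xid, const void *data, size_t len);
extern void apply_spooled_changes(uint32_t xid);

/*
 * Called by the main apply worker for each stream chunk of transaction
 * 'xid'.  '*slot' remembers the bgworker assigned at the first stream
 * (NULL initially).
 */
void
handle_stream_chunk(uint32_t xid, const void *data, size_t len,
                    ApplyBgworker **slot)
{
    if (*slot == NULL)
        *slot = bgworker_acquire(xid);   /* first stream: try to grab a worker */

    if (*slot != NULL)
    {
        bgworker_send(*slot, data, len); /* ship the changes via shared memory */

        /*
         * Wait for the worker to finish this stream before accepting the
         * next one, so overlapping transactions cannot update rows in a
         * conflicting order and deadlock.
         */
        bgworker_wait_stream_done(*slot);
    }
    else
        spill_chunk_to_file(xid, data, len);   /* no worker free: spill as today */
}

/* Called when the commit for 'xid' arrives; commit ordering is preserved. */
void
handle_stream_commit(uint32_t xid, ApplyBgworker *slot)
{
    if (slot != NULL)
        bgworker_wait_commit_done(slot); /* keep the worker assigned until done */
    else
        apply_spooled_changes(xid);      /* read the spill file and apply */
}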

Approach-2: Assign another worker to spill the changes and only apply
them at commit time, by the same or another worker. Now, to preserve
the commit order, we need to wait at commit so that the assigned
workers can finish. This won't avoid spilling to disk and reading back
at commit time, but it can help in receiving and processing more data
than we do currently. It is not clear whether this can win over
Approach-1, because we still need to write to and read from the file,
and we would probably need to use a shared memory queue to send the
data to the other background workers for processing.

We need to change error handling to allow the above parallelization.
The current model for apply is that if any error occurs while applying,
we simply report the error in the server logs and the apply worker
exits. On restart, it again receives the transaction data that
previously failed and tries to apply it again. Now, in the new approach
(say Approach-1), we need to ensure that all the active workers that
are applying in-progress transactions also exit before the main apply
worker exits, to allow rollback of the currently applied transactions
and to re-apply them when we get the data again. This is required to
avoid losing transactions if some later transaction has committed and
updated the replication origin, as in such cases the earlier
transactions won't be resent. This won't be much different from what we
do now when, say, two transactions t-1 and t-2 have multiple
overlapping streams: if an error happens before one of them completes
via commit or rollback, all the data needs to be resent by the server
and processed again by the apply worker.

The next step in this area is to parallelize the apply of all possible
transactions. I think the main things we need to take care of to allow
this are:
1. Transaction dependency: We can't simply allow dependent transactions
to run in parallel, as that can lead to inconsistency. Say we insert a
row in the first transaction and update it in the second transaction;
if we allow both transactions to apply in parallel, the insert may be
applied after the update, and the update will fail.
2. Deadlocks: These can happen because the transactions will now be
applied in parallel. Say transaction T-1 updates row-2 and row-3 and
transaction T-2 updates row-3 and row-2; if we allow them to run in
parallel then there is a chance of deadlock, whereas there is no such
risk in serial execution where the commit order is preserved.

We can solve both problems if we allow only independent xacts to be
parallelized. Transactions would be considered dependent if they
operate on the same set of rows from the same table. Apart from this,
there could be other cases where determining transaction dependency
won't be straightforward, so we can disallow those transactions from
participating in parallel apply. Those are the cases where functions
can be used in table definition expressions. We can think of
identifying safe functions, like all built-in functions and any
immutable functions (and probably stable functions). We need to check
safety for cases such as (a) trigger functions, (b) column default
value expressions (as those can call functions), (c) constraint
expressions, (d) foreign keys, and (e) operations on partitioned tables
(especially those performed via the publish_via_partition_root option),
as we need to check the expressions on all partitions.

Transactions that operate on the same set of tables and perform a
truncate can lead to deadlock, so we need to consider such transactions
as dependent.

The basic idea is that for each running xact we can maintain the table
oid, row id (pkey or replica identity), and xid in a hash table in the
apply worker. For any new xact, we need to check that it doesn't
conflict with one of the previously running xacts and only then allow
it to be applied in parallel. We can collect all the changes of a
transaction in an in-memory buffer while checking its dependency and
then let one of the available workers apply it at commit. If the rows
for a particular transaction exceed a certain threshold, then we need
to escalate to a table-level strategy, which means any other
transaction operating on the same table will be considered dependent.
For very large transactions that don't fit in the in-memory buffer, we
either need to spill them to disk or simply decide not to parallelize
them. We need to remove rows from the hash table once the transaction
is applied completely.
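
To make the bookkeeping concrete, here is a rough standalone C sketch of
the per-row dependency check (purely illustrative: the fixed-size array
and linear scan just stand in for the hash table, and removing a
transaction's rows at commit is omitted):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One entry per (table, row) touched by a currently running transaction. */
typedef struct RowEntry
{
    uint32_t    reloid;     /* table oid */
    char        rowid[64];  /* pkey / replica identity, canonicalized as text */
    uint32_t    xid;        /* transaction currently touching this row */
} RowEntry;

#define MAX_TRACKED_ROWS 1024   /* cap before escalating to a table-level strategy */

static RowEntry tracked[MAX_TRACKED_ROWS];
static int      ntracked = 0;

/*
 * Does this change touch a row already touched by a different running
 * transaction?  If so, the new transaction is dependent and must not be
 * applied in parallel with the earlier one.
 */
static bool
change_conflicts(uint32_t xid, uint32_t reloid, const char *rowid)
{
    for (int i = 0; i < ntracked; i++)
    {
        if (tracked[i].reloid == reloid &&
            strcmp(tracked[i].rowid, rowid) == 0 &&
            tracked[i].xid != xid)
            return true;
    }
    return false;
}

/*
 * Remember a row touched by 'xid'.  Returns false once the buffer is
 * full, at which point the caller escalates to a table-level dependency.
 */
static bool
track_row(uint32_t xid, uint32_t reloid, const char *rowid)
{
    if (ntracked >= MAX_TRACKED_ROWS)
        return false;
    tracked[ntracked].reloid = reloid;
    strncpy(tracked[ntracked].rowid, rowid, sizeof(tracked[ntracked].rowid) - 1);
    tracked[ntracked].rowid[sizeof(tracked[ntracked].rowid) - 1] = '\0';
    tracked[ntracked].xid = xid;
    ntracked++;
    return true;
}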

The other thing we need to ensure while parallelizing independent
transactions is to preserve the commit order of transactions. This is
to ensure that in case of errors, we won't get replicas out of sync. If
we allowed the commit order to change, it would be possible for some
later transaction to update the replication origin LSN to a value past
the transaction whose apply is still in progress. Now, if an error
occurs for such an in-progress transaction, the server won't resend its
changes, as the replication origin's LSN would have moved ahead.

Even though we are preserving commit order, there will still be a
benefit from parallel apply, as we should be able to parallelize most
of the writes within the transactions.

Thoughts?

Thanks to Hou-San and Shi-San for helping me to investigate these ideas.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"

On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> In this email, I would like to discuss allowing streaming logical
> transactions (large in-progress transactions) by background workers
> and parallel apply in general. The goal of this work is to improve the
> performance of the apply work in logical replication.
> 
> Currently, for large transactions, the publisher sends the data in
> multiple streams (changes divided into chunks depending upon
> logical_decoding_work_mem), and then on the subscriber-side, the apply
> worker writes the changes into temporary files and once it receives
> the commit, it read from the file and apply the entire transaction. To
> improve the performance of such transactions, we can instead allow
> them to be applied via background workers. There could be multiple
> ways to achieve this:
> 
> Approach-1: Assign a new bgworker (if available) as soon as the xact's
> first stream came and the main apply worker will send changes to this
> new worker via shared memory. We keep this worker assigned till the
> transaction commit came and also wait for the worker to finish at
> commit. This preserves commit ordering and avoid writing to and
> reading from file in most cases. We still need to spill if there is no
> worker available. We also need to allow stream_stop to complete by the
> background worker to finish it to avoid deadlocks because T-1's
> current stream of changes can update rows in conflicting order with
> T-2's next stream of changes.
> 

Attached is the POC patch for Approach-1 of "Perform streaming logical
transactions by background workers". The patch is still a WIP patch as
there are several TODO items left, including:

* error handling for the bgworker
* support for skipping the transaction in the bgworker
* handling the case when there is no more worker available
  (might need to spill the data to a temp file in this case)
* some potential bugs

The original patch is borrowed from an old thread[1] and was rebased and
extended/cleaned by me. Comments and suggestions are welcome.

[1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru

Here are some performance results of the patch, shared by Shi Yu off-list.

The performance was tested by varying logical_decoding_work_mem, for
two cases:

1) bulk insert.
2) create savepoint and rollback to savepoint.

I used synchronous logical replication in the test and compared SQL
execution times before and after applying the patch.

The results are as follows. The bar charts and the details of the test
are attached as well.

RESULT - bulk insert (5kk)
----------------------------------
logical_decoding_work_mem   64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
HEAD                        51.673  51.199  51.166  50.259  52.898  50.651  51.156  51.210  50.678  51.256  51.138
patched                     36.198  35.123  34.223  29.198  28.712  29.090  29.709  29.408  34.367  34.716  35.439

RESULT - rollback to savepoint (600k)
----------------------------------
logical_decoding_work_mem   64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
HEAD                        31.101  31.087  30.931  31.015  30.920  31.109  30.863  31.008  30.875  30.775  29.903
patched                     28.115  28.487  27.804  28.175  27.734  29.047  28.279  27.909  28.277  27.345  28.375


Summary:
1) bulk insert

For different logical_decoding_work_mem sizes, it takes about 30% ~ 45%
less time, which looks good to me. With the patch applied, the
performance is best when logical_decoding_work_mem is between 512kB and
8MB.

2) rollback to savepoint

There is an improvement of about 5% ~ 10% after applying this patch.

In this case, the patch spends less time handling the part that is not
rolled back, because it saves the time spent writing the changes to a
temporary file and reading them back. For the part that is rolled back,
it spends more time than HEAD, because applying the changes and then
rolling them back takes longer than writing them to a temporary file
and truncating the file. Overall, the results look good.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"

On Friday, April 8, 2022 5:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> 
> > In this email, I would like to discuss allowing streaming logical
> > transactions (large in-progress transactions) by background workers
> > and parallel apply in general. The goal of this work is to improve the
> > performance of the apply work in logical replication.
> >
> > Currently, for large transactions, the publisher sends the data in
> > multiple streams (changes divided into chunks depending upon
> > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > worker writes the changes into temporary files and once it receives
> > the commit, it read from the file and apply the entire transaction. To
> > improve the performance of such transactions, we can instead allow
> > them to be applied via background workers. There could be multiple
> > ways to achieve this:
> >
> > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > first stream came and the main apply worker will send changes to this
> > new worker via shared memory. We keep this worker assigned till the
> > transaction commit came and also wait for the worker to finish at
> > commit. This preserves commit ordering and avoid writing to and
> > reading from file in most cases. We still need to spill if there is no
> > worker available. We also need to allow stream_stop to complete by the
> > background worker to finish it to avoid deadlocks because T-1's
> > current stream of changes can update rows in conflicting order with
> > T-2's next stream of changes.
> >
> 
> Attach the POC patch for the Approach-1 of "Perform streaming logical
> transactions by background workers". The patch is still a WIP patch as
> there are serval TODO items left, including:
> 
> * error handling for bgworker
> * support for SKIP the transaction in bgworker
> * handle the case when there is no more worker available
>   (might need spill the data to the temp file in this case)
> * some potential bugs
> 
> The original patch is borrowed from an old thread[1] and was rebased and
> extended/cleaned by me. Comments and suggestions are welcome.

Attached is a new version of the patch, which improves the error
handling and handles the case when there is no more worker available
(the data will be spilled to a temp file in this case).

Currently, it still doesn't support skipping the streamed transaction
in the bgworker, because in this approach we don't know the last LSN of
the streamed transaction being applied, so we cannot get the LSN to
skip. I will think more about it and keep testing the patch.

Best regards,
Hou zj




On Thu, Apr 14, 2022 at 9:12 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, April 8, 2022 5:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> Attach a new version patch which improved the error handling and handled the case
> when there is no more worker available (will spill the data to the temp file in this case).
>
> Currently, it still doesn't support skip the streamed transaction in bgworker, because
> in this approach, we don't know the last lsn for the streamed transaction being applied,
> so cannot get the lsn to SKIP. I will think more about it and keep testing the patch.
>

I think we can avoid performing the streaming transaction by a
bgworker if skip_lsn is set. This needs some more thought, but anyway I
see another problem in this patch. I think we won't be able to decide
whether to apply the change for a relation that is not in the 'READY'
state (see should_apply_changes_for_rel) as we won't know
'remote_final_lsn' by that time for streaming transactions. I think
what we can do here is that, before assigning the transaction to a
bgworker, we check whether any of the rels is not in the 'READY' state;
if so, we make the transaction spill the changes as we do now. Even if
we do such a check, it is still possible that some rel on which this
transaction operates appears to be in a 'non-ready' state after the
bgworker has started, and for such a case I think we need to raise an
error and restart the transaction, as we have no way to know whether we
need to perform the operation on that rel. This is possible if the user
performs REFRESH PUBLICATION in parallel to this transaction, as that
can add a new rel to pg_subscription_rel.
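
Roughly, the pre-check I have in mind is something like the following
(illustrative only; 'use_bgworker' is a made-up variable, but
AllTablesyncsReady() and should_apply_changes_for_rel() are the existing
subscriber-side checks):

/* Before assigning the streamed transaction to a bgworker. */
bool    use_bgworker = true;

if (!AllTablesyncsReady())
{
    /*
     * Some relation is not yet in READY state and remote_final_lsn is
     * not known for a streamed transaction, so
     * should_apply_changes_for_rel() cannot be evaluated reliably.
     * Spill the changes to a file instead, as we do today.
     */
    use_bgworker = false;
}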

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"

On Tuesday, April 19, 2022 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Apr 14, 2022 at 9:12 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, April 8, 2022 5:14 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach a new version patch which improved the error handling and handled
> > the case when there is no more worker available (will spill the data to
> > the temp file in this case).
> >
> > Currently, it still doesn't support skip the streamed transaction in
> > bgworker, because in this approach, we don't know the last lsn for the
> > streamed transaction being applied, so cannot get the lsn to SKIP. I will
> > think more about it and keep testing the patch.
> >
> 
> I think we can avoid performing the streaming transaction by bgworker
> if skip_lsn is set. This needs some more thought but anyway I see
> another problem in this patch. I think we won't be able to make the
> decision whether to apply the change for a relation that is not in the
> 'READY' state (see should_apply_changes_for_rel) as we won't know
> 'remote_final_lsn' by that time for streaming transactions. I think
> what we can do here is that before assigning the transaction to
> bgworker, we can check if any of the rels is not in the 'READY' state,
> we can make the transaction spill the changes as we are doing now.
> Even if we do such a check, it is still possible that some rel on
> which this transaction is performing operation can appear to be in
> 'non-ready' state after starting bgworker and for such a case I think
> we need to give error and restart the transaction as we have no way to
> know whether we need to perform an operation on the 'rel'. This is
> possible if the user performs REFRESH PUBLICATION in parallel to this
> transaction as that can add a new rel to the pg_subscription_rel.

Changed as suggested.

Attached is the new version of the patch, which cleans up some code and
fixes the above problem. For now, it won't apply a streaming
transaction in a bgworker if skiplsn is set or any table is not in the
'READY' state.

Besides, the subscription streaming option is extended to
'on/off/apply (apply in bgworker)/spool (spool to file)' so that the
user can control whether to apply the transaction in a bgworker.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"

On Wednesday, April 20, 2022 4:57 PM houzj.fnst@fujitsu.com wrote:
> 
> On Tuesday, April 19, 2022 2:58 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Apr 14, 2022 at 9:12 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Friday, April 8, 2022 5:14 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach a new version patch which improved the error handling and handled
> > > the case when there is no more worker available (will spill the data to
> > > the temp file in this case).
> > >
> > > Currently, it still doesn't support skip the streamed transaction in
> > > bgworker, because in this approach, we don't know the last lsn for the
> > > streamed transaction being applied, so cannot get the lsn to SKIP. I will
> > > think more about it and keep testing the patch.
> > >
> >
> > I think we can avoid performing the streaming transaction by bgworker
> > if skip_lsn is set. This needs some more thought but anyway I see
> > another problem in this patch. I think we won't be able to make the
> > decision whether to apply the change for a relation that is not in the
> > 'READY' state (see should_apply_changes_for_rel) as we won't know
> > 'remote_final_lsn' by that time for streaming transactions. I think
> > what we can do here is that before assigning the transaction to
> > bgworker, we can check if any of the rels is not in the 'READY' state,
> > we can make the transaction spill the changes as we are doing now.
> > Even if we do such a check, it is still possible that some rel on
> > which this transaction is performing operation can appear to be in
> > 'non-ready' state after starting bgworker and for such a case I think
> > we need to give error and restart the transaction as we have no way to
> > know whether we need to perform an operation on the 'rel'. This is
> > possible if the user performs REFRESH PUBLICATION in parallel to this
> > transaction as that can add a new rel to the pg_subscription_rel.
> 
> Changed as suggested.
> 
> Attach the new version patch which cleanup some code and fix above problem.
> For now, it won't apply streaming transaction in bgworker if skiplsn is set or any
> table is not in 'READY' state.
> 
> Besides, extent the subscription streaming option to ('on/off/apply(apply in
> bgworker)/spool(spool to file)') so that user can control whether to apply The
> transaction in a bgworker.

Sorry, there was a mistake in the pg_dump test case which caused a
failure in the CFbot. Attached is a new version of the patch which
fixes that.

Best regards,
Hou zj

Hello Hou-san. Here are my review comments for v4-0001. Sorry, there
are so many of them (it is a big patch); some are trivial, and others
you might easily dismiss due to my misunderstanding of the code. But
hopefully, there are at least some comments that can be helpful in
improving the patch quality.

======

1. General comment - terms

There needs to be more consistency about what exactly you will call
this new worker. It is sometimes called "locally apply worker",
sometimes "bgworker", sometimes "subworker", sometimes "BGW", and
sometimes other variations. You need to pick ONE good name and then
update all the references/comments in the patch to use that name
consistently throughout.

~~~

2. General comment - option values

I felt the "streaming" option values ought to be different from what
this patch proposes, so this affected some of my following review
comments. (Later I give an example of what I thought the values should
be).
~~~

3. General comment - bool option change to enum

This option change for "streaming" is similar to the options change
for "copy_data=force" that Vignesh is doing for his "infinite
recursion" patch v9-0002 [1]. Yet they seem implemented differently
(i.e. char versus enum). I think you should discuss the 2 approaches
with Vignesh and then code these option changes in a consistent way.

~~~

4. General comment - worker.c globals

There seems to be a growing number of global variables in the worker.c
code. I was wondering whether that is really necessary, because the
logic becomes more intricate if you have to know that some global was
set up as a side-effect of some other function call. E.g. maybe if you
could do a few more HTAB lookups to identify the bgworker then you
might not need to rely on the globals so much?

======

5. Commit message - typo

and then on the subscriber-side, the apply worker writes the changes into
temporary files and once it receives the commit, it read from the file and
apply the entire transaction. To improve the performance of such transactions,

typo: "read" -> "reads"
typo: "apply" -> "applies"

~~~

6. Commit message - wording

In this approach, we assign a new bgworker (if available) as soon as the xact's
first stream came and the main apply worker will send changes to this new
worker via shared memory. The bgworker will directly apply the change instead
of writing it to temporary files.  We keep this worker assigned till the
transaction commit came and also wait for the worker to finish at commit. This

wording: "came" -> "is received" (2x)

~~~

7. Commit message - terms

(this is the same point as comment #1)

I think there is too much changing of terminology. IMO it will be
easier if you always just call the current main apply workers the
"apply worker" and always call this new worker the "bgworker" (or some
better name). But never just call it the "worker".

~~~

8. Commit message - typo

transaction commit came and also wait for the worker to finish at commit. This
preserves commit ordering and avoid writing to and reading from file in most
cases. We still need to spill if there is no worker available. We also need to

typo: "avoid" -> "avoids"

~~~

9. Commit message - wording/typo

Also extend the subscription streaming option so that user can control whether
apply the streaming transaction in a bgworker or spill the change to disk. User

wording: "Also extend" -> "This patch also extends"
typo: "whether apply" -> "whether to apply"

~~~

10. Commit message - option values

apply the streaming transaction in a bgworker or spill the change to disk. User
can set the streaming option to 'on/off', 'apply', 'spool'. For now, 'on' and

Those values do not really seem intuitive to me. E.g. if you set
"apply" then you already said above that sometimes it might have to
spool anyway if there were no bgworkers available. Why not just name
them like "on/off/parallel"?

(I have written more about this in a later comment #14)

======

11. doc/src/sgml/catalogs.sgml - wording

+       Controls in which modes we handle the streaming of in-progress
transactions.
+       <literal>f</literal> = disallow streaming of in-progress transactions

wording: "Controls in which modes we handle..." -> "Controls how to handle..."

~~~

12. doc/src/sgml/catalogs.sgml - wording

+       <literal>a</literal> = apply changes directly in background worker

wording: "in background worker" -> "using a background worker"

~~~

13. doc/src/sgml/catalogs.sgml - option values

Anyway, all this page will be different if I can persuade you to
change the option values (see comment #14)

======

14. doc/src/sgml/ref/create_subscription.sgml - option values

Since the default value is "off", I felt these options would be
better/simpler if they were just "off/on/parallel". Specifically, I
think "on" should behave the same as the current code does, so the user
has to deliberately choose to use this new bgworker approach.

e.g.
- "off" = off, same as current PG15
- "on" = on, same as current PG15
- "parallel" = try to use the new bgworker to apply stream

======

15. src/backend/commands/subscriptioncmds.c - SubOpts

Vignesh uses similar code for his "infinite recursion" patch being
developed [1] but he used an enum but here you use a char. I think you
should discuss together both decide to use either enum or char for the
member so there is a consistency.

~~~

16. src/backend/commands/subscriptioncmds.c - combine conditions

+ /*
+ * The set of strings accepted here should match up with the
+ * grammar's opt_boolean_or_string production.
+ */
+ if (pg_strcasecmp(sval, "true") == 0)
+ return SUBSTREAM_APPLY;
+ if (pg_strcasecmp(sval, "false") == 0)
+ return SUBSTREAM_OFF;
+ if (pg_strcasecmp(sval, "on") == 0)
+ return SUBSTREAM_APPLY;
+ if (pg_strcasecmp(sval, "off") == 0)
+ return SUBSTREAM_OFF;
+ if (pg_strcasecmp(sval, "spool") == 0)
+ return SUBSTREAM_SPOOL;
+ if (pg_strcasecmp(sval, "apply") == 0)
+ return SUBSTREAM_APPLY;

Because I think the possible option values should be different from
these, I can't comment much on this code, except to suggest that IMO
the if conditions should be combined where the options are considered
to be equivalent.
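
e.g. Just a sketch of what I mean, keeping the same return values as
the patch:

/* equivalent spellings handled together */
if (pg_strcasecmp(sval, "true") == 0 ||
    pg_strcasecmp(sval, "on") == 0 ||
    pg_strcasecmp(sval, "apply") == 0)
    return SUBSTREAM_APPLY;

if (pg_strcasecmp(sval, "false") == 0 ||
    pg_strcasecmp(sval, "off") == 0)
    return SUBSTREAM_OFF;

if (pg_strcasecmp(sval, "spool") == 0)
    return SUBSTREAM_SPOOL;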

======

17. src/backend/replication/logical/launcher.c - stop_worker

@@ -72,6 +72,7 @@ static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
 static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
+static void stop_worker(LogicalRepWorker *worker);

The function name does not seem consistent with the other similar static funcs.

~~~

18. src/backend/replication/logical/launcher.c - change if

@@ -225,7 +226,7 @@ logicalrep_worker_find(Oid subid, Oid relid, bool
only_running)
  LogicalRepWorker *w = &LogicalRepCtx->workers[i];

  if (w->in_use && w->subid == subid && w->relid == relid &&
- (!only_running || w->proc))
+ (!only_running || w->proc) && !w->subworker)
  {
Maybe the code would be easier to follow (and then you can comment it)
if you do something like:

/* TODO: comment here */
if (w->subworker)
continue;

~~~

19. src/backend/replication/logical/launcher.c -
logicalrep_worker_launch comment

@@ -262,9 +263,9 @@ logicalrep_workers_find(Oid subid, bool only_running)
 /*
  * Start new apply background worker, if possible.
  */
-void
+bool
 logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname, Oid userid,
- Oid relid)
+ Oid relid, dsm_handle subworker_dsm)

Saying "start new apply..." comment feels a bit misleading. E.g. this
is also called to start the sync worker. And also for the main apply
worker (which we are not really calling a "background worker" in other
places). So this is the same kind of terminology problem as my review
comment #1.

~~~

20. src/backend/replication/logical/launcher.c - asserts?

I thought maybe there should be some assertions in this code upfront.
E.g. cannot have OidIsValid(relid) and subworker_dsm valid at the same
time.

~~~

21. src/backend/replication/logical/launcher.c - terms

+ else
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication apply worker for subscription %u", subid);

I think the names of all these workers are still a bit vague in the
messages – e.g. "logical replication worker" versus "logical
replication apply worker" sound too similar to me. So this is kind of
the same as my review comment #1.

~~~

22. src/backend/replication/logical/launcher.c -
logicalrep_worker_stop double unlock?

@@ -450,6 +465,18 @@ logicalrep_worker_stop(Oid subid, Oid relid)
  return;
  }

+ stop_worker(worker);
+
+ LWLockRelease(LogicalRepWorkerLock);
+}

IIUC, sometimes the stop_worker() function might already release the
lock before it returns. In that case won't this other explicit lock
release be a problem?

~~~

23. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

@@ -600,6 +625,28 @@ logicalrep_worker_attach(int slot)
 static void
 logicalrep_worker_detach(void)
 {
+ /*
+ * If we are the main apply worker, stop all the sub apply workers we
+ * started before.
+ */
+ if (!MyLogicalRepWorker->subworker)
+ {
+ List *workers;
+ ListCell *lc;
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+ workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
+ foreach(lc, workers)
+ {
+ LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
+ if (w->subworker)
+ stop_worker(w);
+ }
+
+ LWLockRelease(LogicalRepWorkerLock);

Can this have the same double-unlock problem as I described in the
previous review comment #22?

~~~

24. src/backend/replication/logical/launcher.c - ApplyLauncherMain

@@ -869,7 +917,7 @@ ApplyLauncherMain(Datum main_arg)
  wait_time = wal_retrieve_retry_interval;

  logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
- sub->owner, InvalidOid);
+ sub->owner, InvalidOid, DSM_HANDLE_INVALID);
  }
Now that logicalrep_worker_launch is returning a bool, should this
call be checking the return value and taking appropriate action if it
failed?

======

25. src/backend/replication/logical/origin.c - acquire comment

+ /*
+ * We allow the apply worker to get the slot which is acquired by its
+ * leader process.
+ */
+ else if (curstate->acquired_by != 0 && acquire)

The comment was not very clear to me. Does the term "apply worker" in
the comment make sense, or should that say "bgworker"? This might be
another example of my review comment #1.

~~~

26. src/backend/replication/logical/origin.c - acquire code

+ /*
+ * We allow the apply worker to get the slot which is acquired by its
+ * leader process.
+ */
+ else if (curstate->acquired_by != 0 && acquire)
  {
  ereport(ERROR,

I somehow felt that this param would be better called 'skip_acquire',
so all the callers would have to use the opposite boolean and then
this code would say like below (which seemed easier to me). YMMV.

else if (curstate->acquired_by != 0 && !skip_acquire)
  {
  ereport(ERROR,

=====

27. src/backend/replication/logical/tablesync.c

@@ -568,7 +568,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
  MySubscription->oid,
  MySubscription->name,
  MyLogicalRepWorker->userid,
- rstate->relid);
+ rstate->relid,
+ DSM_HANDLE_INVALID);
  hentry->last_start_time = now;
Now that the logicalrep_worker_launch is returning a bool, should this
call be checking that the launch was successful before it changes the
last_start_time?

======

28. src/backend/replication/logical/worker.c - file comment

+ * 1) Separate background workers
+ *
+ * Assign a new bgworker (if available) as soon as the xact's first stream came
+ * and the main apply worker will send changes to this new worker via shared
+ * memory. We keep this worker assigned till the transaction commit came and
+ * also wait for the worker to finish at commit. This preserves commit ordering
+ * and avoid writing to and reading from file in most cases. We still need to
+ * spill if there is no worker available. We also need to allow stream_stop to
+ * complete by the background worker to finish it to avoid deadlocks because
+ * T-1's current stream of changes can update rows in conflicting order with
+ * T-2's next stream of changes.

This comment fragment looks the same as the commit message so the
typos/wording reported already for the commit message are applicable
here too.

~~~

29. src/backend/replication/logical/worker.c - file comment

+ * If no worker is available to handle streamed transaction, we write the data
  * to temporary files and then applied at once when the final commit arrives.

wording: "we write the data" -> "the data is written"

~~~

30. src/backend/replication/logical/worker.c - ParallelState

+typedef struct ParallelState

Add to typedefs.list

~~~

31. src/backend/replication/logical/worker.c - ParallelState flags

+typedef struct ParallelState
+{
+ slock_t mutex;
+ bool attached;
+ bool ready;
+ bool finished;
+ bool failed;
+ Oid subid;
+ TransactionId stream_xid;
+ uint32 n;
+} ParallelState;

Those bool states look independent to me. Should they be one enum
member instead of lots of bool members?
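
e.g. Something like this (just a sketch; the names are invented):

/* Hypothetical single status field replacing the individual bool flags */
typedef enum ApplyBgworkerStatus
{
    APPLY_BGWORKER_UNUSED = 0,
    APPLY_BGWORKER_ATTACHED,
    APPLY_BGWORKER_READY,
    APPLY_BGWORKER_FINISHED,
    APPLY_BGWORKER_FAILED
} ApplyBgworkerStatus;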

~~~

32. src/backend/replication/logical/worker.c - ParallelState comments

+typedef struct ParallelState
+{
+ slock_t mutex;
+ bool attached;
+ bool ready;
+ bool finished;
+ bool failed;
+ Oid subid;
+ TransactionId stream_xid;
+ uint32 n;
+} ParallelState;

Needs some comments. Some might be self-evident but some are not -
e.g. what is 'n'?

~~~

33. src/backend/replication/logical/worker.c - WorkerState

+typedef struct WorkerState

Add to typedefs.list

~~~

34. src/backend/replication/logical/worker.c - WorkerEntry

+typedef struct WorkerEntry

Add to typedefs.list

~~~

35. src/backend/replication/logical/worker.c - static function names

+/* Worker setup and interactions */
+static void setup_dsm(WorkerState *wstate);
+static WorkerState *setup_background_worker(void);
+static void wait_for_worker_ready(WorkerState *wstate, bool notify);
+static void wait_for_transaction_finish(WorkerState *wstate);
+static void send_data_to_worker(WorkerState *wstate, Size nbytes,
+ const void *data);
+static WorkerState *find_or_start_worker(TransactionId xid, bool start);
+static void free_stream_apply_worker(void);
+static bool transaction_applied_in_bgworker(TransactionId xid);
+static void check_workers_status(void);

All these new functions have random-looking names. Since they all are
new to this feature I thought they should all be named similarly...

e.g. something like
bgworker_setup
bgworker_check_status
bgworker_wait_for_ready
etc.

~~~

36. src/backend/replication/logical/worker.c - nchanges

+
+static uint32 nchanges = 0;
+

What is this? Needs a comment.

~~~

37. src/backend/replication/logical/worker.c - handle_streamed_transaction

 static bool
 handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
 {
- TransactionId xid;
+ TransactionId current_xid = InvalidTransactionId;

  /* not in streaming mode */
- if (!in_streamed_transaction)
+ if (!in_streamed_transaction && !isLogicalApplyWorker)
  return false;
Is it correct to be testing the isLogicalApplyWorker here?

e.g. What if the streaming code is not using bgworkers at all?

At least maybe that comment (/* not in streaming mode */) should be updated?

~~~

38. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ if (current_xid != stream_xid &&
+ !list_member_int(subxactlist, (int) current_xid))
+ {
+ MemoryContext oldctx;
+ char *spname = (char *) palloc(64 * sizeof(char));
+ sprintf(spname, "savepoint_for_xid_%u", current_xid);

Can't the name just be a char[64] on the stack?

~~~

39. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /*
+ * XXX The publisher side don't always send relation update message
+ * after the streaming transaction, so update the relation in main
+ * worker here.
+ */

typo: "don't" -> "doesn't" ?

~~~

40. src/backend/replication/logical/worker.c - apply_handle_commit_prepared

@@ -976,30 +1116,51 @@ apply_handle_commit_prepared(StringInfo s)
  char gid[GIDSIZE];

  logicalrep_read_commit_prepared(s, &prepare_data);
+
  set_apply_error_context_xact(prepare_data.xid, prepare_data.commit_lsn);

Spurious whitespace?

~~~

41. src/backend/replication/logical/worker.c - apply_handle_commit_prepared

+ /* Check if we have prepared transaction in another bgworker */
+ if (transaction_applied_in_bgworker(prepare_data.xid))
+ {
+ elog(DEBUG1, "received commit for streamed transaction %u", prepare_data.xid);

- /* There is no transaction when COMMIT PREPARED is called */
- begin_replication_step();
+ /* Send commit message */
+ send_data_to_worker(stream_apply_worker, s->len, s->data);

It seems a bit complex/tricky that the code is always relying on the
side-effect that the global stream_apply_worker has been set.

I am not sure if it is possible to remove the global and untangle
everything. E.g. why not change transaction_applied_in_bgworker to
return the bgworker (instead of returning a bool), and then you can
assign it to a local var in this function?

Or can't you do an HTAB lookup in a few more places instead of carrying
around the knowledge of some global var that was initialized in some
other place?

It would be easier if you could eliminate having to be aware of
side-effects happening behind the scenes.

~~~

42. src/backend/replication/logical/worker.c - apply_handle_rollback_prepared

@@ -1019,35 +1180,51 @@ apply_handle_rollback_prepared(StringInfo s)
  char gid[GIDSIZE];

  logicalrep_read_rollback_prepared(s, &rollback_data);
+
  set_apply_error_context_xact(rollback_data.xid,
rollback_data.rollback_end_lsn);

Spurious whitespace?

~~~

43. src/backend/replication/logical/worker.c - apply_handle_rollback_prepared

+ /* Check if we are processing the prepared transaction in a bgworker */
+ if (transaction_applied_in_bgworker(rollback_data.xid))
+ {
+ send_data_to_worker(stream_apply_worker, s->len, s->data);

Same as previous comment #41. Relies on the side effect of something
setting the global stream_apply_worker.

~~~

44. src/backend/replication/logical/worker.c - find_or_start_worker

+ /*
+ * For streaming transactions that is being applied in bgworker, we cannot
+ * decide whether to apply the change for a relation that is not in the
+ * READY state (see should_apply_changes_for_rel) as we won't know
+ * remote_final_lsn by that time. So, we don't start new bgworker in this
+ * case.
+ */

typo: "that is" -> "that are"

~~~

45. src/backend/replication/logical/worker.c - find_or_start_worker

+ if (MySubscription->stream != SUBSTREAM_APPLY)
+ return NULL;
...
+ else if (start && !XLogRecPtrIsInvalid(MySubscription->skiplsn))
+ return NULL;
...
+ else if (start && !AllTablesyncsReady())
+ return NULL;
+ else if (!start && ApplyWorkersHash == NULL)
+ return NULL;

I am not sure but I think most of that rejection if/else can probably
just be "if" (not "else if") because otherwise, the code would have
returned anyhow, right? Removing all the "else" might make the code
more readable.
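
e.g. A sketch of the flattened checks:

if (MySubscription->stream != SUBSTREAM_APPLY)
    return NULL;
if (start && !XLogRecPtrIsInvalid(MySubscription->skiplsn))
    return NULL;
if (start && !AllTablesyncsReady())
    return NULL;
if (!start && ApplyWorkersHash == NULL)
    return NULL;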

~~~

46. src/backend/replication/logical/worker.c - find_or_start_worker

+ if (wstate == NULL)
+ {
+ /*
+ * If there is no more worker can be launched here, remove the
+ * entry in hash table.
+ */
+ hash_search(ApplyWorkersHash, &xid, HASH_REMOVE, &found);
+ return NULL;
+ }

wording: "If there is no more worker can be launched here, remove" ->
"If the bgworker cannot be launched, remove..."

~~~

47. src/backend/replication/logical/worker.c - free_stream_apply_worker

+/*
+ * Add the worker to the freelist and remove the entry from hash table.
+ */
+static void
+free_stream_apply_worker(void)

IMO it might be better to pass the bgworker here instead of silently
working with the global stream_apply_worker.

~~~

48. src/backend/replication/logical/worker.c - free_stream_apply_worker

+ elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+ stream_apply_worker->pstate->n, stream_apply_worker->pstate->stream_xid);

Should there be an Assert here to check that the bgworker state really
was FINISHED?

~~~

49. src/backend/replication/logical/worker.c - serialize_stream_prepare

+static void
+serialize_stream_prepare(LogicalRepPreparedTxnData *prepare_data)

Missing function comment.

~~~

50. src/backend/replication/logical/worker.c - serialize_stream_start

-/*
- * Handle STREAM START message.
- */
 static void
-apply_handle_stream_start(StringInfo s)
+serialize_stream_start(bool first_segment)

Missing function comment.

~~~

51. src/backend/replication/logical/worker.c - serialize_stream_stop

+static void
+serialize_stream_stop()
+{

Missing function comment.

~~~

52. src/backend/replication/logical/worker.c - general serialize_XXXX

I can see now that you have created many serialize_XXX functions which
seem to only be called one time. It looks like the only purpose is to
encapsulate the code to make the handler function shorter? But it
seems a bit uneven that you did this only for the serialize cases. If
you really want these separate functions then perhaps there ought to
also be the equivalent bgworker functions too. There seem to be always
3 scenarios:

i.e
1. Worker is the bgworker
2. Worker is Main Apply but a bgworker exists
3. Worker is Main apply and bgworker does not exist.

Perhaps every handler function should have THREE other little
functions that it calls appropriately?

~~~

53. src/backend/replication/logical/worker.c - serialize_stream_abort

+
+static void
+serialize_stream_abort(TransactionId xid, TransactionId subxid)
+{

Missing function comment.

~~~

54. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ if (isLogicalApplyWorker)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("[Apply BGW #%u] aborting current transaction xid=%u, subxid=%u",
+ MyParallelState->n, GetCurrentTransactionIdIfAny(),
GetCurrentSubTransactionId())));

Why is the errcode using errcode_for_file_access? (2x)

~~~

55. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /*
+ * OK, so it's a subxact. Rollback to the savepoint.
+ *
+ * We also need to read the subxactlist, determine the offset
+ * tracked for the subxact, and truncate the list.
+ */
+ int i;
+ bool found = false;
+ char *spname = (char *) palloc(64 * sizeof(char));

Can that just be char[64] on the stack?

~~~

56. src/backend/replication/logical/worker.c - apply_dispatch

@@ -2511,6 +3061,7 @@ apply_dispatch(StringInfo s)
  break;

  case LOGICAL_REP_MSG_STREAM_START:
+ elog(LOG, "LOGICAL_REP_MSG_STREAM_START");
  apply_handle_stream_start(s);
  break;

I guess this is just for debugging purposes so you should put some
FIXME comment here as a reminder to get rid of it later?

~~~

57. src/backend/replication/logical/worker.c - store_flush_position,
isLogicalApplyWorker

@@ -2618,6 +3169,10 @@ store_flush_position(XLogRecPtr remote_lsn)
 {
  FlushPosition *flushpos;

+ /* We only need to collect the LSN in main apply worker */
+ if (isLogicalApplyWorker)
+ return;
+

This comment is not specific to this function, but for global
isLogicalApplyWorker IMO this should be implemented to look more like
the inline function am_tablesync_worker().

e.g. I think you should replace this global with something like
am_apply_bgworker()

Maybe it should do something like check the value of
MyLogicalRepWorker->subworker?
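
e.g. Something like this (sketch only, assuming the patch's
LogicalRepWorker 'subworker' flag):

/* Hypothetical helper, modeled on am_tablesync_worker() */
static inline bool
am_apply_bgworker(void)
{
    return MyLogicalRepWorker->subworker;
}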

~~~

58. src/backend/replication/logical/worker.c - LogicalRepApplyLoop

@@ -3467,6 +4025,7 @@ TwoPhaseTransactionGid(Oid subid, TransactionId
xid, char *gid, int szgid)
  snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
 }

+
 /*
  * Execute the initial sync with error handling. Disable the subscription,
  * if it's required.

Spurious whitespace

~~~

59. src/backend/replication/logical/worker.c - ApplyWorkerMain

@@ -3733,7 +4292,7 @@ ApplyWorkerMain(Datum main_arg)

  options.proto.logical.publication_names = MySubscription->publications;
  options.proto.logical.binary = MySubscription->binary;
- options.proto.logical.streaming = MySubscription->stream;
+ options.proto.logical.streaming = (MySubscription->stream != SUBSTREAM_OFF);
  options.proto.logical.twophase = false;

I was not sure why this is converting from an enum to a boolean? Is it right?

~~~

60. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

+ shmq_res = shm_mq_receive(mqh, &len, &data, false);
+
+ if (shmq_res != SHM_MQ_SUCCESS)
+ break;

Should this log some more error information here?

~~~

61. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

+ if (len == 0)
+ {
+ elog(LOG, "[Apply BGW #%u] got zero-length message, stopping", pst->n);
+ break;
+ }
+ else
+ {
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimestampTz send_time;

Maybe the "else" is not needed here, and if you remove it then it will
get rid of all the unnecessary indentation.

~~~

62. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

+ /*
+ * We use first byte of message for additional communication between
+ * main Logical replication worker and Apply BGWorkers, so if it
+ * differs from 'w', then process it first.
+ */


I was thinking maybe this switch should include:

case 'w':
    break;

because then for the "default" case you should give an ERROR, since
something unexpected has arrived.

~~~

63. src/backend/replication/logical/worker.c - ApplyBgwShutdown

+static void
+ApplyBgwShutdown(int code, Datum arg)
+{
+ SpinLockAcquire(&MyParallelState->mutex);
+ MyParallelState->failed = true;
+ SpinLockRelease(&MyParallelState->mutex);
+
+ dsm_detach((dsm_segment *) DatumGetPointer(arg));
+}

Should this do detach first and set the flag last?

~~~

64. src/backend/replication/logical/worker.c - LogicalApplyBgwMain

+ /*
+ * Acquire a worker number.
+ *
+ * By convention, the process registering this background worker should
+ * have stored the control structure at key 0.  We look up that key to
+ * find it.  Our worker number gives our identity: there may be just one
+ * worker involved in this parallel operation, or there may be many.
+ */

Maybe there should be another elog closer to this comment, so that as
soon as you know the BGW number you can log something?

e.g.
elog(LOG, "[Apply BGW #%u] starting", pst->n);

~~~

65. src/backend/replication/logical/worker.c - setup_background_worker

+/*
+ * Register background workers.
+ */
+static WorkerState *
+setup_background_worker(void)

I think that comment needs some more info because it is doing more
than just registering... it is successfully launching the worker
first.

~~~

66. src/backend/replication/logical/worker.c - setup_background_worker

+ if (launched)
+ {
+ /* Wait for worker to become ready. */
+ wait_for_worker_ready(wstate, false);
+
+ ApplyWorkersList = lappend(ApplyWorkersList, wstate);
+ nworkers += 1;
+ }

Do you really need to carry around this global 'nworkers' variable?
Can't you just check the length of the ApplyWorkersList to get this
number?

~~~

67. src/backend/replication/logical/worker.c - send_data_to_worker

+/*
+ * Send the data to worker via shared-memory queue.
+ */
+static void
+send_data_to_worker(WorkerState *wstate, Size nbytes, const void *data)

wording: "to worker" -> "to the specified apply bgworker"

This is just another example of my comment #1.

~~~

68. src/backend/replication/logical/worker.c - send_data_to_worker

+ if (result != SHM_MQ_SUCCESS)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not send tuple to shared-memory queue")));
+}

typo: is "tuple" the right word here?

~~~

69. src/backend/replication/logical/worker.c - wait_for_worker_ready

+
+static void
+wait_for_worker_ready(WorkerState *wstate, bool notify)
+{

Missing function comment.

~~~

70. src/backend/replication/logical/worker.c - wait_for_worker_ready

+
+static void
+wait_for_worker_ready(WorkerState *wstate, bool notify)
+{

'notify' seems a bit of a poor name here. And this param seems a bit
of a strange side-effect for something called wait_for_worker_ready. If
you really need to do it this way, maybe name it something more verbose
like 'notify_received_stream_stop'?

~~~

71. src/backend/replication/logical/worker.c - wait_for_worker_ready

+ if (!result)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("one or more background workers failed to start")));

Is the ERROR code reachable? IIUC there is no escape from the previous
for (;;) loop except when the result is set to true.

~~~

72. src/backend/replication/logical/worker.c - wait_for_transaction_finish

+
+static void
+wait_for_transaction_finish(WorkerState *wstate)
+{

Missing function comment.

~~~

73. src/backend/replication/logical/worker.c - wait_for_transaction_finish

+ if (finished)
+ {
+ break;
+ }

The brackets are not needed for 1 statement.

~~~

74. src/backend/replication/logical/worker.c - transaction_applied_in_bgworker

+static bool
+transaction_applied_in_bgworker(TransactionId xid)

Instead of side-effect assigning the global variable, why not return
the bgworker (or NULL) and let the caller work with the result?

~~~

75. src/backend/replication/logical/worker.c - check_workers_status

+/*
+ * Check the status of workers and report an error if any bgworker exit
+ * unexpectedly.

wording: -> "... if any bgworker has exited unexpectedly ..."

~~~

76. src/backend/replication/logical/worker.c - check_workers_status

+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Background worker %u exited unexpectedly",
+ wstate->pstate->n)));

Should that message also give more identifying info about the
*current* worker raising the ERROR - e.g. the one which found that the
other bgworker had failed? Or is the PID in the log message good
enough?

~~~

77. src/backend/replication/logical/worker.c - check_workers_status

+ if (!AllTablesyncsReady() && nfreeworkers != list_length(ApplyWorkersList))
+ {

I did not really understand this code, but isn't there a possibility
that it will cause many restarts if the tablesyncs are taking a long
time to complete?

======

78. src/include/catalog/pg_subscription.

@@ -122,6 +122,18 @@ typedef struct Subscription
  List    *publications; /* List of publication names to subscribe to */
 } Subscription;

+/* Disallow streaming in-progress transactions */
+#define SUBSTREAM_OFF 'f'
+
+/*
+ * Streaming transactions are written to a temporary file and applied only
+ * after the transaction is committed on upstream.
+ */
+#define SUBSTREAM_SPOOL 's'
+
+/* Streaming transactions are applied immediately via a background worker */
+#define SUBSTREAM_APPLY 'a'

IIRC Vignesh had a similar options requirement for his "infinite
recursion" patch [1], except he was using enums instead of #define for
char. Maybe discuss with Vignesh (and either he should change or you
should change) so there is a consistent code style for the options.

======

79. src/include/replication/logicalproto.h - old extern

@@ -243,8 +243,10 @@ extern TransactionId
logicalrep_read_stream_start(StringInfo in,
 extern void logicalrep_write_stream_stop(StringInfo out);
 extern void logicalrep_write_stream_commit(StringInfo out,
ReorderBufferTXN *txn,
     XLogRecPtr commit_lsn);
-extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+extern TransactionId logicalrep_read_stream_commit_old(StringInfo out,
     LogicalRepCommitData *commit_data);

Is anybody still using this "old" function? Maybe I missed it.

======

80. src/include/replication/logicalworker.h

@@ -13,6 +13,7 @@
 #define LOGICALWORKER_H

 extern void ApplyWorkerMain(Datum main_arg);
+extern void LogicalApplyBgwMain(Datum main_arg);

The new name seems inconsistent with the old one. What about calling
it ApplyBgworkerMain?

======

81. src/test/regress/expected/subscription.out

Isn't this missing some test cases for the newly added options? E.g. I
never see the streaming value set to 's'.

======

82. src/test/subscription/t/029_on_error.pl

If the option values were changed as I suggested (review comment #14)
then I think a change such as this would not be necessary, because
everything would be backward compatible.


------
[1] https://www.postgresql.org/message-id/CALDaNm2Fe%3Dg4Tx-DhzwD6NU0VRAfaPedXwWO01maNU7_OfS8fw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"

On Friday, April 22, 2022 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Hello Hou-san. Here are my review comments for v4-0001. Sorry, there
> are so many of them (it is a big patch); some are trivial, and others
> you might easily dismiss due to my misunderstanding of the code. But
> hopefully, there are at least some comments that can be helpful in
> improving the patch quality.

Thanks for the comments! I think most of the comments make sense, and
here are explanations for some of them.

> 24. src/backend/replication/logical/launcher.c - ApplyLauncherMain
> 
> @@ -869,7 +917,7 @@ ApplyLauncherMain(Datum main_arg)
>   wait_time = wal_retrieve_retry_interval;
> 
>   logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
> - sub->owner, InvalidOid);
> + sub->owner, InvalidOid, DSM_HANDLE_INVALID);
>   }
> Now that the logicalrep_worker_launch is retuning a bool, should this
> call be checking the return value and taking appropriate action if it
> failed?

I'm not sure we should change the logic of the existing caller. I
think only the new caller added in the patch needs to check this.


> 26. src/backend/replication/logical/origin.c - acquire code
> 
> + /*
> + * We allow the apply worker to get the slot which is acquired by its
> + * leader process.
> + */
> + else if (curstate->acquired_by != 0 && acquire)
>   {
>   ereport(ERROR,
> 
> I somehow felt that this param would be better called 'skip_acquire',
> so all the callers would have to use the opposite boolean and then
> this code would say like below (which seemed easier to me). YMMV.
> 
> else if (curstate->acquired_by != 0 && !skip_acquire)
>   {
>   ereport(ERROR,

Not sure about this.


> 59. src/backend/replication/logical/worker.c - ApplyWorkerMain
> 
> @@ -3733,7 +4292,7 @@ ApplyWorkerMain(Datum main_arg)
> 
>   options.proto.logical.publication_names = MySubscription->publications;
>   options.proto.logical.binary = MySubscription->binary;
> - options.proto.logical.streaming = MySubscription->stream;
> + options.proto.logical.streaming = (MySubscription->stream != SUBSTREAM_OFF);
>   options.proto.logical.twophase = false;
>
> I was not sure why this is converting from an enum to a boolean? Is it right?

I think it's OK; "logical.streaming" is used on the publisher, which
doesn't need to know the exact type of streaming (it only needs to know
whether streaming is enabled, for now).


> 63. src/backend/replication/logical/worker.c - ApplyBgwShutdown
> 
> +static void
> +ApplyBgwShutdown(int code, Datum arg)
> +{
> + SpinLockAcquire(&MyParallelState->mutex);
> + MyParallelState->failed = true;
> + SpinLockRelease(&MyParallelState->mutex);
> +
> + dsm_detach((dsm_segment *) DatumGetPointer(arg));
> +}
> 
> Should this do detach first and set the flag last?

I'm not sure about this. I think it's fine to detach at the end.

> 76. src/backend/replication/logical/worker.c - check_workers_status
> 
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("Background worker %u exited unexpectedly",
> + wstate->pstate->n)));
> 
> Should that message also give more identifying info about the
> *current* worker doing the ERROR - e.g.the one which found this the
> other bgworker was failed? Or is that just the PIC in the log message
> good enough?

Currently, only the main apply worker should report this error, so I'm
not sure whether we need to report the current worker.

> 77. src/backend/replication/logical/worker.c - check_workers_status
> 
> + if (!AllTablesyncsReady() && nfreeworkers != list_length(ApplyWorkersList))
> + {
> 
> I did not really understand this code, but isn't there a possibility
> that it will cause many restarts if the tablesyncs are taking a long
> time to complete?

I think it's ok. After restarting, we won't start a bgworker until all the
tables are in READY state.

Best regards,
Hou zj





On Fri, Apr 8, 2022 at 2:44 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > In this email, I would like to discuss allowing streaming logical
> > transactions (large in-progress transactions) by background workers
> > and parallel apply in general. The goal of this work is to improve the
> > performance of the apply work in logical replication.
> >
> > Currently, for large transactions, the publisher sends the data in
> > multiple streams (changes divided into chunks depending upon
> > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > worker writes the changes into temporary files and once it receives
> > the commit, it read from the file and apply the entire transaction. To
> > improve the performance of such transactions, we can instead allow
> > them to be applied via background workers. There could be multiple
> > ways to achieve this:
> >
> > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > first stream came and the main apply worker will send changes to this
> > new worker via shared memory. We keep this worker assigned till the
> > transaction commit came and also wait for the worker to finish at
> > commit. This preserves commit ordering and avoid writing to and
> > reading from file in most cases. We still need to spill if there is no
> > worker available. We also need to allow stream_stop to complete by the
> > background worker to finish it to avoid deadlocks because T-1's
> > current stream of changes can update rows in conflicting order with
> > T-2's next stream of changes.
> >
>
> Attach the POC patch for the Approach-1 of "Perform streaming logical
> transactions by background workers". The patch is still a WIP patch as
> there are serval TODO items left, including:
>
> * error handling for bgworker
> * support for SKIP the transaction in bgworker
> * handle the case when there is no more worker available
>   (might need spill the data to the temp file in this case)
> * some potential bugs
>
> The original patch is borrowed from an old thread[1] and was rebased and
> extended/cleaned by me. Comments and suggestions are welcome.
>
> [1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru
>
> Here are some performance results of the patch shared by Shi Yu off-list.
>
> The performance was tested by varying
> logical_decoding_work_mem, which include two cases:
>
> 1) bulk insert.
> 2) create savepoint and rollback to savepoint.
>
> I used synchronous logical replication in the test, and compared SQL execution
> times before and after applying the patch.
>
> The results are as follows. The bar charts and the details of the test are
> attached as well.
>
> RESULT - bulk insert (5kk)
> ----------------------------------
> logical_decoding_work_mem   64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
> HEAD                        51.673  51.199  51.166  50.259  52.898  50.651  51.156  51.210  50.678  51.256  51.138
> patched                     36.198  35.123  34.223  29.198  28.712  29.090  29.709  29.408  34.367  34.716  35.439
>
> RESULT - rollback to savepoint (600k)
> ----------------------------------
> logical_decoding_work_mem   64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
> HEAD                        31.101  31.087  30.931  31.015  30.920  31.109  30.863  31.008  30.875  30.775  29.903
> patched                     28.115  28.487  27.804  28.175  27.734  29.047  28.279  27.909  28.277  27.345  28.375
>
>
> Summary:
> 1) bulk insert
>
> For different logical_decoding_work_mem size, it takes about 30% ~ 45% less
> time, which looks good to me. After applying this patch, it seems that the
> performance is better when logical_decoding_work_mem is between 512kB and 8MB.
>
> 2) rollback to savepoint
>
> There is an improvement of about 5% ~ 10% after applying this patch.
>
> In this case, the patch spends less time handling the part that is not
> rolled back, because it saves the time spent writing the changes into a temporary
> file and reading them back. For the part that is rolled back, it spends more
> time than HEAD, because applying the changes and rolling them back takes longer
> than writing them to a temporary file and truncating the file. Overall, the
> results look good.

One comment on the design:
We should have a strategy to release workers that have completed
applying their transactions; otherwise, even though one subscription has
some idle workers, they cannot be used by other subscriptions.
Consider the following case:
Let's say max_logical_replication_workers is set to 10. If
subscription sub_1 uses all 10 workers to apply transactions,
and all 10 workers have finished applying them, and then
subscription sub_2 requests some workers for applying
transactions, sub_2 will not get any workers.
Maybe if the workers have completed applying their transactions,
subscription sub_2 should be able to get these workers in this case.
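
Just to illustrate the idea, here is a very rough sketch of what such a release
strategy could look like. ApplyWorkersFreeList, APPLY_BGWORKERS_KEEP_IDLE, and the
'worker' field of WorkerState are names I made up for this example;
logicalrep_worker_stop_internal() is the helper added by the patch. The main
apply worker could call this after finishing a streamed transaction:

static void
apply_bgworker_release_idle(void)
{
    /* Keep a small number of idle workers for reuse; stop the rest. */
    while (list_length(ApplyWorkersFreeList) > APPLY_BGWORKERS_KEEP_IDLE)
    {
        WorkerState *wstate = (WorkerState *) linitial(ApplyWorkersFreeList);

        ApplyWorkersFreeList = list_delete_first(ApplyWorkersFreeList);

        /*
         * Stop the bgworker so its max_logical_replication_workers slot
         * becomes available to other subscriptions.
         */
        logicalrep_worker_stop_internal(wstate->worker);
    }
}

That way a subscription only pins worker slots while it actually has streamed
transactions in flight.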

Regards,
Vignesh



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Monday, April 25, 2022 4:35 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> On Friday, April 22, 2022 12:12 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > Hello Hou-san. Here are my review comments for v4-0001. Sorry, there
> > are so many of them (it is a big patch); some are trivial, and others
> > you might easily dismiss due to my misunderstanding of the code. But
> > hopefully, there are at least some comments that can be helpful in
> > improving the patch quality.
> 
> Thanks for the comments !
> I think most of the comments make sense and here are explanations for some
> of them.

Hi,

I addressed the rest of Peter's comments and here is a new version of the patch.

The naming of the newly introduced option and worker might
need more thought, so I haven't changed all of them yet. I will think it over
and change them later.

One comment I didn't address:
> 3. General comment - bool option change to enum
> 
> This option change for "streaming" is similar to the options change
> for "copy_data=force" that Vignesh is doing for his "infinite
> recursion" patch v9-0002 [1]. Yet they seem implemented differently
> (i.e. char versus enum). I think you should discuss the 2 approaches
> with Vignesh and then code these option changes in a consistent way.
> 
> [1] https://www.postgresql.org/message-id/CALDaNm2Fe%3Dg4Tx-DhzwD6NU0VRAfaPedXwWO01maNU7_OfS8fw%40mail.gmail.> com

I think the "streaming" option is a bit different from the "copy_data" option.
Because the "streaming" is a column of the system table (pg_subscription) which
should use "char" type to represent different values in this case(For example:
pg_class.relkind/pg_class.relpersistence/pg_class.relreplident ...).

And the "copy_data" option is not a system table column and I think it's fine
to use Enum for it.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "shiy.fnst@fujitsu.com"
On Fri, Apr 29, 2022 10:07 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> 
> I addressed the rest of Peter's comments and here is a new version patch.
> 

Thanks for your patch.

The patch modifies the streaming option in logical replication; it can now be set
to 'on', 'off', or 'apply'. The new value 'apply' wasn't covered by the TAP tests.
Attached is a patch which modifies the subscription TAP tests to cover both the
'on' and 'apply' values. (The main patch is also attached to keep cfbot happy.)

Besides, I noticed that for two-phase commit transactions, if a transaction is
prepared by a background worker, that background worker is also asked to handle
the subsequent message that commits or rolls back the prepared transaction. Is it
possible for the commit/rollback prepared messages to be handled by the apply
worker directly?

Regards,
Shi yu

On Fri, Apr 8, 2022 at 6:14 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > In this email, I would like to discuss allowing streaming logical
> > transactions (large in-progress transactions) by background workers
> > and parallel apply in general. The goal of this work is to improve the
> > performance of the apply work in logical replication.
> >
> > Currently, for large transactions, the publisher sends the data in
> > multiple streams (changes divided into chunks depending upon
> > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > worker writes the changes into temporary files and once it receives
> > the commit, it read from the file and apply the entire transaction. To
> > improve the performance of such transactions, we can instead allow
> > them to be applied via background workers. There could be multiple
> > ways to achieve this:
> >
> > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > first stream came and the main apply worker will send changes to this
> > new worker via shared memory. We keep this worker assigned till the
> > transaction commit came and also wait for the worker to finish at
> > commit. This preserves commit ordering and avoid writing to and
> > reading from file in most cases. We still need to spill if there is no
> > worker available. We also need to allow stream_stop to complete by the
> > background worker to finish it to avoid deadlocks because T-1's
> > current stream of changes can update rows in conflicting order with
> > T-2's next stream of changes.
> >
>
> Attach the POC patch for the Approach-1 of "Perform streaming logical
> transactions by background workers". The patch is still a WIP patch as
> there are serval TODO items left, including:
>
> * error handling for bgworker
> * support for SKIP the transaction in bgworker
> * handle the case when there is no more worker available
>   (might need spill the data to the temp file in this case)
> * some potential bugs

Are you planning to support "Transaction dependency" Amit mentioned in
his first mail in this patch? IIUC since the background apply worker
applies the streamed changes as soon as receiving them from the main
apply worker, a conflict that doesn't happen in the current streaming
logical replication could happen.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 8, 2022 at 6:14 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > In this email, I would like to discuss allowing streaming logical
> > > transactions (large in-progress transactions) by background workers
> > > and parallel apply in general. The goal of this work is to improve the
> > > performance of the apply work in logical replication.
> > >
> > > Currently, for large transactions, the publisher sends the data in
> > > multiple streams (changes divided into chunks depending upon
> > > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > > worker writes the changes into temporary files and once it receives
> > > the commit, it read from the file and apply the entire transaction. To
> > > improve the performance of such transactions, we can instead allow
> > > them to be applied via background workers. There could be multiple
> > > ways to achieve this:
> > >
> > > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > > first stream came and the main apply worker will send changes to this
> > > new worker via shared memory. We keep this worker assigned till the
> > > transaction commit came and also wait for the worker to finish at
> > > commit. This preserves commit ordering and avoid writing to and
> > > reading from file in most cases. We still need to spill if there is no
> > > worker available. We also need to allow stream_stop to complete by the
> > > background worker to finish it to avoid deadlocks because T-1's
> > > current stream of changes can update rows in conflicting order with
> > > T-2's next stream of changes.
> > >
> >
> > Attach the POC patch for the Approach-1 of "Perform streaming logical
> > transactions by background workers". The patch is still a WIP patch as
> > there are serval TODO items left, including:
> >
> > * error handling for bgworker
> > * support for SKIP the transaction in bgworker
> > * handle the case when there is no more worker available
> >   (might need spill the data to the temp file in this case)
> > * some potential bugs
>
> Are you planning to support "Transaction dependency" Amit mentioned in
> his first mail in this patch? IIUC since the background apply worker
> applies the streamed changes as soon as receiving them from the main
> apply worker, a conflict that doesn't happen in the current streaming
> logical replication could happen.
>

This patch seems to be waiting for stream_stop to finish, so I don't
see how the issues related to "Transaction dependency" can arise. What
type of conflicts/issues do you have in mind?


-- 
With Regards,
Amit Kapila.



On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 8, 2022 at 6:14 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wednesday, April 6, 2022 1:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > In this email, I would like to discuss allowing streaming logical
> > > > transactions (large in-progress transactions) by background workers
> > > > and parallel apply in general. The goal of this work is to improve the
> > > > performance of the apply work in logical replication.
> > > >
> > > > Currently, for large transactions, the publisher sends the data in
> > > > multiple streams (changes divided into chunks depending upon
> > > > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > > > worker writes the changes into temporary files and once it receives
> > > > the commit, it read from the file and apply the entire transaction. To
> > > > improve the performance of such transactions, we can instead allow
> > > > them to be applied via background workers. There could be multiple
> > > > ways to achieve this:
> > > >
> > > > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > > > first stream came and the main apply worker will send changes to this
> > > > new worker via shared memory. We keep this worker assigned till the
> > > > transaction commit came and also wait for the worker to finish at
> > > > commit. This preserves commit ordering and avoid writing to and
> > > > reading from file in most cases. We still need to spill if there is no
> > > > worker available. We also need to allow stream_stop to complete by the
> > > > background worker to finish it to avoid deadlocks because T-1's
> > > > current stream of changes can update rows in conflicting order with
> > > > T-2's next stream of changes.
> > > >
> > >
> > > Attach the POC patch for the Approach-1 of "Perform streaming logical
> > > transactions by background workers". The patch is still a WIP patch as
> > > there are serval TODO items left, including:
> > >
> > > * error handling for bgworker
> > > * support for SKIP the transaction in bgworker
> > > * handle the case when there is no more worker available
> > >   (might need spill the data to the temp file in this case)
> > > * some potential bugs
> >
> > Are you planning to support "Transaction dependency" Amit mentioned in
> > his first mail in this patch? IIUC since the background apply worker
> > applies the streamed changes as soon as receiving them from the main
> > apply worker, a conflict that doesn't happen in the current streaming
> > logical replication could happen.
> >
>
> This patch seems to be waiting for stream_stop to finish, so I don't
> see how the issues related to "Transaction dependency" can arise? What
> type of conflict/issues you have in mind?

Suppose we set both publisher and subscriber:

On publisher:
create table test (i int);
insert into test values (0);
create publication test_pub for table test;

On subscriber:
create table test (i int primary key);
create subscription test_sub connection '...' publication test_pub; --
value 0 is replicated via initial sync

Now, both 'test' tables have value 0.

And suppose two concurrent transactions are executed on the publisher
in following order:

TX-1:
begin;
insert into test select generate_series(0, 10000); -- changes will be streamed;

    TX-2:
    begin;
    delete from test where i = 0;
    commit;

TX-1:
commit;

With the current streaming logical replication, these changes will be
applied successfully since the deletion is applied before the
(streamed) insertion. Whereas with the apply bgworker, it fails due to
a unique constraint violation since the insertion is applied first.
I've confirmed that this happens with the v5 patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > >
> > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > his first mail in this patch? IIUC since the background apply worker
> > > applies the streamed changes as soon as receiving them from the main
> > > apply worker, a conflict that doesn't happen in the current streaming
> > > logical replication could happen.
> > >
> >
> > This patch seems to be waiting for stream_stop to finish, so I don't
> > see how the issues related to "Transaction dependency" can arise? What
> > type of conflict/issues you have in mind?
>
> Suppose we set both publisher and subscriber:
>
> On publisher:
> create table test (i int);
> insert into test values (0);
> create publication test_pub for table test;
>
> On subscriber:
> create table test (i int primary key);
> create subscription test_sub connection '...' publication test_pub; --
> value 0 is replicated via initial sync
>
> Now, both 'test' tables have value 0.
>
> And suppose two concurrent transactions are executed on the publisher
> in following order:
>
> TX-1:
> begin;
> insert into test select generate_series(0, 10000); -- changes will be streamed;
>
>     TX-2:
>     begin;
>     delete from test where c = 0;
>     commit;
>
> TX-1:
> commit;
>
> With the current streaming logical replication, these changes will be
> applied successfully since the deletion is applied before the
> (streamed) insertion. Whereas with the apply bgworker, it fails due to
> an unique constraint violation since the insertion is applied first.
> I've confirmed that it happens with v5 patch.
>

Good point, but I am not completely sure if doing transaction
dependency tracking for such cases is really worth it. I feel that for such
concurrent cases users can get conflicts now as well; it is just a
matter of timing. Another point: to check transaction dependencies, we
might need to spill the data for streaming transactions, in which case
we might lose all the benefits of doing it via a background worker. Do
we see any simple way to avoid this?


-- 
With Regards,
Amit Kapila.



On Tue, May 3, 2022 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > >
> > > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > > his first mail in this patch? IIUC since the background apply worker
> > > > applies the streamed changes as soon as receiving them from the main
> > > > apply worker, a conflict that doesn't happen in the current streaming
> > > > logical replication could happen.
> > > >
> > >
> > > This patch seems to be waiting for stream_stop to finish, so I don't
> > > see how the issues related to "Transaction dependency" can arise? What
> > > type of conflict/issues you have in mind?
> >
> > Suppose we set both publisher and subscriber:
> >
> > On publisher:
> > create table test (i int);
> > insert into test values (0);
> > create publication test_pub for table test;
> >
> > On subscriber:
> > create table test (i int primary key);
> > create subscription test_sub connection '...' publication test_pub; --
> > value 0 is replicated via initial sync
> >
> > Now, both 'test' tables have value 0.
> >
> > And suppose two concurrent transactions are executed on the publisher
> > in following order:
> >
> > TX-1:
> > begin;
> > insert into test select generate_series(0, 10000); -- changes will be streamed;
> >
> >     TX-2:
> >     begin;
> >     delete from test where c = 0;
> >     commit;
> >
> > TX-1:
> > commit;
> >
> > With the current streaming logical replication, these changes will be
> > applied successfully since the deletion is applied before the
> > (streamed) insertion. Whereas with the apply bgworker, it fails due to
> > an unique constraint violation since the insertion is applied first.
> > I've confirmed that it happens with v5 patch.
> >
>
> Good point but I am not completely sure if doing transaction
> dependency tracking for such cases is really worth it. I feel for such
> concurrent cases users can anyway now also get conflicts, it is just a
> matter of timing. One more thing to check transaction dependency, we
> might need to spill the data for streaming transactions in which case
> we might lose all the benefits of doing it via a background worker. Do
> we see any simple way to avoid this?
>

Avoiding unexpected differences like this is why I suggested the
option should have to be explicitly enabled instead of being on by
default as it is in the current patch. See my review comment #14 [1].
It means the user won't have to change their existing code as a
workaround.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, May 3, 2022 at 5:16 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
...

> Avoiding unexpected differences like this is why I suggested the
> option should have to be explicitly enabled instead of being on by
> default as it is in the current patch. See my review comment #14 [1].
> It means the user won't have to change their existing code as a
> workaround.
>
> ------
> [1] https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com
>

Sorry I was wrong above. It seems this behaviour was already changed
in the latest patch v5 so now the option value 'on' means what it
always did. Thanks!

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, May 3, 2022 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > >
> > > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > > his first mail in this patch? IIUC since the background apply worker
> > > > applies the streamed changes as soon as receiving them from the main
> > > > apply worker, a conflict that doesn't happen in the current streaming
> > > > logical replication could happen.
> > > >
> > >
> > > This patch seems to be waiting for stream_stop to finish, so I don't
> > > see how the issues related to "Transaction dependency" can arise? What
> > > type of conflict/issues you have in mind?
> >
> > Suppose we set both publisher and subscriber:
> >
> > On publisher:
> > create table test (i int);
> > insert into test values (0);
> > create publication test_pub for table test;
> >
> > On subscriber:
> > create table test (i int primary key);
> > create subscription test_sub connection '...' publication test_pub; --
> > value 0 is replicated via initial sync
> >
> > Now, both 'test' tables have value 0.
> >
> > And suppose two concurrent transactions are executed on the publisher
> > in following order:
> >
> > TX-1:
> > begin;
> > insert into test select generate_series(0, 10000); -- changes will be streamed;
> >
> >     TX-2:
> >     begin;
> >     delete from test where c = 0;
> >     commit;
> >
> > TX-1:
> > commit;
> >
> > With the current streaming logical replication, these changes will be
> > applied successfully since the deletion is applied before the
> > (streamed) insertion. Whereas with the apply bgworker, it fails due to
> > an unique constraint violation since the insertion is applied first.
> > I've confirmed that it happens with v5 patch.
> >
>
> Good point but I am not completely sure if doing transaction
> dependency tracking for such cases is really worth it. I feel for such
> concurrent cases users can anyway now also get conflicts, it is just a
> matter of timing. One more thing to check transaction dependency, we
> might need to spill the data for streaming transactions in which case
> we might lose all the benefits of doing it via a background worker. Do
> we see any simple way to avoid this?
>

I think the other kind of problem that can happen here is a delete
followed by an insert. If, in the example provided by you, TX-1
performs the delete (say it is large enough to cause streaming) and TX-2
performs the insert, then I think it will block the apply worker because
the insert will start waiting indefinitely. Currently, I think it will lead
to a conflict due to the insert, but that is still solvable by allowing users
to remove the conflicting rows.

It seems both these problems arise because the table on the
publisher and the subscriber has different constraints; otherwise, we would
have seen the same behavior on the publisher as well.

There could be a few ways to avoid these and similar problems:
a. detect the difference in constraints between the publisher and
subscriber, like the primary key and probably others (like whether there
is any volatile function present in an index expression), when applying
the change, and then give an ERROR to the user that she must change the
streaming mode to 'spill' instead of 'apply' (aka parallel apply).
b. Same as (a), but instead of an ERROR just LOG this information and
change the mode to spill for the transactions that operate on that
particular relation.

I think we can cache this information in LogicalRepRelMapEntry.
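
To illustrate, here is a rough sketch of what the cached check could look like.
The 'parallel_apply_safe' field and the has_subscriber_only_constraints() helper
are made up for this example; LogicalRepRelMapEntry and its 'localrel' field
already exist:

static void
apply_bgworker_relation_check(LogicalRepRelMapEntry *entry)
{
    /*
     * If the subscriber-side relation has constraints that do not exist on
     * the publisher (e.g. a unique index, or an index expression containing
     * a volatile function), applying streamed changes out of commit order
     * can conflict, so remember that this relation must be spilled instead.
     */
    entry->parallel_apply_safe = !has_subscriber_only_constraints(entry);

    if (!entry->parallel_apply_safe)
        ereport(LOG,
                (errmsg("streamed changes for relation \"%s\" will be spilled to disk",
                        RelationGetRelationName(entry->localrel))));
}

The apply worker would then consult the cached flag when deciding whether a
streamed transaction touching this relation can be handed to a bgworker.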

Thoughts?

-- 
With Regards,
Amit Kapila.



Here are my review comments for v5-0001.

I will take a look at the v5-0002 (TAP) patch another time.

======

1. Commit message

The message still refers to "apply background". Should that say "apply
background worker"?

Other parts just call this the "worker". Personally, I think it might
be better to coin some new term for this thing (e.g. "apply-bgworker"
or something like that of your choosing) so then you can just
concisely *always* refer to that everywhere without any ambiguity. E.g.
the same applies to every comment and every message in this patch. They
should all use identical terminology (e.g. "apply-bgworker").

~~~

2. Commit message

"We also need to allow stream_stop to complete by the apply background
to finish it to..."

Wording: ???

~~~

3. Commit message

This patch also extends the subscription streaming option so that user
can control whether apply the streaming transaction in a apply
background or spill the change to disk.

Wording: "user" -> "the user"
Typo: "whether apply" -> "whether to apply"
Typo: "a apply" -> "an apply"

~~~

4. Commit message

User can set the streaming option to 'on/off', 'apply'. For now,
'apply' means the streaming will be applied via a apply background if
available. 'on' means the streaming transaction will be spilled to
disk.


I think "apply" might not be the best choice of values for this
meaning, but I think Hou-san already said [1] that this was being
reconsidered.

~~~

5. doc/src/sgml/catalogs.sgml - formatting

@@ -7863,11 +7863,15 @@ SCRAM-SHA-256$<replaceable><iteration
count></replaceable>:<replaceable>&l

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>substream</structfield> <type>bool</type>
+       <structfield>substream</structfield> <type>char</type>
       </para>
       <para>
-       If true, the subscription will allow streaming of in-progress
-       transactions
+       Controls how to handle the streaming of in-progress transactions.
+       <literal>f</literal> = disallow streaming of in-progress transactions
+       <literal>o</literal> = spill the changes of in-progress transactions to
+       disk and apply at once after the transaction is committed on the
+       publisher.
+       <literal>a</literal> = apply changes directly using a background worker
       </para></entry>
      </row>

Needs to be consistent with other value lists on this page.

5a. The first sentence to end with ":"

5b. List items to end with ","

~~~

6. doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>apply</literal> incoming
+          changes are directly applied via one of the background worker, if
+          available. If no background worker is free to handle streaming
+          transaction then the changes are written to a file and applied after
+          the transaction is committed. Note that if error happen when applying
+          changes in background worker, it might not report the finish LSN of
+          the remote transaction in server log.
          </para>

6a. Typo: "one of the background worker," -> "one of the background workers,"

6b. Wording
BEFORE
Note that if error happen when applying changes in background worker,
it might not report the finish LSN of the remote transaction in server
log.
SUGGESTION
Note that if an error happens when applying changes in a background
worker, it might not report the finish LSN of the remote transaction
in the server log.

~~~

7. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+static char
+defGetStreamingMode(DefElem *def)
+{
+ /*
+ * If no parameter given, assume "true" is meant.
+ */
+ if (def->arg == NULL)
+ return SUBSTREAM_ON;

But is that right? IIUC all the docs said that the default is OFF.

~~~

8. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+ /*
+ * The set of strings accepted here should match up with the
+ * grammar's opt_boolean_or_string production.
+ */
+ if (pg_strcasecmp(sval, "true") == 0 ||
+ pg_strcasecmp(sval, "on") == 0)
+ return SUBSTREAM_ON;
+ if (pg_strcasecmp(sval, "apply") == 0)
+ return SUBSTREAM_APPLY;
+ if (pg_strcasecmp(sval, "false") == 0 ||
+ pg_strcasecmp(sval, "off") == 0)
+ return SUBSTREAM_OFF;

Perhaps should re-order these OFF/ON/APPLY to be consistent with the
T_Integer case above here.

~~~

9. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

The "start new apply background worker ..." function comment feels a
bit misleading now that seems what you are calling this new kind of
worker. E.g. this is also called to start the sync worker. And also
for the apply worker (which we are not really calling a "background
worker" in other places). This comment is the same as [PSv4] #19.

~~~

10. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

@@ -275,6 +280,9 @@ logicalrep_worker_launch(Oid dbid, Oid subid,
const char *subname, Oid userid,
  int nsyncworkers;
  TimestampTz now;

+ /* We don't support table sync in subworker */
+ Assert(!((subworker_dsm != DSM_HANDLE_INVALID) && OidIsValid(relid)));

I think you should declare a new variable like:
bool is_subworker = subworker_dsm != DSM_HANDLE_INVALID;

Then this Assert can be simplified, and also you can re-use the
'is_subworker' later multiple times in this same function to simplify
lots of other code also.
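
For example (just a sketch of what I mean, using the names from the patch):

    bool        is_subworker = (subworker_dsm != DSM_HANDLE_INVALID);

    /* We don't support table sync in an apply bgworker */
    Assert(!(is_subworker && OidIsValid(relid)));

and then the later "subworker_dsm != DSM_HANDLE_INVALID" tests in this function
become just "is_subworker".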

~~~

11. src/backend/replication/logical/launcher.c - logicalrep_worker_stop_internal

+/*
+ * Workhorse for logicalrep_worker_stop() and logicalrep_worker_detach(). Stop
+ * the worker and wait for wait for it to die.
+ */
+static void
+logicalrep_worker_stop_internal(LogicalRepWorker *worker)

Typo: "wait for" is repeated 2x.

~~~

12. src/backend/replication/logical/origin.c - replorigin_session_setup

@@ -1110,7 +1110,11 @@ replorigin_session_setup(RepOriginId node)
  if (curstate->roident != node)
  continue;

- else if (curstate->acquired_by != 0)
+ /*
+ * We allow the apply worker to get the slot which is acquired by its
+ * leader process.
+ */
+ else if (curstate->acquired_by != 0 && acquire)

I still feel this is overly confusing. Shouldn't the comment say "Allow the
apply bgworker to get the slot..."?

Also the parameter name 'acquire' is hard to reconcile with the
comment. E.g. I feel all this would be easier to understand if the
param was refactored with a name like 'bgworker' and the code was
changed to:
else if (curstate->acquired_by != 0 && !bgworker)

Of course, the value true/false would need to be flipped on calls too.
This is the same as my previous comment [PSv4] #26.

~~~

13. src/backend/replication/logical/proto.c

@@ -1138,14 +1138,11 @@ logicalrep_write_stream_commit(StringInfo out,
ReorderBufferTXN *txn,
 /*
  * Read STREAM COMMIT from the output stream.
  */
-TransactionId
+void
 logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
- TransactionId xid;
  uint8 flags;

- xid = pq_getmsgint(in, 4);
-
  /* read flags (unused for now) */
  flags = pq_getmsgbyte(in);

There is something incompatible between the read/write functions here.
The write function writes the txid before the flags, but the read_commit function
does not read it at all; it only reads the flags (???). If this is really
correct then I think there needs to be a comment to explain WHY it
is correct.

NOTE: See also review comment 28 where I proposed another way to write
this code.

~~~

14. src/backend/replication/logical/worker.c - comment

The whole comment is similar to the commit message so any changes
there should be made here also.

~~~

15. src/backend/replication/logical/worker.c - ParallelState

+/*
+ * Shared information among apply workers.
+ */
+typedef struct ParallelState

It looks like there is already another typedef called "ParallelState"
because it is already in the typedefs.list. Maybe this name should be
changed or maybe make it static or something?

~~~

16. src/backend/replication/logical/worker.c - defines

+/*
+ * States for apply background worker.
+ */
+#define APPLY_BGWORKER_ATTACHED 'a'
+#define APPLY_BGWORKER_READY 'r'
+#define APPLY_BGWORKER_BUSY 'b'
+#define APPLY_BGWORKER_FINISHED 'f'
+#define APPLY_BGWORKER_EXIT 'e'

Those char states all look independent. So wouldn’t this be
represented better as an enum to reinforce that fact?
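
e.g. something like this (the type name is my own invention):

typedef enum ApplyBgworkerState
{
    APPLY_BGWORKER_ATTACHED,
    APPLY_BGWORKER_READY,
    APPLY_BGWORKER_BUSY,
    APPLY_BGWORKER_FINISHED,
    APPLY_BGWORKER_EXIT
} ApplyBgworkerState;

One trade-off is that the states would no longer print nicely with %c in the
debug messages, so %d (or a small state-name lookup) would be needed there.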

~~~

17. src/backend/replication/logical/worker.c - functions

+/* Worker setup and interactions */
+static WorkerState *apply_bgworker_setup(void);
+static WorkerState *find_or_start_apply_bgworker(TransactionId xid,
+ bool start);


Maybe rename to apply_bgworker_find_or_start() to match the pattern of
the others?

~~~

18. src/backend/replication/logical/worker.c - macros

+#define am_apply_bgworker() (MyLogicalRepWorker->subworker)
+#define applying_changes_in_bgworker() (in_streamed_transaction &&
stream_apply_worker != NULL)

18a. Somehow I felt these are not in the best place.
- Maybe am_apply_bgworker() should be in worker_internal.h?
- Maybe the applying_changes_in_bgworker() should be nearby the
stream_apply_worker declaration

18b. Maybe applying_changes_in_bgworker should be renamed to something
else to match the pattern of the others (e.g. "apply_bgworker_active"
or something)

~~~

19. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /*
+ * If we decided to apply the changes of this transaction in a apply
+ * background worker, pass the data to the worker.
+ */

Typo: "in a apply" -> "in an apply"

~~~

20. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /*
+ * XXX The publisher side doesn't always send relation update message
+ * after the streaming transaction, so update the relation in main
+ * apply worker here.
+ */

Wording: "doesn't always send relation update message" -> "doesn't
always send relation update messages" ??

~~~

21. src/backend/replication/logical/worker.c - apply_handle_commit_prepared

+ apply_bgworker_set_state(APPLY_BGWORKER_FINISHED);

It seems somewhat confusing to see calls to apply_bgworker_set_state()
when we may or may not even be an apply bgworker.

I know it adds more code, but I somehow feel it is more readable if
all these calls were changed to look like the suggestion below. Please consider it.

SUGGESTION
if (am_apply_bgworker())
apply_bgworker_set_state(XXX);

Then you can also change the apply_bgworker_set_state to
Assert(am_apply_bgworker());


~~~

22. src/backend/replication/logical/worker.c - find_or_start_apply_bgworker

+
+ if (!start && ApplyWorkersHash == NULL)
+ return NULL;
+

IIUC maybe this extra check is not really necessary. I see no harm in
creating the HashTable even if it was called in this state. If the 'start'
flag is false then nothing is going to be found anyway, so it will
return NULL. E.g. we might as well make the code a few lines
shorter/simpler by removing this check.

~~~

23. src/backend/replication/logical/worker.c - apply_bgworker_free

+/*
+ * Add the worker to the freelist and remove the entry from hash table.
+ */
+static void
+apply_bgworker_free(WorkerState *wstate)
+{
+ bool found;
+ MemoryContext oldctx;
+ TransactionId xid = wstate->pstate->stream_xid;

If you are not going to check the value of 'found' then why bother to
pass this param at all? Can't you just pass NULL?
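
i.e. something like the following (assuming the entry is removed by xid, which is
what the function comment suggests):

    /* The result of the lookup isn't needed, so no 'found' variable at all. */
    hash_search(ApplyWorkersHash, &xid, HASH_REMOVE, NULL);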

~~~

24. src/backend/replication/logical/worker.c - apply_bgworker_free

Should there be an Assert that the bgworker state really was FINISHED?
I think I asked this already [PSv4] #48.

~~~

24. src/backend/replication/logical/worker.c - apply_handle_stream_start

@@ -1088,24 +1416,71 @@ apply_handle_stream_prepare(StringInfo s)
  logicalrep_read_stream_prepare(s, &prepare_data);
  set_apply_error_context_xact(prepare_data.xid, prepare_data.prepare_lsn);

- elog(DEBUG1, "received prepare for streamed transaction %u",
prepare_data.xid);
+ /*
+ * If we are in a bgworker, just prepare the transaction.
+ */
+ if (am_apply_bgworker())

We don't need to say "If we are..." because the am_apply_bgworker()
condition already makes it clear this is true.

~~~

25. src/backend/replication/logical/worker.c - apply_handle_stream_start

- if (MyLogicalRepWorker->stream_fileset == NULL)
+ stream_apply_worker = find_or_start_apply_bgworker(stream_xid, first_segment);
+
+ if (applying_changes_in_bgworker())
  {

IIUC this condition seems overkill. I think you can just say if
(stream_apply_worker)

~~~

26. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ if (found)
+ {
+ elog(LOG, "rolled back to savepoint %s", spname);
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);
+ }

Should that elog use the "[Apply BGW #%u]" format like the others for BGW?

~~~

27. src/backend/replication/logical/worker.c - apply_handle_stream_abort

Should this function be setting stream_apply_worker = NULL somewhere
when all is done?

~~~

28. src/backend/replication/logical/worker.c - apply_handle_stream_commit

+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ TransactionId xid;
+
+ if (in_streamed_transaction)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+ xid = pq_getmsgint(s, 4);
+ logicalrep_read_stream_commit(s, &commit_data);
+ set_apply_error_context_xact(xid, commit_data.commit_lsn);

There is something a bit odd about this code. I think the
logicalrep_read_stream_commit() should take another param, and the xid
should be extracted/read only INSIDE that logicalrep_read_stream_commit
function. See also review comment #13.
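
To make the suggestion concrete, here is a sketch. The reading of the commit
fields is assumed to mirror logicalrep_read_commit(); whether to use an out
parameter (as here) or a return value is up to you:

void
logicalrep_read_stream_commit(StringInfo in, TransactionId *xid,
                              LogicalRepCommitData *commit_data)
{
    uint8       flags;

    /* read the xid of the streamed transaction */
    *xid = pq_getmsgint(in, 4);

    /* read flags (unused for now) */
    flags = pq_getmsgbyte(in);
    if (flags != 0)
        elog(ERROR, "unrecognized flags %u in stream commit message", flags);

    /* read the remaining commit fields */
    commit_data->commit_lsn = pq_getmsgint64(in);
    commit_data->end_lsn = pq_getmsgint64(in);
    commit_data->committime = pq_getmsgint64(in);
}

and then apply_handle_stream_commit() simply does
logicalrep_read_stream_commit(s, &xid, &commit_data) instead of calling
pq_getmsgint() itself.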

~~~

29. src/backend/replication/logical/worker.c - apply_handle_stream_commit

I am unsure, but should something be setting the stream_apply_worker =
NULL somewhere when all is done?

~~~

30. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

30a.
+ if (shmq_res != SHM_MQ_SUCCESS)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("lost connection to the main apply worker")));

30b.
+ default:
+ elog(ERROR, "unexpected message");
+ break;

Should both those error messages have the "[Apply BGW #%u]"  prefix
like the other BGW messages?

~~~

31. src/backend/replication/logical/worker.c - ApplyBgwShutdown

+/*
+ * Set the failed flag so that the main apply worker can realize we have
+ * shutdown.
+ */
+static void
+ApplyBgwShutdown(int code, Datum arg)

The comment does not seem to be in sync with the code. E.g.
Wording: "failed flag" -> "exit state" ??

~~~

32. src/backend/replication/logical/worker.c - ApplyBgwShutdown

+/*
+ * Set the failed flag so that the main apply worker can realize we have
+ * shutdown.
+ */
+static void
+ApplyBgwShutdown(int code, Datum arg)

If the 'code' param is deliberately unused it might be better to say
so in the comment...

~~~

33. src/backend/replication/logical/worker.c - LogicalApplyBgwMain

33a.
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("unable to map dynamic shared memory segment")));

33b.
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("bad magic number in dynamic shared memory segment")));
+

33c.
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription %u will not "
+ "start because the subscription was removed during startup",
+ MyLogicalRepWorker->subid)));

Should all these messages have "[Apply BGW ?]" prefix even though they
are not yet attached?

~~~

34. src/backend/replication/logical/worker.c - setup_dsm

+ * We need one key to register the location of the header, and we need
+ * nworkers keys to track the locations of the message queues.
+ */

This comment about 'nworkers' seems stale because that variable no
longer exists.

~~~

35. src/backend/replication/logical/worker.c - apply_bgworker_setup

+/*
+ * Start apply worker background worker process and allocat shared memory for
+ * it.
+ */
+static WorkerState *
+apply_bgworker_setup(void)

typo: "allocat" -> "allocate"

~~~

36. src/backend/replication/logical/worker.c - apply_bgworker_setup

+ elog(LOG, "setting up apply worker #%u", list_length(ApplyWorkersList) + 1)

Should this message have the standard "[Apply BGW %u]" pattern?

~~~

37. src/backend/replication/logical/worker.c - apply_bgworker_setup

+ if (launched)
+ {
+ /* Wait for worker to become ready. */
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_ATTACHED);
+
+ ApplyWorkersList = lappend(ApplyWorkersList, wstate);
+ }

Since there is a state APPLY_BGWORKER_READY I think either this
comment is wrong or this passed parameter ATTACHED must be wrong.

~~~

38. src/backend/replication/logical/worker.c - apply_bgworker_send_data

+ if (result != SHM_MQ_SUCCESS)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not send tuples to shared-memory queue")));
+}

Wording: Is it right to call these "tuples", or is it better to just say
"data"? I am not sure. I already asked this in [PSv4] #68.

~~~

39. src/backend/replication/logical/worker.c - apply_bgworker_wait_for

+/*
+ * Wait until the state of apply background worker reach the 'wait_for_state'
+ */
+static void
+apply_bgworker_wait_for(WorkerState *wstate, char wait_for_state)

typo: "reach" -> "reaches"

~~~

40. src/backend/replication/logical/worker.c - apply_bgworker_wait_for

+ /* If the worker is ready, we have succeeded. */
+ SpinLockAcquire(&wstate->pstate->mutex);
+ status = wstate->pstate->state;
+ SpinLockRelease(&wstate->pstate->mutex);
+
+ if (status == wait_for_state)
+ break;

40a. Why does this comment mention "ready"? This function might be waiting
for a different state than that.

40b. Anyway, I think this comment should be a few lines lower, above
the if (status == wait_for_state)

~~~

41. src/backend/replication/logical/worker.c - apply_bgworker_wait_for

+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Background worker %u failed to apply transaction %u",
+ wstate->pstate->n, wstate->pstate->stream_xid)));

Should this message have the standard "[Apply BGW %u]" pattern?

~~~

42. src/backend/replication/logical/worker.c - check_workers_status

+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Background worker %u exited unexpectedly",
+ wstate->pstate->n)));

Should this message have the standard "[Apply BGW %u]" pattern? Or if
this is just from Apply worker maybe it should be clearer like "Apply
worker detected apply bgworker %u exited unexpectedly".

~~~

43. src/backend/replication/logical/worker.c - check_workers_status

+ ereport(LOG,
+ (errmsg("logical replication apply workers for subscription \"%s\"
will restart",
+ MySubscription->name),
+ errdetail("Cannot start table synchronization while bgworkers are "
+    "handling streamed replication transaction")));

I am not sure, but isn't the message backwards? e.g. Should it say more like:
"Cannot handle streamed transactions using bgworkers while table
synchronization is still in progress".

~~~

44. src/backend/replication/logical/worker.c - apply_bgworker_set_state

+ elog(LOG, "[Apply BGW #%u] set state to %c",
+ MyParallelState->n, state);

The line wrapping seemed overkill here.

~~~

45. src/backend/utils/activity/wait_event.c

@@ -388,6 +388,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
  case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT:
  event_name = "HashGrowBucketsReinsert";
  break;
+ case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
+ event_name = "LogicalApplyWorkerReady";
+ break;

I am not sure this is the best name for this event since the only
place it is used (in apply_bgworker_wait_for) is not only waiting for
READY state. Maybe a name like WAIT_EVENT_LOGICAL_APPLY_BGWORKER or
WAIT_EVENT_LOGICAL_APPLY_WORKER_SYNC would be more appropriate? Need
to change the wait_event.h also.

~~~

46. src/include/catalog/pg_subscription.h

+/* Disallow streaming in-progress transactions */
+#define SUBSTREAM_OFF 'f'
+
+/*
+ * Streaming transactions are written to a temporary file and applied only
+ * after the transaction is committed on upstream.
+ */
+#define SUBSTREAM_ON 'o'
+
+/* Streaming transactions are appied immediately via a background worker */
+#define SUBSTREAM_APPLY 'a'

46a. There is no overarching comment that associates these
#defines back to the new 'stream' field, so you are just left to guess
what they are for.

46b. I also feel that using 'o' for ON is not consistent with the 'f'
of OFF. IMO it is better to use 't/f' for true/false instead of 'o/f'
(see the sketch after 46c). Also don't forget to update the docs, pg_dump.c, etc.

46c. Typo: "appied" -> "applied"
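
For 46b, the shape I have in mind is roughly this (illustration only; the value
letters are just my suggestion):

/* Disallow streaming of in-progress transactions */
#define SUBSTREAM_OFF   'f'

/* Spill the changes of in-progress transactions to a file, apply at commit */
#define SUBSTREAM_ON    't'

/* Apply the changes directly via an apply background worker */
#define SUBSTREAM_APPLY 'a'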

~~~~

47. src/test/regress/expected/subscription.out - missing test

There seem to be missing test cases for the new option values. E.g. where is
the test where the streaming value is set to 'apply'? Same comment as [PSv4]
#81.

------
[1]
https://www.postgresql.org/message-id/OS0PR01MB5716E8D536552467EFB512EF94FC9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[PSv4] https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Apr 29, 2022 at 3:22 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:
...
> Thanks for your patch.
>
> The patch modified streaming option in logical replication, it can be set to
> 'on', 'off' and 'apply'. The new option 'apply' haven't been tested in the tap test.
> Attach a patch which modified the subscription tap test to cover both 'on' and
> 'apply' option. (The main patch is also attached to make cfbot happy.)
>

Here are my review comments for v5-0002 (TAP tests)

Your changes followed a similar pattern of refactoring so most of my
comments below is repeated for all the files.

======

1. Commit message

For the tap tests about streaming option in logical replication, test both
'on' and 'apply' option.

SUGGESTION
Change all TAP tests using the PUBLICATION "streaming" option, so they
now test both 'on' and 'apply' values.

~~~

2. src/test/subscription/t/015_stream.pl

+sub test_streaming
+{

I think the function should have a comment to say that its purpose is
to encapsulate all the common (stream related) test steps so the same
code can be run both for the streaming=on and streaming=apply cases.

~~~

3. src/test/subscription/t/015_stream.pl

+
+# Test streaming mode on

+# Test streaming mode apply

These comments feel too small. IMO they should both be more prominent like:

################################
# Test using streaming mode 'on'
################################

###################################
# Test using streaming mode 'apply'
###################################

~~~

4. src/test/subscription/t/015_stream.pl

+# Test streaming mode apply
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE (a > 2)");
 $node_publisher->wait_for_catchup($appname);

I think those 2 lines do not really belong after the "# Test streaming
mode apply" comment. IIUC they are really just doing cleanup from the
prior test part so I think they should

a) be *above* this comment (and say "# cleanup the test data") or
b) maybe it is best to put all the cleanup lines actually inside the
'test_streaming' function so that the last thing the function does is
clean up after itself.

option b seems tidier to me.

~~~

5. src/test/subscription/t/016_stream_subxact.pl

sub test_streaming should be commented. (same as comment #2)

~~~

6. src/test/subscription/t/016_stream_subxact.pl

The comments for the different streaming nodes should be more
prominent. (same as comment #3)

~~~

7. src/test/subscription/t/016_stream_subxact.pl

+# Test streaming mode apply
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE (a > 2)");
 $node_publisher->wait_for_catchup($appname);

These don't seem to belong here. They are clean up from the prior
test. (same as comment #4)

~~~

8. src/test/subscription/t/017_stream_ddl.pl

sub test_streaming should be commented. (same as comment #2)

~~~

9. src/test/subscription/t/017_stream_ddl.pl

The comments for the different streaming nodes should be more
prominent. (same as comment #3)

~~~

10. src/test/subscription/t/017_stream_ddl.pl

+# Test streaming mode apply
 $node_publisher->safe_psql(
  'postgres', q{
-BEGIN;
-INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
-ALTER TABLE test_tab ADD COLUMN e INT;
-SAVEPOINT s1;
-INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
-COMMIT;
+DELETE FROM test_tab WHERE (a > 2);
+ALTER TABLE test_tab DROP COLUMN c, DROP COLUMN d, DROP COLUMN e,
DROP COLUMN f;
 });

 $node_publisher->wait_for_catchup($appname);

These don't seem to belong here. They are clean up from the prior
test. (same as comment #4)

~~~

11. .../t/018_stream_subxact_abort.pl

sub test_streaming should be commented. (same as comment #2)

~~~

12. .../t/018_stream_subxact_abort.pl

The comments for the different streaming nodes should be more
prominent. (same as comment #3)

~~~

13. .../t/018_stream_subxact_abort.pl

+# Test streaming mode apply
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE (a > 2)");
 $node_publisher->wait_for_catchup($appname);

These don't seem to belong here. They are clean up from the prior
test. (same as comment #4)

~~~

14. .../t/019_stream_subxact_ddl_abort.pl

sub test_streaming should be commented. (same as comment #2)

~~~

15. .../t/019_stream_subxact_ddl_abort.pl

The comments for the different streaming nodes should be more
prominent. (same as comment #3)

~~~

16. .../t/019_stream_subxact_ddl_abort.pl

+test_streaming($node_publisher, $node_subscriber, $appname);
+
+# Test streaming mode apply
 $node_publisher->safe_psql(
  'postgres', q{
-BEGIN;
-INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
-ALTER TABLE test_tab ADD COLUMN c INT;
-SAVEPOINT s1;
-INSERT INTO test_tab SELECT i, md5(i::text), -i FROM
generate_series(501,1000) s(i);
-ALTER TABLE test_tab ADD COLUMN d INT;
-SAVEPOINT s2;
-INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM
generate_series(1001,1500) s(i);
-ALTER TABLE test_tab ADD COLUMN e INT;
-SAVEPOINT s3;
-INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM
generate_series(1501,2000) s(i);
+DELETE FROM test_tab WHERE (a > 2);
 ALTER TABLE test_tab DROP COLUMN c;
-ROLLBACK TO s1;
-INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(501,1000) s(i);
-COMMIT;
 });
-
 $node_publisher->wait_for_catchup($appname);

These don't seem to belong here. They are clean up from the prior
test. (same as comment #4)

~~~

17. .../subscription/t/022_twophase_cascade.

+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+sub test_streaming
+{

I think maybe that 2PC comment should not have been moved. IMO it
belongs in the main test body...

~~~

18. .../subscription/t/022_twophase_cascade.

sub test_streaming should be commented. (same as comment #2)

~~~

19. .../subscription/t/022_twophase_cascade.

+sub test_streaming
+{
+ my ($node_A, $node_B, $node_C, $appname_B, $appname_C, $streaming) = @_;

If you called that '$streaming' param something more like
'$streaming_mode' it would read better I think.

~~~

20. .../subscription/t/023_twophase_stream.pl

sub test_streaming should be commented. (same as comment #2)

~~~

21. .../subscription/t/023_twophase_stream.pl

The comments for the different streaming nodes should be more
prominent. (same as comment #3)

~~~

22. .../subscription/t/023_twophase_stream.pl

+# Test streaming mode apply
 $node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
-
-# Then insert, update and delete enough rows to exceed the 64kB limit.
-$node_publisher->safe_psql('postgres', q{
- BEGIN;
- INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
- UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
- DELETE FROM test_tab WHERE mod(a,3) = 0;
- PREPARE TRANSACTION 'test_prepared_tab';});
-
-$node_publisher->wait_for_catchup($appname);
-
-# check that transaction is in prepared state on subscriber
-$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
-is($result, qq(1), 'transaction is prepared on subscriber');
-
-# 2PC transaction gets aborted
-$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED
'test_prepared_tab';");
-
 $node_publisher->wait_for_catchup($appname);

These don't seem to belong here. They are clean up from the prior
test. (same as comment #4)

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Wed, May 4, 2022 at 12:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 3, 2022 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > > > his first mail in this patch? IIUC since the background apply worker
> > > > > applies the streamed changes as soon as receiving them from the main
> > > > > apply worker, a conflict that doesn't happen in the current streaming
> > > > > logical replication could happen.
> > > > >
> > > >
> > > > This patch seems to be waiting for stream_stop to finish, so I don't
> > > > see how the issues related to "Transaction dependency" can arise? What
> > > > type of conflict/issues you have in mind?
> > >
> > > Suppose we set both publisher and subscriber:
> > >
> > > On publisher:
> > > create table test (i int);
> > > insert into test values (0);
> > > create publication test_pub for table test;
> > >
> > > On subscriber:
> > > create table test (i int primary key);
> > > create subscription test_sub connection '...' publication test_pub; --
> > > value 0 is replicated via initial sync
> > >
> > > Now, both 'test' tables have value 0.
> > >
> > > And suppose two concurrent transactions are executed on the publisher
> > > in following order:
> > >
> > > TX-1:
> > > begin;
> > > insert into test select generate_series(0, 10000); -- changes will be streamed;
> > >
> > >     TX-2:
> > >     begin;
> > >     delete from test where c = 0;
> > >     commit;
> > >
> > > TX-1:
> > > commit;
> > >
> > > With the current streaming logical replication, these changes will be
> > > applied successfully since the deletion is applied before the
> > > (streamed) insertion. Whereas with the apply bgworker, it fails due to
> > > an unique constraint violation since the insertion is applied first.
> > > I've confirmed that it happens with v5 patch.
> > >
> >
> > Good point but I am not completely sure if doing transaction
> > dependency tracking for such cases is really worth it. I feel for such
> > concurrent cases users can anyway now also get conflicts, it is just a
> > matter of timing. One more thing to check transaction dependency, we
> > might need to spill the data for streaming transactions in which case
> > we might lose all the benefits of doing it via a background worker. Do
> > we see any simple way to avoid this?
> >

I agree that it is just a matter of timing. I think new issues that
don't happen with the current streaming logical replication could,
depending on the timing, happen with this feature, and vice versa.

>
> I think the other kind of problem that can happen here is delete
> followed by an insert. If in the example provided by you, TX-1
> performs delete (say it is large enough to cause streaming) and TX-2
> performs insert then I think it will block the apply worker because
> insert will start waiting infinitely. Currently, I think it will lead
> to conflict due to insert but that is still solvable by allowing users
> to remove conflicting rows.
>
> It seems both these problems are due to the reason that the table on
> publisher and subscriber has different constraints otherwise, we would
> have seen the same behavior on the publisher as well.
>
> There could be a few ways to avoid these and similar problems:
> a. detect the difference in constraints between publisher and
> subscribers like primary key and probably others (like whether there
> is any volatile function present in index expression) when applying
> the change and then we give ERROR to the user that she must change the
> streaming mode to 'spill' instead of 'apply' (aka parallel apply).
> b. Same as (a) but instead of ERROR just LOG this information and
> change the mode to spill for the transactions that operate on that
> particular relation.

Given that it doesn't introduce a new kind of problem, I don't think we
need special treatment for that, at least in this feature. If we want
such modes we can discuss them separately.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, May 4, 2022 at 8:44 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, May 3, 2022 at 5:16 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> ...
>
> > Avoiding unexpected differences like this is why I suggested the
> > option should have to be explicitly enabled instead of being on by
> > default as it is in the current patch. See my review comment #14 [1].
> > It means the user won't have to change their existing code as a
> > workaround.
> >
> > ------
> > [1] https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com
> >
>
> Sorry I was wrong above. It seems this behaviour was already changed
> in the latest patch v5 so now the option value 'on' means what it
> always did. Thanks!

Having it optional seems a good idea. BTW, can the user configure how
many apply bgworkers can be used per subscription or in the whole
system? Like max_sync_workers_per_subscription, would it be better to
have a configuration parameter or a subscription option for that? If
so, setting it to 0 would probably mean disabling the parallel apply
feature.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, May 4, 2022 at 8:44 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Tue, May 3, 2022 at 5:16 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > ...
> >
> > > Avoiding unexpected differences like this is why I suggested the
> > > option should have to be explicitly enabled instead of being on by
> > > default as it is in the current patch. See my review comment #14 [1].
> > > It means the user won't have to change their existing code as a
> > > workaround.
> > >
> > > ------
> > > [1]
https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com
> > >
> >
> > Sorry I was wrong above. It seems this behaviour was already changed
> > in the latest patch v5 so now the option value 'on' means what it
> > always did. Thanks!
>
> Having it optional seems a good idea. BTW can the user configure how
> many apply bgworkers can be used per subscription or in the whole
> system? Like max_sync_workers_per_subscription, is it better to have a
> configuration parameter or a subscription option for that? If so,
> setting it to 0 probably means to disable the parallel apply feature.
>

Yeah, that might be useful, but we are already giving an option while
creating a subscription whether to allow parallelism, so will it be
useful to give one more way to disable this feature? OTOH, having
something like max_parallel_apply_workers/max_bg_apply_workers at the
system level can give better control over how much parallelism the user
wishes to allow for apply work. If we have such a new parameter then I
think max_logical_replication_workers should cover apply workers,
parallel apply workers, and table synchronization workers. In such a
case, don't we need to think of increasing the default value of
max_logical_replication_workers?
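
Just to put rough numbers on it (only an illustration, not a concrete
proposal): with the current default of max_logical_replication_workers = 4,
one subscription running a leader apply worker plus two table
synchronization workers would leave room for only a single parallel apply
worker, so we might need something like:

ALTER SYSTEM SET max_logical_replication_workers = 8;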

-- 
With Regards,
Amit Kapila.



On Tue, May 10, 2022 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, May 4, 2022 at 12:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, May 3, 2022 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > > > > his first mail in this patch? IIUC since the background apply worker
> > > > > > applies the streamed changes as soon as receiving them from the main
> > > > > > apply worker, a conflict that doesn't happen in the current streaming
> > > > > > logical replication could happen.
> > > > > >
> > > > >
> > > > > This patch seems to be waiting for stream_stop to finish, so I don't
> > > > > see how the issues related to "Transaction dependency" can arise? What
> > > > > type of conflict/issues you have in mind?
> > > >
> > > > Suppose we set both publisher and subscriber:
> > > >
> > > > On publisher:
> > > > create table test (i int);
> > > > insert into test values (0);
> > > > create publication test_pub for table test;
> > > >
> > > > On subscriber:
> > > > create table test (i int primary key);
> > > > create subscription test_sub connection '...' publication test_pub; --
> > > > value 0 is replicated via initial sync
> > > >
> > > > Now, both 'test' tables have value 0.
> > > >
> > > > And suppose two concurrent transactions are executed on the publisher
> > > > in following order:
> > > >
> > > > TX-1:
> > > > begin;
> > > > insert into test select generate_series(0, 10000); -- changes will be streamed;
> > > >
> > > >     TX-2:
> > > >     begin;
> > > >     delete from test where c = 0;
> > > >     commit;
> > > >
> > > > TX-1:
> > > > commit;
> > > >
> > > > With the current streaming logical replication, these changes will be
> > > > applied successfully since the deletion is applied before the
> > > > (streamed) insertion. Whereas with the apply bgworker, it fails due to
> > > > an unique constraint violation since the insertion is applied first.
> > > > I've confirmed that it happens with v5 patch.
> > > >
> > >
> > > Good point but I am not completely sure if doing transaction
> > > dependency tracking for such cases is really worth it. I feel for such
> > > concurrent cases users can anyway now also get conflicts, it is just a
> > > matter of timing. One more thing to check transaction dependency, we
> > > might need to spill the data for streaming transactions in which case
> > > we might lose all the benefits of doing it via a background worker. Do
> > > we see any simple way to avoid this?
> > >
>
> I agree that it is just a matter of timing. I think new issues that
> haven't happened on the current streaming logical replication
> depending on the timing could happen with this feature and vice versa.
>

Here by vice versa, do you mean some problems that can happen with the
current code won't happen with the new implementation? If so, can you
give one such example?

> >
> > I think the other kind of problem that can happen here is delete
> > followed by an insert. If in the example provided by you, TX-1
> > performs delete (say it is large enough to cause streaming) and TX-2
> > performs insert then I think it will block the apply worker because
> > insert will start waiting infinitely. Currently, I think it will lead
> > to conflict due to insert but that is still solvable by allowing users
> > to remove conflicting rows.
> >
> > It seems both these problems are due to the reason that the table on
> > publisher and subscriber has different constraints otherwise, we would
> > have seen the same behavior on the publisher as well.
> >
> > There could be a few ways to avoid these and similar problems:
> > a. detect the difference in constraints between publisher and
> > subscribers like primary key and probably others (like whether there
> > is any volatile function present in index expression) when applying
> > the change and then we give ERROR to the user that she must change the
> > streaming mode to 'spill' instead of 'apply' (aka parallel apply).
> > b. Same as (a) but instead of ERROR just LOG this information and
> > change the mode to spill for the transactions that operate on that
> > particular relation.
>
> Given that it doesn't introduce a new kind of problem I don't think we
> need special treatment for that at least in this feature.
>

Isn't the problem related to the infinite wait by the insert, as
explained in my previous email (in the above-quoted text), a new kind
of problem that doesn't exist in the current implementation?
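
To make the sequence concrete, what I have in mind is roughly the mirror
image of your example (a sketch only; as before, the primary key exists
only on the subscriber):

TX-1:
begin;
delete from test; -- large enough to cause streaming

    TX-2:
    begin;
    insert into test values (0);
    commit;

TX-1:
commit;

The background worker applying TX-1's streamed delete has not committed
yet, so I think TX-2's insert, which the leader must apply first to keep
the commit order, can end up waiting on that in-progress delete forever.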

-- 
With Regards,
Amit Kapila.



On Tue, May 10, 2022 at 6:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 10, 2022 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, May 4, 2022 at 12:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, May 3, 2022 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 2, 2022 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, May 2, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Are you planning to support "Transaction dependency" Amit mentioned in
> > > > > > > his first mail in this patch? IIUC since the background apply worker
> > > > > > > applies the streamed changes as soon as receiving them from the main
> > > > > > > apply worker, a conflict that doesn't happen in the current streaming
> > > > > > > logical replication could happen.
> > > > > > >
> > > > > >
> > > > > > This patch seems to be waiting for stream_stop to finish, so I don't
> > > > > > see how the issues related to "Transaction dependency" can arise? What
> > > > > > type of conflict/issues you have in mind?
> > > > >
> > > > > Suppose we set both publisher and subscriber:
> > > > >
> > > > > On publisher:
> > > > > create table test (i int);
> > > > > insert into test values (0);
> > > > > create publication test_pub for table test;
> > > > >
> > > > > On subscriber:
> > > > > create table test (i int primary key);
> > > > > create subscription test_sub connection '...' publication test_pub; --
> > > > > value 0 is replicated via initial sync
> > > > >
> > > > > Now, both 'test' tables have value 0.
> > > > >
> > > > > And suppose two concurrent transactions are executed on the publisher
> > > > > in following order:
> > > > >
> > > > > TX-1:
> > > > > begin;
> > > > > insert into test select generate_series(0, 10000); -- changes will be streamed;
> > > > >
> > > > >     TX-2:
> > > > >     begin;
> > > > >     delete from test where c = 0;
> > > > >     commit;
> > > > >
> > > > > TX-1:
> > > > > commit;
> > > > >
> > > > > With the current streaming logical replication, these changes will be
> > > > > applied successfully since the deletion is applied before the
> > > > > (streamed) insertion. Whereas with the apply bgworker, it fails due to
> > > > > an unique constraint violation since the insertion is applied first.
> > > > > I've confirmed that it happens with v5 patch.
> > > > >
> > > >
> > > > Good point but I am not completely sure if doing transaction
> > > > dependency tracking for such cases is really worth it. I feel for such
> > > > concurrent cases users can anyway now also get conflicts, it is just a
> > > > matter of timing. One more thing to check transaction dependency, we
> > > > might need to spill the data for streaming transactions in which case
> > > > we might lose all the benefits of doing it via a background worker. Do
> > > > we see any simple way to avoid this?
> > > >
> >
> > I agree that it is just a matter of timing. I think new issues that
> > haven't happened on the current streaming logical replication
> > depending on the timing could happen with this feature and vice versa.
> >
>
> Here by vice versa, do you mean some problems that can happen with
> current code won't happen after new implementation? If so, can you
> give one such example?
>
> > >
> > > I think the other kind of problem that can happen here is delete
> > > followed by an insert. If in the example provided by you, TX-1
> > > performs delete (say it is large enough to cause streaming) and TX-2
> > > performs insert then I think it will block the apply worker because
> > > insert will start waiting infinitely. Currently, I think it will lead
> > > to conflict due to insert but that is still solvable by allowing users
> > > to remove conflicting rows.
> > >
> > > It seems both these problems are due to the reason that the table on
> > > publisher and subscriber has different constraints otherwise, we would
> > > have seen the same behavior on the publisher as well.
> > >
> > > There could be a few ways to avoid these and similar problems:
> > > a. detect the difference in constraints between publisher and
> > > subscribers like primary key and probably others (like whether there
> > > is any volatile function present in index expression) when applying
> > > the change and then we give ERROR to the user that she must change the
> > > streaming mode to 'spill' instead of 'apply' (aka parallel apply).
> > > b. Same as (a) but instead of ERROR just LOG this information and
> > > change the mode to spill for the transactions that operate on that
> > > particular relation.
> >
> > Given that it doesn't introduce a new kind of problem I don't think we
> > need special treatment for that at least in this feature.
> >
>
> Isn't the problem related to infinite wait by insert as explained in
> my previous email (in the above-quoted text) a new kind of problem
> that won't exist in the current implementation?
>

Sorry, I had completely missed the point that the commit order won't be
changed. I agree that this new implementation would introduce a new
kind of issue as you mentioned above, and the opposite is not true.

Regarding the case you explained in the previous email, I also think it
will happen with the parallel apply feature. The apply worker will be
blocked until the conflict is resolved, and I'm not sure how to avoid
that. It would not be easy to compare constraints between the publisher
and subscriber when replicating partitioned tables.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Tue, May 10, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, May 4, 2022 at 8:44 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Tue, May 3, 2022 at 5:16 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > ...
> > >
> > > > Avoiding unexpected differences like this is why I suggested the
> > > > option should have to be explicitly enabled instead of being on by
> > > > default as it is in the current patch. See my review comment #14 [1].
> > > > It means the user won't have to change their existing code as a
> > > > workaround.
> > > >
> > > > ------
> > > > [1]
https://www.postgresql.org/message-id/CAHut%2BPuqYP5eD5wcSCtk%3Da6KuMjat2UCzqyGoE7sieCaBsVskQ%40mail.gmail.com
> > > >
> > >
> > > Sorry I was wrong above. It seems this behaviour was already changed
> > > in the latest patch v5 so now the option value 'on' means what it
> > > always did. Thanks!
> >
> > Having it optional seems a good idea. BTW can the user configure how
> > many apply bgworkers can be used per subscription or in the whole
> > system? Like max_sync_workers_per_subscription, is it better to have a
> > configuration parameter or a subscription option for that? If so,
> > setting it to 0 probably means to disable the parallel apply feature.
> >
>
> Yeah, that might be useful but we are already giving an option while
> creating a subscription whether to allow parallelism, so will it be
> useful to give one more way to disable this feature? OTOH, having
> something like max_parallel_apply_workers/max_bg_apply_workers at the
> system level can give better control for how much parallelism the user
> wishes to allow for apply work.

Or we can have something like
max_parallel_apply_workers_per_subscription that controls how many
parallel apply workers can be launched per subscription. That also
gives better control over the number of parallel apply workers.

> If we have such a new parameter then I
> think max_logical_replication_workers should include apply workers,
> parallel apply workers, and table synchronization?

Agreed.

>  In such a case,
> don't we need to think of increasing the default value of
> max_logical_replication_workers?

I think we would need to think about that if parallel apply were
enabled by default, but given that it's disabled by default, I'm fine
with the current default value.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, May 11, 2022 at 9:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 10, 2022 at 6:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, May 10, 2022 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, May 4, 2022 at 12:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > I think the other kind of problem that can happen here is delete
> > > > followed by an insert. If in the example provided by you, TX-1
> > > > performs delete (say it is large enough to cause streaming) and TX-2
> > > > performs insert then I think it will block the apply worker because
> > > > insert will start waiting infinitely. Currently, I think it will lead
> > > > to conflict due to insert but that is still solvable by allowing users
> > > > to remove conflicting rows.
> > > >
> > > > It seems both these problems are due to the reason that the table on
> > > > publisher and subscriber has different constraints otherwise, we would
> > > > have seen the same behavior on the publisher as well.
> > > >
> > > > There could be a few ways to avoid these and similar problems:
> > > > a. detect the difference in constraints between publisher and
> > > > subscribers like primary key and probably others (like whether there
> > > > is any volatile function present in index expression) when applying
> > > > the change and then we give ERROR to the user that she must change the
> > > > streaming mode to 'spill' instead of 'apply' (aka parallel apply).
> > > > b. Same as (a) but instead of ERROR just LOG this information and
> > > > change the mode to spill for the transactions that operate on that
> > > > particular relation.
> > >
> > > Given that it doesn't introduce a new kind of problem I don't think we
> > > need special treatment for that at least in this feature.
> > >
> >
> > Isn't the problem related to infinite wait by insert as explained in
> > my previous email (in the above-quoted text) a new kind of problem
> > that won't exist in the current implementation?
> >
>
> Sorry I had completely missed the point that the commit order won't be
> changed. I agree that this new implementation would introduce a new
> kind of issue as you mentioned above, and the opposite is not true.
>
> Regarding the case you explained in the previous email I also think it
> will happen with the parallel apply feature. The apply worker will be
> blocked until the conflict is resolved. I'm not sure how to avoid
> that. It would be not easy to compare constraints between publisher
> and subscribers when replicating partitioning tables.
>

I agree that partitioned tables need some more thought, but in some
simple cases where replication happens via the individual partition
tables (the default), we can detect the difference as we do for normal
tables. OTOH, when replication happens via the root
(publish_via_partition_root) it could be tricky, as the partitions
could be different on both sides. I think the cases where we can't
safely identify the constraint difference won't be considered for apply
via a new bg worker.
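
As a rough sketch of the tricky case (hypothetical tables, just for
illustration):

-- publisher
create table parted (a int) partition by range (a);
create table parted_p1 partition of parted for values from (0) to (100);
create publication pub for table parted with (publish_via_partition_root = true);

-- subscriber
create table parted (a int primary key) partition by range (a);
create table parted_p1 partition of parted for values from (0) to (50);
create table parted_p2 partition of parted for values from (50) to (100);

Because the changes are published using the root table's identity, there
is no one-to-one mapping of partitions whose constraints we could compare
on the two sides.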

-- 
With Regards,
Amit Kapila.



On Wed, May 11, 2022 at 9:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 10, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Having it optional seems a good idea. BTW can the user configure how
> > > many apply bgworkers can be used per subscription or in the whole
> > > system? Like max_sync_workers_per_subscription, is it better to have a
> > > configuration parameter or a subscription option for that? If so,
> > > setting it to 0 probably means to disable the parallel apply feature.
> > >
> >
> > Yeah, that might be useful but we are already giving an option while
> > creating a subscription whether to allow parallelism, so will it be
> > useful to give one more way to disable this feature? OTOH, having
> > something like max_parallel_apply_workers/max_bg_apply_workers at the
> > system level can give better control for how much parallelism the user
> > wishes to allow for apply work.
>
> Or we can have something like
> max_parallel_apply_workers_per_subscription that controls how many
> parallel apply workers can launch per subscription. That also gives
> better control for the number of parallel apply workers.
>

I think we can go either way in this matter as both have their pros
and cons. I feel limiting the parallel workers per subscription gives
better control, but OTOH it may not allow maximum usage of parallelism
because some quota from other subscriptions might remain unused. Let
us see what Hou-San or others think on this matter.
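
To spell out the trade-off I mean: with a per-subscription limit of 2
parallel apply workers and two subscriptions, a busy subscription could
never use more than 2 workers even while the other subscription sits
idle, whereas a single system-wide limit of 4 would let it use all 4.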

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, May 5, 2022 1:46 PM Peter Smith <smithpb2250@gmail.com> wrote:

> Here are my review comments for v5-0001.
> I will take a look at the v5-0002 (TAP) patch another time.

Thanks for the comments!

> 4. Commit message
> 
> User can set the streaming option to 'on/off', 'apply'. For now,
> 'apply' means the streaming will be applied via a apply background if
> available. 'on' means the streaming transaction will be spilled to
> disk.
> 
> 
> I think "apply" might not be the best choice of values for this
> meaning, but I think Hou-san already said [1] that this was being
> reconsidered.

Yes, I am thinking over this along with some other related suggestions [1]
posted by Amit and Sawada-san. I will change this in the next version.

[1] https://www.postgresql.org/message-id/flat/CAA4eK1%2B7D4qAQUQEE8zzQ0fGCqeBWd3rzTaY5N0jVs-VXFc_Xw%40mail.gmail.com

> 7. src/backend/commands/subscriptioncmds.c - defGetStreamingMode
> 
> +static char
> +defGetStreamingMode(DefElem *def)
> +{
> + /*
> + * If no parameter given, assume "true" is meant.
> + */
> + if (def->arg == NULL)
> + return SUBSTREAM_ON;
> 
> But is that right? IIUC all the docs said that the default is OFF.

I think it's right. "arg == NULL" means the user specified the streaming option
without a value, like CREATE SUBSCRIPTION xxx WITH (streaming). The value should
be 'on' in this case.
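
To be clear, the cases I mean are the following (a rough sketch using a
dummy subscription; if streaming is not specified at all, the default is
still off):

CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub WITH (streaming);           -- same as streaming = on
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub WITH (streaming = on);
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub WITH (streaming = 'apply'); -- new value added by this patch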


> 12. src/backend/replication/logical/origin.c - replorigin_session_setup
> 
> @@ -1110,7 +1110,11 @@ replorigin_session_setup(RepOriginId node)
>   if (curstate->roident != node)
>   continue;
> 
> - else if (curstate->acquired_by != 0)
> + /*
> + * We allow the apply worker to get the slot which is acquired by its
> + * leader process.
> + */
> + else if (curstate->acquired_by != 0 && acquire)
> 
> I still feel this is overly-cofusing. Shouldn't comment say "Allow the
> apply bgworker to get the slot...".
> 
> Also the parameter name 'acquire' is hard to reconcile with the
> comment. E.g. I feel all this would be easier to understand if the
> param was  was refactored with a name like 'bgworker' and the code was
> changed to:
> else if (curstate->acquired_by != 0 && !bgworker)
> 
> Of course, the value true/false would need to be flipped on calls too.
> This is the same as my previous comment [PSv4] #26.

I feel it's not a good idea to mention the bgworker in origin.c. I have removed
this comment and added some other comments in worker.c.

> 26. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> + if (found)
> + {
> + elog(LOG, "rolled back to savepoint %s", spname);
> + RollbackToSavepoint(spname);
> + CommitTransactionCommand();
> + subxactlist = list_truncate(subxactlist, i + 1);
> + }
> 
> Should that elog use the "[Apply BGW #%u]" format like the others for BGW?

I feel the "[Apply BGW #%u]" is a bit hacky and some of them comes from the old
patchset. I will recheck these logs and adjust them and change some log
level in next version.

> 27. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> Should this function be setting stream_apply_worker = NULL somewhere
> when all is done?
> 29. src/backend/replication/logical/worker.c - apply_handle_stream_commit
> 
> I am unsure, but should something be setting the stream_apply_worker =
> NULL somewhere when all is done?

I think the worker is already set to NULL in apply_handle_stream_stop.


> 32. src/backend/replication/logical/worker.c - ApplyBgwShutdown
> 
> +/*
> + * Set the failed flag so that the main apply worker can realize we have
> + * shutdown.
> + */
> +static void
> +ApplyBgwShutdown(int code, Datum arg)
> 
> If the 'code' param is deliberately unused it might be better to say
> so in the comment...

Not sure about this. After searching the code, I think most of the callback
functions neither use nor add comments for the 'code' param.


> 45. src/backend/utils/activity/wait_event.c
> 
> @@ -388,6 +388,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
>   case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT:
>   event_name = "HashGrowBucketsReinsert";
>   break;
> + case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
> + event_name = "LogicalApplyWorkerReady";
> + break;
> 
> I am not sure this is the best name for this event since the only
> place it is used (in apply_bgworker_wait_for) is not only waiting for
> READY state. Maybe a name like WAIT_EVENT_LOGICAL_APPLY_BGWORKER or
> WAIT_EVENT_LOGICAL_APPLY_WORKER_SYNC would be more appropriate? Need
> to change the wait_event.h also.

I noticed a similarly named "WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE", so I changed
this to WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE.

> 47. src/test/regress/expected/subscription.out - missting test
> 
> Missing some test cases for all new option values? E.g. Where is the
> test using streaming value is set to 'apply'. Same comment as [PSv4]
> #81

The new option is tested in the second patch posted by Shi yu.

I addressed the other comments from Peter and the 2PC-related comment from Shi.
Here is the new version of the patch.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, May 11, 2022 1:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, May 11, 2022 at 9:35 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, May 10, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Having it optional seems a good idea. BTW can the user configure
> > > > how many apply bgworkers can be used per subscription or in the
> > > > whole system? Like max_sync_workers_per_subscription, is it better
> > > > to have a configuration parameter or a subscription option for
> > > > that? If so, setting it to 0 probably means to disable the parallel apply
> feature.
> > > >
> > >
> > > Yeah, that might be useful but we are already giving an option while
> > > creating a subscription whether to allow parallelism, so will it be
> > > useful to give one more way to disable this feature? OTOH, having
> > > something like max_parallel_apply_workers/max_bg_apply_workers at
> > > the system level can give better control for how much parallelism
> > > the user wishes to allow for apply work.
> >
> > Or we can have something like
> > max_parallel_apply_workers_per_subscription that controls how many
> > parallel apply workers can launch per subscription. That also gives
> > better control for the number of parallel apply workers.
> >
> 
> I think we can go either way in this matter as both have their pros and cons. I
> feel limiting the parallel workers per subscription gives better control but
> OTOH, it may not allow max usage of parallelism because some quota from
> other subscriptions might remain unused. Let us see what Hou-San or others
> think on this matter?

Thanks for Amit and Sawada-san's comments!
I will think over these approaches and reply soon.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Fri, May 6, 2022 4:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for v5-0002 (TAP tests)
> 
> Your changes followed a similar pattern of refactoring so most of my
> comments below is repeated for all the files.
> 

Thanks for your comments.

> ======
> 
> 1. Commit message
> 
> For the tap tests about streaming option in logical replication, test both
> 'on' and 'apply' option.
> 
> SUGGESTION
> Change all TAP tests using the PUBLICATION "streaming" option, so they
> now test both 'on' and 'apply' values.
> 

OK. But "streaming" is a subscription option, so I modified it to:
Change all TAP tests using the SUBSCRIPTION "streaming" option, so they
now test both 'on' and 'apply' values.

> ~~~
> 
> 4. src/test/subscription/t/015_stream.pl
> 
> +# Test streaming mode apply
> +$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE (a > 2)");
>  $node_publisher->wait_for_catchup($appname);
> 
> I think those 2 lines do not really belong after the "# Test streaming
> mode apply" comment. IIUC they are really just doing cleanup from the
> prior test part so I think they should
> 
> a) be *above* this comment (and say "# cleanup the test data") or
> b) maybe it is best to put all the cleanup lines actually inside the
> 'test_streaming' function so that the last thing the function does is
> clean up after itself.
> 
> option b seems tidier to me.
> 

I also think option b seems better, so I put them inside test_streaming().

The rest of the comments are fixed as suggested.

Besides, I noticed that we didn't free the background worker after preparing a
transaction in the main patch, so I made some small changes to fix it.

Attach the updated patches.

Regards,
Shi yu

"Here are my review comments for v6-0001.

======

1. General

I saw that now in most places you are referring to the new kind of
worker as the "apply background worker". But there are a few comments
remaining that still refer to "bgworker". Please search the entire
patch for "bgworker" in the comments and replace them with "apply
background worker".

======

2. Commit message

We also need to allow stream_stop to complete by the
apply background worker to finish it to avoid deadlocks because T-1's current
stream of changes can update rows in conflicting order with T-2's next stream
of changes.

Something is not right with this wording: "to complete by the apply
background worker to finish it...".

Maybe just omit the words "to finish it" (??).

~~~

3. Commit message

This patch also extends the subscription streaming option so that...

SUGGESTION
This patch also extends the SUBSCRIPTION 'streaming' option so that...

======

4. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+/*
+ * Extract the streaming mode value from a DefElem.  This is like
+ * defGetBoolean() but also accepts the special value and "apply".
+ */
+static char
+defGetStreamingMode(DefElem *def)

Typo: "special value and..." -> "special value of..."

======

5. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

+
+ if (subworker_dsm == DSM_HANDLE_INVALID)
+ snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyWorkerMain");
+ else
+ snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyBgworkerMain");
+
+

5a.
This condition should be using the new 'is_subworker' bool

5b.
Double blank lines?

~~~

6. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

- else
+ else if (subworker_dsm == DSM_HANDLE_INVALID)
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u", subid);
+ else
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication apply worker for subscription %u", subid);
  snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication worker");

This condition also should be using the new 'is_subworker' bool

~~~

7. src/backend/replication/logical/launcher.c - logicalrep_worker_stop_internal

+
+ Assert(LWLockHeldByMe(LogicalRepWorkerLock));
+

I think there should be a comment here to say that this lock is
required/expected to be released by the caller of this function.

======

8. src/backend/replication/logical/origin.c - replorigin_session_setup

@@ -1068,7 +1068,7 @@ ReplicationOriginExitCleanup(int code, Datum arg)
  * with replorigin_session_reset().
  */
 void
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, bool acquire)
 {

This function has been problematic for several reviews. I saw that you
removed the previously confusing comment but I still feel some kind of
explanation is needed for the vague 'acquire' parameter. OTOH perhaps
if you just change the param name to 'must_acquire' then I think it
would be self-explanatory.

======

9. src/backend/replication/logical/worker.c - General

Some of the logs have a prefix "[Apply BGW #%u]" and some do not; I
did not really understand how you decided to prefix or not so I did
not comment about them individually. Are they all OK? Perhaps if you
can explain the reason for the choices I can review it better next
time.

~~~

10. src/backend/replication/logical/worker.c - General

There are multiple places in the code where there is code checking
if/else for bgworker or normal apply worker. And in those places,
there is often a comment like:

"If we are in main apply worker..."

But it is redundant to say "If we are" because we know we are.
Instead, those cases should say a comment at the top of the else like:

/* This is the main apply worker. */

And then the "If we are in main apply worker" text can be removed from
the comment. There are many examples in the patch like this. Please
search and modify all of them.

~~~

11. src/backend/replication/logical/worker.c - file header comment

The whole comment is similar to the commit message so any changes made
there (for #2, #3) should be made here also.

~~~

12. src/backend/replication/logical/worker.c

+typedef struct WorkerEntry
+{
+ TransactionId xid;
+ WorkerState    *wstate;
+} WorkerEntry;

Missing comment for this structure

~~~

13. src/backend/replication/logical/worker.c

WorkerState
WorkerEntry

I felt that these struct names seem too generic - shouldn't they be
something more like ApplyBgworkerState, ApplyBgworkerEntry

~~~

14. src/backend/replication/logical/worker.c

+static List *ApplyWorkersIdleList = NIL;

IMO maybe ApplyWorkersFreeList is a better name than IdleList for
this. "Idle" sounds just like it is paused rather than available for
someone else to use. If you change this then please search the rest of
the patch for mentions in log messages etc

~~~

15. src/backend/replication/logical/worker.c

+static WorkerState *stream_apply_worker = NULL;
+
+/* check if we apply transaction in apply bgworker */
+#define apply_bgworker_active() (in_streamed_transaction &&
stream_apply_worker != NULL)

Wording: "if we apply transaction" -> "if we are applying the transaction"

~~~

16. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ * For the main apply worker, if in streaming mode (receiving a block of
+ * streamed transaction), we send the data to the apply background worker.
+ *
+ * For the apply background worker, define a savepoint if new subtransaction
+ * was started.
  *
  * Returns true for streamed transactions, false otherwise (regular mode).
  */
 static bool
 handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)

16a.
Typo: "if new subtransaction" -> "if a new subtransaction"

16b.
That "regular mode" comment seems not quite right because IIUC it also
returns false also for a bgworker (which hardly seems like a "regular
mode")

~~~

17. src/backend/replication/logical/worker.c - handle_streamed_transaction

- /* not in streaming mode */
- if (!in_streamed_transaction)
+ /*
+ * Return if we are not in streaming mode and are not in an apply
+ * background worker.
+ */
+ if (!in_streamed_transaction && !am_apply_bgworker())
  return false;

Somehow I found this condition confusing, and the comment is not helpful
either because it just says exactly what the code says. Can you give a
better explanatory comment?

e.g.
Maybe the comment should be:
"Return if not in streaming mode (unless this is an apply background worker)"

e.g.
Maybe condition is easier to understand if written as:
if (!(in_streamed_transaction || am_apply_bgworker()))

~~~

18. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ if (action == LOGICAL_REP_MSG_RELATION)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+ logicalrep_relmap_update(rel);
+ }
+
+ }
+ else
+ {
+ /* Add the new subxact to the array (unless already there). */
+ subxact_info_add(current_xid);

Unnecessary blank line.

~~~

19. src/backend/replication/logical/worker.c - find_or_start_apply_bgworker

+ if (found)
+ {
+ entry->wstate->pstate->state = APPLY_BGWORKER_BUSY;
+ return entry->wstate;
+ }
+ else if (!start)
+ return NULL;
+
+ /* If there is at least one worker in the idle list, then take one. */
+ if (list_length(ApplyWorkersIdleList) > 0)

I felt that there should be a comment (after the return NULL) that says:

/*
 * Start a new apply background worker
 */

~~~

20. src/backend/replication/logical/worker.c - apply_bgworker_free

+/*
+ * Add the worker to the freelist and remove the entry from hash table.
+ */
+static void
+apply_bgworker_free(WorkerState *wstate)

20a.
Typo: "freelist" -> "free list"

20b.
Elsewhere (and in the log message) this is called the idle list (but
actually I prefer "free list" like in this comment). See also comment
#14.

~~~

21. src/backend/replication/logical/worker.c - apply_bgworker_free

+ hash_search(ApplyWorkersHash, &xid,
+ HASH_REMOVE, &found);

21a.
If you are not going to check the value of ‘found’ then why bother to
pass this param at all; can’t you just pass NULL? (I think I asked the
same question in a previous review)

21b.
The wrapping over 2 lines seems unnecessary here.

~~~

22. src/backend/replication/logical/worker.c - apply_handle_stream_start

  /*
- * Initialize the worker's stream_fileset if we haven't yet. This will be
- * used for the entire duration of the worker so create it in a permanent
- * context. We create this on the very first streaming message from any
- * transaction and then use it for this and other streaming transactions.
- * Now, we could create a fileset at the start of the worker as well but
- * then we won't be sure that it will ever be used.
+ * If we are in main apply worker, check if there is any free bgworker
+ * we can use to process this transaction.
  */
- if (MyLogicalRepWorker->stream_fileset == NULL)
+ stream_apply_worker = apply_bgworker_find_or_start(stream_xid, first_segment);

22a.
Typo: "in main apply worker" -> "in the main apply worker"

22b.
Since this is not if/else code, it might be better to put
Assert(!am_apply_bgworker()); above this just to make it more clear.

~~~

23. src/backend/replication/logical/worker.c - apply_handle_stream_start

+ /*
+ * If we have free worker or we already started to apply this
+ * transaction in bgworker, we pass the data to worker.
+ */

SUGGESTION
If we have found a free worker or if we are already applying this
transaction in an apply background worker, then we pass the data to
that worker.

~~~

24. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+apply_handle_stream_abort(StringInfo s)
 {
- StringInfoData s2;
- int nchanges;
- char path[MAXPGPATH];
- char    *buffer = NULL;
- MemoryContext oldcxt;
- BufFile    *fd;
+ TransactionId xid;
+ TransactionId subxid;

- maybe_start_skipping_changes(lsn);
+ if (in_streamed_transaction)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("STREAM COMMIT message without STREAM STOP")));

Typo?

Shouldn't that errmsg say "STREAM ABORT message..." instead of "STREAM
COMMIT message..."

~~~

25. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ for(i = list_length(subxactlist) - 1; i >= 0; i--)
+ {

Missing space after "for"

~~~

26. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ if (found)
+ {
+ elog(LOG, "rolled back to savepoint %s", spname);
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);
+ }

Does this need to log anything if nothing was found? Or is it ok to
leave as-is and silently ignore it?

~~~

27. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

+ if (len == 0)
+ {
+ elog(LOG, "[Apply BGW #%u] got zero-length message, stopping", pst->n);
+ break;
+ }

Maybe it is unnecessary to say "stopping" because it will say that in
the next log anyway when it breaks out of the main loop.

~~~

28. src/backend/replication/logical/worker.c - LogicalApplyBgwLoop

+ default:
+ elog(ERROR, "unexpected message");
+ break;

Perhaps the switch byte should be in a variable so that you can log
the unexpected byte code that was received, e.g. similar to the
apply_handle_tuple_routing function.

~~~

29. src/backend/replication/logical/worker.c - LogicalApplyBgwMain

+ /*
+ * The apply bgworker don't need to monopolize this replication origin
+ * which was already acquired by its leader process.
+ */
+ replorigin_session_setup(originid, false);
+ replorigin_session_origin = originid;
+ CommitTransactionCommand();

Typo: The apply bgworker don't need ..."

-> "The apply background workers don't need ..."
or -> "The apply background worker doesn't need ..."

~~~

30. src/backend/replication/logical/worker.c - apply_bgworker_setup

+/*
+ * Start apply worker background worker process and allocate shared memory for
+ * it.
+ */
+static WorkerState *
+apply_bgworker_setup(void)

Typo: "apply worker background worker process" -> "apply background
worker process"

~~~

31. src/backend/replication/logical/worker.c - apply_bgworker_wait_for

+ /* If any workers (or the postmaster) have died, we have failed. */
+ if (status == APPLY_BGWORKER_EXIT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Background worker %u failed to apply transaction %u",
+ wstate->pstate->n, wstate->pstate->stream_xid)));

The errmsg should start with a lowercase letter.

~~~

32. src/backend/replication/logical/worker.c - check_workers_status

+ /*
+ * We don't lock here as in the worst case we will just detect the
+ * failure of worker a bit later.
+ */
+ if (wstate->pstate->state == APPLY_BGWORKER_EXIT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Background worker %u exited unexpectedly",
+ wstate->pstate->n)));

The errmsg should start with a lowercase letter.

~~~

33. src/backend/replication/logical/worker.c - check_workers_status

+/* Set the state of apply background worker */
+static void
+apply_bgworker_set_state(char state)

Maybe OK, or perhaps choose from one of:
- "Set the state of an apply background worker"
- "Set the apply background worker state"

======

34. src/bin/pg_dump/pg_dump.c - getSubscriptions

@@ -4450,7 +4450,7 @@ getSubscriptions(Archive *fout)
  if (fout->remoteVersion >= 140000)
  appendPQExpBufferStr(query, " s.substream,\n");
  else
- appendPQExpBufferStr(query, " false AS substream,\n");
+ appendPQExpBufferStr(query, " 'f' AS substream,\n");


Is that logic right? Before this patch the attribute was bool; now it
is char. So doesn't there need to be some conversion/mapping here when
you read from >= 140000, where it was still bool, i.e. converting
'false' -> 'f' and 'true' -> 't'?

======

35. src/include/replication/origin.h

@@ -53,7 +53,7 @@ extern XLogRecPtr
replorigin_get_progress(RepOriginId node, bool flush);

 extern void replorigin_session_advance(XLogRecPtr remote_commit,
     XLogRecPtr local_commit);
-extern void replorigin_session_setup(RepOriginId node);
+extern void replorigin_session_setup(RepOriginId node, bool acquire);

As previously suggested in comment #8 maybe the 2nd parm should be
'must_acquire'.

======

36. src/include/replication/worker_internal.h

@@ -60,6 +60,8 @@ typedef struct LogicalRepWorker
  */
  FileSet    *stream_fileset;

+ bool subworker;
+

Probably this new member deserves a comment.

------

Kind Regards,
Peter Smith.
Fujitsu Australia



Here are my review comments for v6-0002.

======

1. src/test/subscription/t/015_stream.pl

+################################
+# Test using streaming mode 'on'
+################################
 $node_subscriber->safe_psql('postgres',
  "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
-
 $node_publisher->wait_for_catchup($appname);
-
 # Also wait for initial table sync to finish
 my $synced_query =
   "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
 $node_subscriber->poll_query_until('postgres', $synced_query)
   or die "Timed out while waiting for subscriber to synchronize data";
-
 my $result =
   $node_subscriber->safe_psql('postgres',
  "SELECT count(*), count(c), count(d = 999) FROM test_tab");
 is($result, qq(2|2|2), 'check initial data was copied to subscriber');

1a.
Several whitespace lines were removed by the patch. IMO it looked
better (i.e. less squishy) originally.

1b.
Maybe some more blank lines should be added to the 'apply' test part
too, to match 1a.

~~~

2. src/test/subscription/t/015_stream.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION"?

~~~

3. src/test/subscription/t/016_stream_subxact.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

4. src/test/subscription/t/017_stream_ddl.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

5. .../t/018_stream_subxact_abort.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION" ?

~~~

6. .../t/019_stream_subxact_ddl_abort.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

7. .../subscription/t/023_twophase_stream.pl

###############################
# Check initial data was copied to subscriber
###############################

Perhaps the above comment now looks a bit out-of-place with the extra #####.

Looks better now as just:
# Check initial data was copied to the subscriber

~~~

8. .../subscription/t/023_twophase_stream.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION"?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, May 13, 2022 4:53 PM houzj.fnst@fujitsu.com wrote:
> On Wednesday, May 11, 2022 1:10 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Wed, May 11, 2022 at 9:35 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, May 10, 2022 at 5:59 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > Having it optional seems a good idea. BTW can the user configure
> > > > > how many apply bgworkers can be used per subscription or in the
> > > > > whole system? Like max_sync_workers_per_subscription, is it better
> > > > > to have a configuration parameter or a subscription option for
> > > > > that? If so, setting it to 0 probably means to disable the parallel apply
> > feature.
> > > > >
> > > >
> > > > Yeah, that might be useful but we are already giving an option while
> > > > creating a subscription whether to allow parallelism, so will it be
> > > > useful to give one more way to disable this feature? OTOH, having
> > > > something like max_parallel_apply_workers/max_bg_apply_workers at
> > > > the system level can give better control for how much parallelism
> > > > the user wishes to allow for apply work.
> > >
> > > Or we can have something like
> > > max_parallel_apply_workers_per_subscription that controls how many
> > > parallel apply workers can launch per subscription. That also gives
> > > better control for the number of parallel apply workers.
> > >
> >
> > I think we can go either way in this matter as both have their pros and cons. I
> > feel limiting the parallel workers per subscription gives better control but
> > OTOH, it may not allow max usage of parallelism because some quota from
> > other subscriptions might remain unused. Let us see what Hou-San or others
> > think on this matter?
> 
> Thanks for Amit and Sawada-san's comments !
> I will think over these approaches and reply soon.
After reading the thread, I wrote two patches for these comments.

The first patch (see v6-0003):
Improve the feature as suggested in [1].
For the issue mentioned by Amit-san (the blocking problem in the case
mentioned by Sawada-san), after investigating, I think this issue is caused by
a unique index. So I added a check to make sure the unique columns are the same
between publisher and subscriber.
For other cases, I added a check for whether any non-immutable function is
present in an expression on the subscriber's relation, looking at the following
3 items:
    a. Functions in triggers;
    b. Column default value expressions and domain constraints;
    c. Constraint expressions.
BTW, I did not add partitioned-table-related code. I think that part needs other
additional modifications. I will add it later when those modifications are
finished.
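
The simplest shape of the case this check is about is something like (a
sketch only):

-- publisher
create table test (i int);
-- subscriber
create table test (i int unique);

Since the subscriber has a unique index that the publisher lacks, applying
a streamed transaction in a background worker can hit the problems
discussed above, so the new check detects that the unique columns differ
for such a relation.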

The second patch (see v6-0004):
Improve the feature as suggested in [2].
Add a GUC "max_apply_bgworkers_per_subscription" to control parallelism. This
GUC controls how many apply background workers can be launched per
subscription. I set its default value to 3 and did not change the default
values of the other GUCs.
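
So, if we also keep Sawada-san's idea that 0 disables the feature, turning
parallel apply off for the whole cluster would just be (assuming the GUC
keeps this name):

ALTER SYSTEM SET max_apply_bgworkers_per_subscription = 0;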

[1] - https://www.postgresql.org/message-id/CAA4eK1JwahU_WuP3S%2B7POqta%3DPhm_3gxZeVmJuuoUq1NV%3DkrXA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2B7D4qAQUQEE8zzQ0fGCqeBWd3rzTaY5N0jVs-VXFc_Xw%40mail.gmail.com

Attach the patches. (Did not change v6-0001 and v6-0002.)

Regards,
Wang wei


RE: Perform streaming logical transactions by background workers and parallel apply

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, May 25, 2022 11:25 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> Attach the patches. (Did not change v6-0001 and v6-0002.)
Hi,


Some review comments on the new patches from v6-0001 to v6-0004.

<v6-0001>

(1) create_subscription.sgml

+          the transaction is committed. Note that if an error happens when
+          applying changes in a background worker, it might not report the
+          finish LSN of the remote transaction in the server log.

I suggest adding a couple of sentences like the below
to the logical-replication-conflicts section of logical-replication.sgml.

"
Setting the streaming mode to 'apply' can report an invalid LSN as the
finish LSN of a failed transaction. Changing the streaming mode and
causing the same conflict again writes the finish LSN of the
failed transaction to the server log if required.
"

(2) ApplyBgworkerMain


+       PG_TRY();
+       {
+               LogicalApplyBgwLoop(mqh, pst);
+       }
+       PG_CATCH();
+       {

...

+               pgstat_report_subscription_error(MySubscription->oid, false);
+
+               PG_RE_THROW();
+       }
+       PG_END_TRY();


When I stream an in-progress transaction and it causes an error (a duplicate
key error), seemingly the subscription stats (values in
pg_stat_subscription_stats) don't get updated properly. The 2nd argument
should be true for an apply error.

Also, I observe that both apply_error_count and sync_error_count
get updated together on error. I think we need to check this point as well.
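
For reference, I was checking the counters with a simple query like:

SELECT subname, apply_error_count, sync_error_count FROM pg_stat_subscription_stats;

and after the duplicate key error both counters had been incremented.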


<v6-0003>


(3) logicalrep_write_attrs

+       if (rel->rd_rel->relhasindex)
+       {
+               List       *indexoidlist = RelationGetIndexList(rel);
+               ListCell   *indexoidscan;
+               foreach(indexoidscan, indexoidlist)

and

+                       if (indexRel->rd_index->indisunique)
+                       {
+                               int             i;
+                               /* Add referenced attributes to idindexattrs */
+                               for (i = 0; i < indexRel->rd_index->indnatts; i++)

We don't have each blank line after variable declarations.
There might be some other codes where this point can be applied.
Please check.


(4)

+       /*
+        * If any unique index exist, check that they are same as remoterel.
+        */
+       if (!rel->sameunique)
+               ereport(ERROR,
+                               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                                errmsg("cannot replicate relation with different unique index"),
+                                errhint("Please change the streaming option to 'on' instead of 'apply'.")));


When I create a logical replication setup with different constraints
and let streaming of in-progress transaction run,
I keep getting this error.

This should be documented as a restriction or something,
to let users know the replication progress can't go forward by
any differences written like in the commit-message in v6-0003.

Also, it would be preferable to test this as well, if we
don't dislike having TAP tests for this.


Best Regards,
    Takamichi Osumi



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, May 18, 2022 3:11 PM Peter Smith <smithpb2250@gmail.com> wrote:
> "Here are my review comments for v6-0001.
Thanks for your comments.

> 7. src/backend/replication/logical/launcher.c - logicalrep_worker_stop_internal
> 
> +
> + Assert(LWLockHeldByMe(LogicalRepWorkerLock));
> +
> 
> I think there should be a comment here to say that this lock is
> required/expected to be released by the caller of this function.
IMHO, it may not be a problem to read the code here.
In addition, this keeps it consistent with the other places in the same file
that invoke this function. So I did not change this.

> 9. src/backend/replication/logical/worker.c - General
> 
> Some of the logs have a prefix "[Apply BGW #%u]" and some do not; I
> did not really understand how you decided to prefix or not so I did
> not comment about them individually. Are they all OK? Perhaps if you
> can explain the reason for the choices I can review it better next
> time.
I think most of these logs should be logged in debug mode, so I changed them to
"DEBUG1" level.
And I added the prefix to all messages logged by the apply background worker
and deleted some logs that I think may not be very helpful.

> 11. src/backend/replication/logical/worker.c - file header comment
> 
> The whole comment is similar to the commit message so any changes made
> there (for #2, #3) should be made here also.
Improved the comments as suggested in #2.
Sorry, but I did not find the same message as #2 here.

> 13. src/backend/replication/logical/worker.c
> 
> WorkerState
> WorkerEntry
> 
> I felt that these struct names seem too generic - shouldn't they be
> something more like ApplyBgworkerState, ApplyBgworkerEntry
> 
> ~~~
I think we had already used "ApplyBgworkerState" in the patch, so I improved
this with the following modifications:
```
ApplyBgworkerState -> ApplyBgworkerStatus
WorkerState -> ApplyBgworkerState
WorkerEntry -> ApplyBgworkerEntry
```
BTW, I also modified the relevant comments and variable names.

> 16. src/backend/replication/logical/worker.c - handle_streamed_transaction
> 
> + * For the main apply worker, if in streaming mode (receiving a block of
> + * streamed transaction), we send the data to the apply background worker.
> + *
> + * For the apply background worker, define a savepoint if new subtransaction
> + * was started.
>   *
>   * Returns true for streamed transactions, false otherwise (regular mode).
>   */
>  static bool
>  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
> 
> 16a.
> Typo: "if new subtransaction" -> "if a new subtransaction"
> 
> 16b.
> That "regular mode" comment seems not quite right because IIUC it also
> returns false also for a bgworker (which hardly seems like a "regular
> mode")
16a. Improved it as suggested.
16b. Changed the comment as follows:
From:
```
* Returns true for streamed transactions, false otherwise (regular mode).
```
To:
```
 * For non-streamed transactions, returns false;
 * For streamed transactions, returns true if in main apply worker, false
 * otherwise.
```

> 19. src/backend/replication/logical/worker.c - find_or_start_apply_bgworker
> 
> + if (found)
> + {
> + entry->wstate->pstate->state = APPLY_BGWORKER_BUSY;
> + return entry->wstate;
> + }
> + else if (!start)
> + return NULL;
> +
> + /* If there is at least one worker in the idle list, then take one. */
> + if (list_length(ApplyWorkersIdleList) > 0)
> 
> I felt that there should be a comment (after the return NULL) that says:
> 
> /*
>  * Start a new apply background worker
>  */
> 
> ~~~
Improved this comment.
After the code that you mentioned, it tries to get an apply background
worker (either start one or take one from the idle list). So I changed the
comment as follows:
From:
```
/* If there is at least one worker in the idle list, then take one. */
```
To:
```
/*
 * Now, we try to get an apply background worker.
 * If there is at least one worker in the idle list, then take one.
 * Otherwise, we try to start a new apply background worker.
 */
```

> 22. src/backend/replication/logical/worker.c - apply_handle_stream_start
> 
>   /*
> - * Initialize the worker's stream_fileset if we haven't yet. This will be
> - * used for the entire duration of the worker so create it in a permanent
> - * context. We create this on the very first streaming message from any
> - * transaction and then use it for this and other streaming transactions.
> - * Now, we could create a fileset at the start of the worker as well but
> - * then we won't be sure that it will ever be used.
> + * If we are in main apply worker, check if there is any free bgworker
> + * we can use to process this transaction.
>   */
> - if (MyLogicalRepWorker->stream_fileset == NULL)
> + stream_apply_worker = apply_bgworker_find_or_start(stream_xid,
> first_segment);
> 
> 22a.
> Typo: "in main apply worker" -> "in the main apply worker"
> 
> 22b.
> Since this is not if/else code, it might be better to put
> Assert(!am_apply_bgworker()); above this just to make it more clear.
22a. Improved it as suggested.
22b.
IMHO, since we have `if (am_apply_bgworker())` above and it returns within that
if-condition, I think the Assert() might be a bit redundant here.
So I did not change this.
 
> 26. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> + if (found)
> + {
> + elog(LOG, "rolled back to savepoint %s", spname);
> + RollbackToSavepoint(spname);
> + CommitTransactionCommand();
> + subxactlist = list_truncate(subxactlist, i + 1);
> + }
> 
> Does this need to log anything if nothing was found? Or is it ok to
> leave as-is and silently ignore it?
Yes, I think it is okay.

> 33. src/backend/replication/logical/worker.c - check_workers_status
> 
> +/* Set the state of apply background worker */
> +static void
> +apply_bgworker_set_state(char state)
> 
> Maybe OK, or perhaps choose from one of:
> - "Set the state of an apply background worker"
> - "Set the apply background worker state"
Improve it by using the second one.

> 34. src/bin/pg_dump/pg_dump.c - getSubscriptions
> 
> @@ -4450,7 +4450,7 @@ getSubscriptions(Archive *fout)
>   if (fout->remoteVersion >= 140000)
>   appendPQExpBufferStr(query, " s.substream,\n");
>   else
> - appendPQExpBufferStr(query, " false AS substream,\n");
> + appendPQExpBufferStr(query, " 'f' AS substream,\n");
> 
> 
> Is that logic right? Before this patch the attribute was bool; now it
> is char. So doesn't there need to be some conversion/mapping here for
> when you read from >= 140000 but it was still bool so you need to
> convert 'false' -> 'f' and 'true' -> 't'?
Yes, I think it is right.
We can handle the "streaming" option input as on/true/off/false/apply.

The rest of the comments are improved as suggested.


And thanks to Shi Yu for improving patch 0002 by addressing the comments in
[1].

Attach the new patches(only changed 0001 and 0002)

[1] - https://www.postgresql.org/message-id/CAHut%2BPv_0nfUxriwxBQnZTOF5dy5nfG5NtWMr8e00mPrt2Vjzw%40mail.gmail.com

Regards,
Wang wei

Attachment
On Mon, May 30, 2022 at 2:22 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patches(only changed 0001 and 0002)
>

Few comments/suggestions for 0001 and 0003
=====================================
0001
--------
1.
+ else
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication apply worker for subscription %u", subid);

Can we slightly change the message to: "logical replication background
apply worker for subscription %u"?

2. Can we think of separating the new logic for applying the xact by
bgworker into a new file like applybgwroker or applyparallel? We have
previously done the same in the case of vacuum (see vacuumparallel.c).

3.
+ /*
+ * XXX The publisher side doesn't always send relation update messages
+ * after the streaming transaction, so update the relation in main
+ * apply worker here.
+ */
+ if (action == LOGICAL_REP_MSG_RELATION)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+ logicalrep_relmap_update(rel);
+ }

I think the publisher side won't send the relation update message
after streaming transaction only if it has already been sent for a
non-streaming transaction in which case we don't need to update the
local cache here. This is as per my understanding of
maybe_send_schema(), do let me know if I am missing something? If my
understanding is correct then we don't need this change.

4.
+ * For the main apply worker, if in streaming mode (receiving a block of
+ * streamed transaction), we send the data to the apply background worker.
  *
- * If in streaming mode (receiving a block of streamed transaction), we
- * simply redirect it to a file for the proper toplevel transaction.

This comment is slightly confusing. Can we change it to something
like: "In streaming case (receiving a block of streamed transaction),
for SUBSTREAM_ON mode, we simply redirect it to a file for the proper
toplevel transaction, and for SUBSTREAM_APPLY mode, we send the
changes to background apply worker."?

5.
+apply_handle_stream_abort(StringInfo s)
 {
...
...
+ /*
+ * If the two XIDs are the same, it's in fact abort of toplevel xact,
+ * so just free the subxactlist.
+ */
+ if (subxid == xid)
+ {
+ set_apply_error_context_xact(subxid, InvalidXLogRecPtr);

- fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path, O_RDONLY,
- false);
+ AbortCurrentTransaction();

- buffer = palloc(BLCKSZ);
+ EndTransactionBlock(false);
+ CommitTransactionCommand();
+
+ in_remote_transaction = false;
...
...
}

Here, can we update the replication origin as we are doing in
apply_handle_rollback_prepared? Currently, we don't do it because we
are just cleaning up temporary files for which we don't even have a
transaction. Also, we don't have the required infrastructure to
advance origins for aborts as we have for abort prepared. See commits
[1eb6d6527a][8a812e5106]. If we think it is a good idea then I think
we need to send abort_lsn and abort_time from the publisher and we
need to be careful to make it work with lower subscriber versions that
don't have the facility to process these additional values.

0003
--------
6.
+ /*
+ * If any unique index exist, check that they are same as remoterel.
+ */
+ if (!rel->sameunique)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot replicate relation with different unique index"),
+ errhint("Please change the streaming option to 'on' instead of 'apply'.")));

I think we can do better here. Instead of simply erroring out and
asking the user to change streaming mode, we can remember this in the
system catalog probably in pg_subscription, and then on restart, we
can change the streaming mode to 'on', perform the transaction, and
again change the streaming mode to apply. I am not sure whether we
want to do it in the first version or not, so if you agree with this,
developing it as a separate patch would be a good idea.

Also, please update comments here as to why we don't handle such cases.

-- 
With Regards,
Amit Kapila.



On Mon, May 30, 2022 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 30, 2022 at 2:22 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > Attach the new patches(only changed 0001 and 0002)
> >
>

This patch allows the same replication origin to be used by the main
apply worker and the bgworker that uses it to apply streaming
transactions. See the changes [1] in the patch. I am not completely
sure whether that is a good idea even though I could not spot or think
of problems that can't be fixed in your patch. I see that currently
both the main apply worker and bgworker will assign MyProcPid to the
assigned origin slot, this can create the problem because
ReplicationOriginExitCleanup() can clean it up even though the main
apply worker or another bgworker is still using that origin slot. Now,
one way to fix is that we assign only the main apply worker's
MyProcPid to session_replication_state->acquired_by. I have tried to
think about the concurrency issues as multiple workers could now point
to the same replication origin state. I think it is safe because the
patch maintains the commit order by allowing only one process to
commit at a time, so no two workers will be operating on the same
origin at the same time. Even, though there is no case where the patch
will try to advance the session's origin concurrently, it appears safe
to do so as we change/advance the session_origin LSNs under
replicate_state LWLock.

Another idea could be that we allow multiple replication origins (one
for each bgworker and one for the main apply worker) for the apply
workers corresponding to a subscription. Then on restart, we can find
the highest LSN among all the origins for a subscription. This should
work primarily because we will maintain the commit order. Now, for
this to work we need to somehow map all the origins for a subscription
and one possibility is that we have a subscription id in each of the
origin names. Currently we use ("pg_%u", MySubscription->oid) as
origin_name. We can probably append some unique identifier number for
each worker to allow each origin to have a subscription id. We need to
drop all origins for a particular subscription on DROP SUBSCRIPTION. I
think having multiple origins for the same subscription will have some
additional work when we try to filter changes based on origin.
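Just to visualize the naming part of this idea, something like the below
(purely illustrative; "worker_slot" is a hypothetical per-worker number, and
the actual scheme would need more thought):

```
/*
 * Illustrative only: one origin per worker, all carrying the subscription
 * OID so that they can be found (and dropped) together on DROP SUBSCRIPTION.
 */
char		originname[NAMEDATALEN];

if (am_apply_bgworker())
	snprintf(originname, sizeof(originname), "pg_%u_%d",
			 MySubscription->oid, worker_slot);
else
	snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);
```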

The advantage of the first idea is that it won't increase the need to
have more origins per subscription but it is quite possible that I am
missing something and there are problems due to which we can't use
that approach.

Thoughts?

[1]:
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, bool acquire)
 {
  static bool registered_cleanup;
  int i;
@@ -1110,7 +1110,7 @@ replorigin_session_setup(RepOriginId node)
  if (curstate->roident != node)
  continue;

- else if (curstate->acquired_by != 0)
+ else if (curstate->acquired_by != 0 && acquire)
  {
...
...

+ /*
+ * The apply bgworker don't need to monopolize this replication origin
+ * which was already acquired by its leader process.
+ */
+ replorigin_session_setup(originid, false);
+ replorigin_session_origin = originid;

-- 
With Regards,
Amit Kapila.



On Tue, May 31, 2022 at 5:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 30, 2022 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 30, 2022 at 2:22 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new patches(only changed 0001 and 0002)
> > >
> >
>
> This patch allows the same replication origin to be used by the main
> apply worker and the bgworker that uses it to apply streaming
> transactions. See the changes [1] in the patch. I am not completely
> sure whether that is a good idea even though I could not spot or think
> of problems that can't be fixed in your patch. I see that currently
> both the main apply worker and bgworker will assign MyProcPid to the
> assigned origin slot, this can create the problem because
> ReplicationOriginExitCleanup() can clean it up even though the main
> apply worker or another bgworker is still using that origin slot.

Good point.

> Now,
> one way to fix is that we assign only the main apply worker's
> MyProcPid to session_replication_state->acquired_by. I have tried to
> think about the concurrency issues as multiple workers could now point
> to the same replication origin state. I think it is safe because the
> patch maintains the commit order by allowing only one process to
> commit at a time, so no two workers will be operating on the same
> origin at the same time. Even, though there is no case where the patch
> will try to advance the session's origin concurrently, it appears safe
> to do so as we change/advance the session_origin LSNs under
> replicate_state LWLock.

Right. That way, the cleanup is done only by the main apply worker.
Probably the bgworker can check if the origin is already acquired by
its (leader) main apply worker process for safety.

>
> Another idea could be that we allow multiple replication origins (one
> for each bgworker and one for the main apply worker) for the apply
> workers corresponding to a subscription. Then on restart, we can find
> the highest LSN among all the origins for a subscription. This should
> work primarily because we will maintain the commit order. Now, for
> this to work we need to somehow map all the origins for a subscription
> and one possibility is that we have a subscription id in each of the
> origin names. Currently we use ("pg_%u", MySubscription->oid) as
> origin_name. We can probably append some unique identifier number for
> each worker to allow each origin to have a subscription id. We need to
> drop all origins for a particular subscription on DROP SUBSCRIPTION. I
> think having multiple origins for the same subscription will have some
> additional work when we try to filter changes based on origin.

It also seems to work but needs additional work and resources.

> The advantage of the first idea is that it won't increase the need to
> have more origins per subscription but it is quite possible that I am
> missing something and there are problems due to which we can't use
> that approach.

I prefer the first idea as it's simpler than the second one. I don't
see any concurrency problem so far, unless I'm missing something.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, Jun 1, 2022 at 7:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 31, 2022 at 5:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 30, 2022 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 30, 2022 at 2:22 PM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > Attach the new patches(only changed 0001 and 0002)
> > > >
> > >
> >
> > This patch allows the same replication origin to be used by the main
> > apply worker and the bgworker that uses it to apply streaming
> > transactions. See the changes [1] in the patch. I am not completely
> > sure whether that is a good idea even though I could not spot or think
> > of problems that can't be fixed in your patch. I see that currently
> > both the main apply worker and bgworker will assign MyProcPid to the
> > assigned origin slot, this can create the problem because
> > ReplicationOriginExitCleanup() can clean it up even though the main
> > apply worker or another bgworker is still using that origin slot.
>
> Good point.
>
> > Now,
> > one way to fix is that we assign only the main apply worker's
> > MyProcPid to session_replication_state->acquired_by. I have tried to
> > think about the concurrency issues as multiple workers could now point
> > to the same replication origin state. I think it is safe because the
> > patch maintains the commit order by allowing only one process to
> > commit at a time, so no two workers will be operating on the same
> > origin at the same time. Even, though there is no case where the patch
> > will try to advance the session's origin concurrently, it appears safe
> > to do so as we change/advance the session_origin LSNs under
> > replicate_state LWLock.
>
> Right. That way, the cleanup is done only by the main apply worker.
> Probably the bgworker can check if the origin is already acquired by
> its (leader) main apply worker process for safety.
>

Yeah, that makes sense.

> >
> > Another idea could be that we allow multiple replication origins (one
> > for each bgworker and one for the main apply worker) for the apply
> > workers corresponding to a subscription. Then on restart, we can find
> > the highest LSN among all the origins for a subscription. This should
> > work primarily because we will maintain the commit order. Now, for
> > this to work we need to somehow map all the origins for a subscription
> > and one possibility is that we have a subscription id in each of the
> > origin names. Currently we use ("pg_%u", MySubscription->oid) as
> > origin_name. We can probably append some unique identifier number for
> > each worker to allow each origin to have a subscription id. We need to
> > drop all origins for a particular subscription on DROP SUBSCRIPTION. I
> > think having multiple origins for the same subscription will have some
> > additional work when we try to filter changes based on origin.
>
> It also seems to work but needs additional work and resources.
>
> > The advantage of the first idea is that it won't increase the need to
> > have more origins per subscription but it is quite possible that I am
> > missing something and there are problems due to which we can't use
> > that approach.
>
> I prefer the first idea as it's simpler than the second one. I don't
> see any concurrency problem so far, unless I'm missing something.
>

Thanks for evaluating it and sharing your opinion.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jun 1, 2022 1:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jun 1, 2022 at 7:30 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Tue, May 31, 2022 at 5:53 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Mon, May 30, 2022 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > > On Mon, May 30, 2022 at 2:22 PM wangw.fnst@fujitsu.com
> > > > <wangw.fnst@fujitsu.com> wrote:
> > > > >
> > > > > Attach the new patches(only changed 0001 and 0002)
> > > > >
> > > >
> > >
> > > This patch allows the same replication origin to be used by the main
> > > apply worker and the bgworker that uses it to apply streaming
> > > transactions. See the changes [1] in the patch. I am not completely
> > > sure whether that is a good idea even though I could not spot or think
> > > of problems that can't be fixed in your patch. I see that currently
> > > both the main apply worker and bgworker will assign MyProcPid to the
> > > assigned origin slot, this can create the problem because
> > > ReplicationOriginExitCleanup() can clean it up even though the main
> > > apply worker or another bgworker is still using that origin slot.
> >
> > Good point.
> >
> > > Now,
> > > one way to fix is that we assign only the main apply worker's
> > > MyProcPid to session_replication_state->acquired_by. I have tried to
> > > think about the concurrency issues as multiple workers could now point
> > > to the same replication origin state. I think it is safe because the
> > > patch maintains the commit order by allowing only one process to
> > > commit at a time, so no two workers will be operating on the same
> > > origin at the same time. Even, though there is no case where the patch
> > > will try to advance the session's origin concurrently, it appears safe
> > > to do so as we change/advance the session_origin LSNs under
> > > replicate_state LWLock.
> >
> > Right. That way, the cleanup is done only by the main apply worker.
> > Probably the bgworker can check if the origin is already acquired by
> > its (leader) main apply worker process for safety.
> >
> 
> Yeah, that makes sense.
> 
> > >
> > > Another idea could be that we allow multiple replication origins (one
> > > for each bgworker and one for the main apply worker) for the apply
> > > workers corresponding to a subscription. Then on restart, we can find
> > > the highest LSN among all the origins for a subscription. This should
> > > work primarily because we will maintain the commit order. Now, for
> > > this to work we need to somehow map all the origins for a subscription
> > > and one possibility is that we have a subscription id in each of the
> > > origin names. Currently we use ("pg_%u", MySubscription->oid) as
> > > origin_name. We can probably append some unique identifier number for
> > > each worker to allow each origin to have a subscription id. We need to
> > > drop all origins for a particular subscription on DROP SUBSCRIPTION. I
> > > think having multiple origins for the same subscription will have some
> > > additional work when we try to filter changes based on origin.
> >
> > It also seems to work but needs additional work and resources.
> >
> > > The advantage of the first idea is that it won't increase the need to
> > > have more origins per subscription but it is quite possible that I am
> > > missing something and there are problems due to which we can't use
> > > that approach.
> >
> > I prefer the first idea as it's simpler than the second one. I don't
> > see any concurrency problem so far, unless I'm missing something.
> >
> 
> Thanks for evaluating it and sharing your opinion.
Thanks for your comments and opinions.

I fixed this problem by following the first suggestion. I also added the
relevant checks and changed the relevant comments.

Thanks to Shi Yu for adding some tests as suggested by Osumi-san in [1].#4 and
for improving the 0002 patch by adding some checks to see if the apply
background worker starts.

Attach the new patches.
1. Add some descriptions related to "apply" mode to logical-replication.sgml
and create_subscription.sgml.(suggested by Osumi-san in [1].#1,#4)
2. Fix the problem that values in pg_stat_subscription_stats are not updated
properly. (suggested by Osumi-san in [1].#2)
3. Improve the code formatting of the patches. (suggested by Osumi-san in [1].#3)
4. Add some tests in 0003 patch. And improve some tests by adding some checks
to see if the apply background worker starts in 0002 patch. (suggested by
Osumi-san in [1].#4 and Shi Yu)
5. Improve the log message. (suggested by Amit-san in [2].#1)
6. Separate the new logic related to the apply background worker into the new
file applybgwroker.c. (suggested by Amit-san in [2].#2)
7. Improve function handle_streamed_transaction. (suggested by Amit-san in [2].#3)
8. Improve some comments. (suggested by Amit-san in [2].#4,#6 and by me)
9. Fix the problem that the structure member "acquired_by" is incorrectly set
when the apply background worker tries to get the replication origin.
(suggested by Amit-san in [3])

[1] -
https://www.postgresql.org/message-id/TYCPR01MB83735AEE38370254ED495B06EDDA9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/CAA4eK1Jt08SYbRt_-rbSWNg%3DX9-m8%2BRdP5PosfnQgyF-z8bkxQ%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1%2BZ6ahpTQK2KzkvQ1kN-urVS9-N_RDM11MS%2BbtqaB8Bpw%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, May 30, 2022 7:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments/suggestions for 0001 and 0003
> =====================================
> 0001
> --------
Thanks for your comments.

> 1.
> + else
> + snprintf(bgw.bgw_name, BGW_MAXLEN,
> + "logical replication apply worker for subscription %u", subid);
> 
> Can we slightly change the message to: "logical replication background
> apply worker for subscription %u"?
Improve the message as suggested.

> 2. Can we think of separating the new logic for applying the xact by
> bgworker into a new file like applybgwroker or applyparallel? We have
> previously done the same in the case of vacuum (see vacuumparallel.c).
Improved the patch as suggested. I separated the new logic related to the apply
background worker into the new file src/backend/replication/logical/applybgwroker.c.

> 3.
> + /*
> + * XXX The publisher side doesn't always send relation update messages
> + * after the streaming transaction, so update the relation in main
> + * apply worker here.
> + */
> + if (action == LOGICAL_REP_MSG_RELATION)
> + {
> + LogicalRepRelation *rel = logicalrep_read_rel(s);
> + logicalrep_relmap_update(rel);
> + }
> 
> I think the publisher side won't send the relation update message
> after streaming transaction only if it has already been sent for a
> non-streaming transaction in which case we don't need to update the
> local cache here. This is as per my understanding of
> maybe_send_schema(), do let me know if I am missing something? If my
> understanding is correct then we don't need this change.
I think we need this change because the publisher invokes the function
cleanup_rel_sync_cache when committing a streaming transaction, and then sets
"schema_sent" to true for the related entry. Later, the publisher may not send
this schema in the function maybe_send_schema because the schema has already
been sent (schema_sent = true).
If we do not have this change, it would cause an error in the following case:
Suppose that after the walsender starts, we first commit a streaming
transaction. The walsender sends the relation update message, and only the
apply background worker can update its relation map cache from this message.
After this, if we commit a non-streamed transaction that touches the same
replicated table, the walsender will not send a relation update message, so
the main apply worker would not get one.
So I think we need this change to update the relation map cache not only in
the apply background worker but also in the main apply worker.
In addition, we should also handle the LOGICAL_REP_MSG_TYPE message just like
LOGICAL_REP_MSG_RELATION, so I improved the code you mentioned. BTW, I also
simplified the function handle_streamed_transaction().

> 4.
> + * For the main apply worker, if in streaming mode (receiving a block of
> + * streamed transaction), we send the data to the apply background worker.
>   *
> - * If in streaming mode (receiving a block of streamed transaction), we
> - * simply redirect it to a file for the proper toplevel transaction.
> 
> This comment is slightly confusing. Can we change it to something
> like: "In streaming case (receiving a block of streamed transaction),
> for SUBSTREAM_ON mode, we simply redirect it to a file for the proper
> toplevel transaction, and for SUBSTREAM_APPLY mode, we send the
> changes to background apply worker."?
Improve the comments as suggested.

> 5.
> +apply_handle_stream_abort(StringInfo s)
>  {
> ...
> ...
> + /*
> + * If the two XIDs are the same, it's in fact abort of toplevel xact,
> + * so just free the subxactlist.
> + */
> + if (subxid == xid)
> + {
> + set_apply_error_context_xact(subxid, InvalidXLogRecPtr);
> 
> - fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path,
> O_RDONLY,
> - false);
> + AbortCurrentTransaction();
> 
> - buffer = palloc(BLCKSZ);
> + EndTransactionBlock(false);
> + CommitTransactionCommand();
> +
> + in_remote_transaction = false;
> ...
> ...
> }
> 
> Here, can we update the replication origin as we are doing in
> apply_handle_rollback_prepared? Currently, we don't do it because we
> are just cleaning up temporary files for which we don't even have a
> transaction. Also, we don't have the required infrastructure to
> advance origins for aborts as we have for abort prepared. See commits
> [1eb6d6527a][8a812e5106]. If we think it is a good idea then I think
> we need to send abort_lsn and abort_time from the publisher and we
> need to be careful to make it work with lower subscriber versions that
> don't have the facility to process these additional values.
I think it is a good idea. I will consider this and add this part in the next
version.

> 0003
> --------
> 6.
> + /*
> + * If any unique index exist, check that they are same as remoterel.
> + */
> + if (!rel->sameunique)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("cannot replicate relation with different unique index"),
> + errhint("Please change the streaming option to 'on' instead of 'apply'.")));
> 
> I think we can do better here. Instead of simply erroring out and
> asking the user to change streaming mode, we can remember this in the
> system catalog probably in pg_subscription, and then on restart, we
> can change the streaming mode to 'on', perform the transaction, and
> again change the streaming mode to apply. I am not sure whether we
> want to do it in the first version or not, so if you agree with this,
> developing it as a separate patch would be a good idea.
> 
> Also, please update comments here as to why we don't handle such cases.
Yes, I think it is a good idea. I will develop it as a separate patch later.
And I added the following comments atop the function
apply_bgworker_relation_check:
```
 * Although we maintain the commit order by allowing only one process to
 * commit at a time, our access order to the relation has changed.
 * This could cause unexpected problems if the unique column on the replicated
 * table is inconsistent with the publisher side or contains non-immutable
 * functions when applying transactions in the apply background worker.
```

I also made some other changes. The new patches and the modification details
were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62758A881FF3240171B7B21B9EDE9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Sun, May 29, 2022 8:25 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> Hi,
> 
> 
> Some review comments on the new patches from v6-0001 to v6-0004.
Thanks for your comments.

> <v6-0001>
> 
> (1) create_subscription.sgml
> 
> +          the transaction is committed. Note that if an error happens when
> +          applying changes in a background worker, it might not report the
> +          finish LSN of the remote transaction in the server log.
> 
> I suggest to add a couple of sentences like below
> to the section of logical-replication-conflicts in logical-replication.sgml.
> 
> "
> Setting streaming mode to 'apply' can export invalid LSN as
> finish LSN of failed transaction. Changing the streaming mode and
> making the same conflict writes the finish LSN of the
> failed transaction in the server log if required.
> "
Add the sentences as suggested.

> (2) ApplyBgworkerMain
> 
> 
> +       PG_TRY();
> +       {
> +               LogicalApplyBgwLoop(mqh, pst);
> +       }
> +       PG_CATCH();
> +       {
> 
> ...
> 
> +               pgstat_report_subscription_error(MySubscription->oid, false);
> +
> +               PG_RE_THROW();
> +       }
> +       PG_END_TRY();
> 
> 
> When I stream a transaction in-progress and it causes an error(duplication error),
> seemingly the subscription stats (values in pg_stat_subscription_stats) don't
> get updated properly. The 2nd argument should be true for apply error.
> 
> Also, I observe that both apply_error_count and sync_error_count
> get updated together by error. I think we need to check this point as well.
Yes, we should pass "true" as the 2nd argument here to log an "apply error".
And after checking the second point you mentioned, I think it is caused by the
first point you mentioned plus another reason:
With the v6 (or v7) patch and the "apply" option specified, when a streamed
transaction causes an error (duplication error), the function
pgstat_report_subscription_error is invoked twice (in the main apply worker and
the apply background worker; see ApplyWorkerMain()->start_apply() and
ApplyBgworkerMain). This means that for the same error we send the stats
message twice.
So to fix this, I removed the code that you mentioned and now just invoke the
function LogicalApplyBgwLoop here.

> <v6-0003>
> 
> 
> (3) logicalrep_write_attrs
> 
> +       if (rel->rd_rel->relhasindex)
> +       {
> +               List       *indexoidlist = RelationGetIndexList(rel);
> +               ListCell   *indexoidscan;
> +               foreach(indexoidscan, indexoidlist)
> 
> and
> 
> +                       if (indexRel->rd_index->indisunique)
> +                       {
> +                               int             i;
> +                               /* Add referenced attributes to idindexattrs */
> +                               for (i = 0; i < indexRel->rd_index->indnatts; i++)
> 
> We don't have each blank line after variable declarations.
> There might be some other codes where this point can be applied.
> Please check.
Improved the formatting as you suggested, and ran pgindent on the new patches.

> (4)
> 
> +       /*
> +        * If any unique index exist, check that they are same as remoterel.
> +        */
> +       if (!rel->sameunique)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                                errmsg("cannot replicate relation with different unique index"),
> +                                errhint("Please change the streaming option to 'on' instead of
> 'apply'.")));
> 
> 
> When I create a logical replication setup with different constraints
> and let streaming of in-progress transaction run,
> I keep getting this error.
> 
> This should be documented as a restriction or something,
> to let users know the replication progress can't go forward by
> any differences written like in the commit-message in v6-0003.
> 
> Also, it would be preferable to test this as well, if we
> don't dislike having TAP tests for this.
Yes, you are right. Thanks for the reminder.
I added this to the paragraph introducing the value "apply" in
create_subscription.sgml:
```
To run in this mode, there are the following two requirements. The first
is that the unique columns should be the same between publisher and
subscriber; the second is that there should not be any non-immutable
function in the subscriber-side replicated table.
Also added the related tests. (refer to 032_streaming_apply.pl in v8-0003)

I also made some other changes. The new patches and the modification details
were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62758A881FF3240171B7B21B9EDE9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Jun 2, 2022 6:02 PM I wrote:
> Attach the new patches.

I tried to improve the patches with the following two points:

1. Improved the patch as suggested by Amit-san, as I mentioned in [1].
When the publisher sends a "STREAM ABORT" message to the subscriber, the LSN
and time of the abort are now added to this message (see function
logicalrep_write_stream_abort). When the subscriber receives this message, it
updates the replication origin (see functions apply_handle_stream_abort and
RecordTransactionAbort). A rough sketch follows this list.

2. Fixed missing settings for two GUCs (session_replication_role and
search_path) in the apply background worker in patch 0001, and improved the
checking of trigger functions in patch 0003.
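
A rough sketch of the extended "STREAM ABORT" write function for the first
point (illustrative only; the attached patch has the actual message format and
the version handling for older subscribers):

```
/*
 * Illustrative sketch: the message now also carries the abort LSN and abort
 * timestamp so that the subscriber can update the replication origin.
 */
void
logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
							  TransactionId subxid,
							  XLogRecPtr abort_lsn, TimestampTz abort_time)
{
	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_ABORT);

	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));

	/* transaction ID and subtransaction ID being aborted */
	pq_sendint32(out, xid);
	pq_sendint32(out, subxid);

	/* new fields: abort LSN and abort time */
	pq_sendint64(out, abort_lsn);
	pq_sendint64(out, abort_time);
}
```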

Thanks to Hou Zhi Jie for adding the abort-message-related infrastructure for
the first point.
Thanks to Shi Yu for pointing out the second point.

Attach the new patches.(only changed 0001 and 0003)

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275FBD9359F8ED0EDE7E5459EDE9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jun 8, 2022 3:13 PM I wrote:
> Attach the new patches.(only changed 0001 and 0003)

I tried to improve the patches with the following points:

1. Initialized the variable include_abort_lsn to false; cfbot reported a
warning about it. (see patch v10-0001)
BTW, I merged the patch that added the new GUC (see v9-0004) into patch 0001.

2. Because of improvement #2 in [1], foreign keys could not be detected when
checking trigger functions, so I added additional checks for foreign keys.
(see patch 0004)

3. Added a check for partitioned tables when trying to apply changes in the
apply background worker. (see patch 0004)
In addition, the partition cache map on the subscriber has several bugs (see
thread [2]). Because patch 0004 is developed based on the patches in [2], I
merged the patches (v4-0001~v4-0003) from [2] into a temporary patch 0003 here.
After the patches in [2] are committed, I will delete patch 0003 and rebase
patch 0004.

4. Improved constraint checking in a separate patch as suggested by Amit-san in
[3] #6. (see patch 0005)
I added a new field "bool subretry" to the catalog pg_subscription. This field
indicates whether the transaction that we are going to process has failed
before.
If the apply worker/bgworker exits with an error, this field is set to true;
if we successfully apply a transaction, it is set to false.
When we retry to apply a streaming transaction, whether the user sets the
streaming option to "on" or "apply", we apply the transaction in the main
apply worker (a minimal sketch of this fallback follows this list).
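
A minimal sketch of the fallback mentioned in #4 (illustrative only; the
exact place this check lives is up to the patch):

```
/*
 * Illustrative only: if the previous attempt at applying a transaction for
 * this subscription failed (subretry is true), do not hand the streamed
 * transaction to an apply background worker; the main apply worker will
 * process it as with streaming = on (spilling changes to disk).
 */
if (MySubscription->subretry)
	return NULL;		/* caller falls back to serial apply */
```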

Attach the new patches.
Only changed patches 0001, 0004 and added new separate patch 0005.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275208A2F8ED832710F65E09EA49%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] -
https://www.postgresql.org/message-id/flat/OSZPR01MB6310F46CD425A967E4AEF736FDA49%40OSZPR01MB6310.jpnprd01.prod.outlook.com
[3] - https://www.postgresql.org/message-id/CAA4eK1Jt08SYbRt_-rbSWNg%3DX9-m8%2BRdP5PosfnQgyF-z8bkxQ%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Jun 14, 2022 11:17 AM I wrote:
> Attach the new patches.
> ......
> 3. Added a check for partitioned tables when trying to apply changes in the
> apply background worker. (see patch 0004)
> In addition, the partition cache map on the subscriber has several bugs (see
> thread [2]). Because patch 0004 is developed based on the patches in [2], I
> merged the patches (v4-0001~v4-0003) from [2] into a temporary patch 0003 here.
> After the patches in [2] are committed, I will delete patch 0003 and rebase
> patch 0004.
I added some test cases for this (see patch 0004). In patch 0005, I made
corresponding adjustments according to these test cases.
I also slightly modified the comments about the check for unique indexes. (see
patch 0004)

Also rebased the temporary patch 0003 because the first patch in thread [1]
has been committed (see commit 5a97b132 in HEAD).

Attach the new patches.
Only changed patches 0004, 0005.

[1] -
https://www.postgresql.org/message-id/OSZPR01MB6310F46CD425A967E4AEF736FDA49%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Regards,
Wang wei

Attachment
On Tue, Jun 14, 2022 at 9:07 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
>
> Attach the new patches.
> Only changed patches 0001, 0004 and added new separate patch 0005.
>

Few questions/comments on 0001
===========================
1.
In the commit message, I see: "We also need to allow stream_stop to
complete by the apply background worker to avoid deadlocks because
T-1's current stream of changes can update rows in conflicting order
with T-2's next stream of changes."

Thinking about this, won't the T-1 and T-2 deadlock on the publisher
node as well if the above statement is true?

2.
+       <para>
+        The apply background workers are taken from the pool defined by
+        <varname>max_logical_replication_workers</varname>.
+       </para>
+       <para>
+        The default value is 3. This parameter can only be set in the
+        <filename>postgresql.conf</filename> file or on the server command
+        line.
+       </para>

Is there a reason to choose this number as 3? Why not 2 similar to
max_sync_workers_per_subscription?

3.
+
+  <para>
+   Setting streaming mode to <literal>apply</literal> could export invalid LSN
+   as finish LSN of failed transaction. Changing the streaming mode and making
+   the same conflict writes the finish LSN of the failed transaction in the
+   server log if required.
+  </para>

How will the user identify that this is an invalid LSN value and she
shouldn't use it to SKIP the transaction? Can we change the second
sentence to: "User should change the streaming mode to 'on' if they
would instead wish to see the finish LSN on error. Users can use
finish LSN to SKIP applying the transaction." I think we can give
reference to docs where the SKIP feature is explained.

4.
+ * This file contains routines that are intended to support setting up, using,
+ * and tearing down a ApplyBgworkerState.
+ * Refer to the comments in file header of logical/worker.c to see more
+ * informations about apply background worker.

Typo. /informations/information.

Consider having an empty line between the above two lines.

5.
+ApplyBgworkerState *
+apply_bgworker_find_or_start(TransactionId xid, bool start)
{
...
...
+ if (!TransactionIdIsValid(xid))
+ return NULL;
+
+ /*
+ * We don't start new background worker if we are not in streaming apply
+ * mode.
+ */
+ if (MySubscription->stream != SUBSTREAM_APPLY)
+ return NULL;
+
+ /*
+ * We don't start new background worker if user has set skiplsn as it's
+ * possible that user want to skip the streaming transaction. For
+ * streaming transaction, we need to spill the transaction to disk so that
+ * we can get the last LSN of the transaction to judge whether to skip
+ * before starting to apply the change.
+ */
+ if (start && !XLogRecPtrIsInvalid(MySubscription->skiplsn))
+ return NULL;
+
+ /*
+ * For streaming transactions that are being applied in apply background
+ * worker, we cannot decide whether to apply the change for a relation
+ * that is not in the READY state (see should_apply_changes_for_rel) as we
+ * won't know remote_final_lsn by that time. So, we don't start new apply
+ * background worker in this case.
+ */
+ if (start && !AllTablesyncsReady())
+ return NULL;
...
...
}

Can we move some of these starting checks to a separate function like
canstartapplybgworker()?
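
For example, something along these lines (just a sketch that reuses the
checks quoted above; the function name follows the suggestion and is not in
the current patch):

```
/* Sketch only: centralize the "can we start a bgworker?" checks. */
static bool
apply_bgworker_can_start(TransactionId xid)
{
	if (!TransactionIdIsValid(xid))
		return false;

	/* Only usable when the subscription is in streaming 'apply' mode. */
	if (MySubscription->stream != SUBSTREAM_APPLY)
		return false;

	/*
	 * Don't start a worker if the user has set skiplsn, as the streamed
	 * transaction may need to be spilled to disk to find its last LSN
	 * before deciding whether to skip it.
	 */
	if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
		return false;

	/* All tables must be READY (see should_apply_changes_for_rel). */
	if (!AllTablesyncsReady())
		return false;

	return true;
}
```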

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jun 15, 2022 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few questions/comments on 0001
> ===========================
Thanks for your comments.

> 1.
> In the commit message, I see: "We also need to allow stream_stop to
> complete by the apply background worker to avoid deadlocks because
> T-1's current stream of changes can update rows in conflicting order
> with T-2's next stream of changes."
> 
> Thinking about this, won't the T-1 and T-2 deadlock on the publisher
> node as well if the above statement is true?
Yes, I think so.
I think if the table's unique indexes/constraints on the publisher and the
subscriber are consistent, the deadlock will occur on the publisher side.
If they are inconsistent, the deadlock may occur only on the subscriber. But
since we added the check for these (see patch 0004), it seems okay not to
handle this at STREAM_STOP.

BTW, I made the following improvements to the code (#a and #c are improved in
the 0004 patch; #b, #d, and #e are improved in the 0001 patch):
a.
I added some comments in the function apply_handle_stream_stop to explain why
we do not need to allow stream_stop to complete by the apply background worker.
b.
I deleted the related commit message in the 0001 patch and the related comments
in the file header (worker.c).
c.
Renamed the function logicalrep_rel_mark_apply_bgworker to
logicalrep_rel_mark_safe_in_apply_bgworker. Also made some slight improvements
in this function.
d.
When the apply worker sends streamed transaction messages to the apply
background worker, it now waits for the apply background worker to complete
only at the commit, prepare, or abort of the toplevel transaction (see the
sketch after this list).
e.
The state setting of the apply background worker was not very accurate before,
so I improved this (see the calls to pgstat_report_activity in
LogicalApplyBgwLoop, apply_handle_stream_start, and
apply_handle_stream_abort).
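
A minimal sketch of what #d means (illustrative only; the actual change is in
the attached 0001 patch):

```
/*
 * Illustrative only: after forwarding a streamed message to the apply
 * background worker, wait for it only at end-of-transaction messages
 * instead of at every stream_stop.
 */
switch (action)
{
	case LOGICAL_REP_MSG_STREAM_COMMIT:
	case LOGICAL_REP_MSG_STREAM_PREPARE:
	case LOGICAL_REP_MSG_STREAM_ABORT:
		apply_bgworker_wait_for(stream_apply_worker, APPLY_BGWORKER_FINISHED);
		break;
	default:
		break;
}
```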

> 2.
> +       <para>
> +        The apply background workers are taken from the pool defined by
> +        <varname>max_logical_replication_workers</varname>.
> +       </para>
> +       <para>
> +        The default value is 3. This parameter can only be set in the
> +        <filename>postgresql.conf</filename> file or on the server command
> +        line.
> +       </para>
> 
> Is there a reason to choose this number as 3? Why not 2 similar to
> max_sync_workers_per_subscription?
Improved the default as suggested.

> 3.
> +
> +  <para>
> +   Setting streaming mode to <literal>apply</literal> could export invalid LSN
> +   as finish LSN of failed transaction. Changing the streaming mode and making
> +   the same conflict writes the finish LSN of the failed transaction in the
> +   server log if required.
> +  </para>
> 
> How will the user identify that this is an invalid LSN value and she
> shouldn't use it to SKIP the transaction? Can we change the second
> sentence to: "User should change the streaming mode to 'on' if they
> would instead wish to see the finish LSN on error. Users can use
> finish LSN to SKIP applying the transaction." I think we can give
> reference to docs where the SKIP feature is explained.
Improved the sentence as suggested.
And I added the reference after the statement in your suggestion.
It looks like:
```
... Users can use finish LSN to SKIP applying the transaction by running <link
linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
SKIP</command></link>.
```

> 4.
> + * This file contains routines that are intended to support setting up, using,
> + * and tearing down a ApplyBgworkerState.
> + * Refer to the comments in file header of logical/worker.c to see more
> + * informations about apply background worker.
> 
> Typo. /informations/information.
> 
> Consider having an empty line between the above two lines.
Improved the message as suggested.

> 5.
> +ApplyBgworkerState *
> +apply_bgworker_find_or_start(TransactionId xid, bool start)
> {
> ...
> ...
> + if (!TransactionIdIsValid(xid))
> + return NULL;
> +
> + /*
> + * We don't start new background worker if we are not in streaming apply
> + * mode.
> + */
> + if (MySubscription->stream != SUBSTREAM_APPLY)
> + return NULL;
> +
> + /*
> + * We don't start new background worker if user has set skiplsn as it's
> + * possible that user want to skip the streaming transaction. For
> + * streaming transaction, we need to spill the transaction to disk so that
> + * we can get the last LSN of the transaction to judge whether to skip
> + * before starting to apply the change.
> + */
> + if (start && !XLogRecPtrIsInvalid(MySubscription->skiplsn))
> + return NULL;
> +
> + /*
> + * For streaming transactions that are being applied in apply background
> + * worker, we cannot decide whether to apply the change for a relation
> + * that is not in the READY state (see should_apply_changes_for_rel) as we
> + * won't know remote_final_lsn by that time. So, we don't start new apply
> + * background worker in this case.
> + */
> + if (start && !AllTablesyncsReady())
> + return NULL;
> ...
> ...
> }
> 
> Can we move some of these starting checks to a separate function like
> canstartapplybgworker()?
Improved as suggested.

BTW, I rebased the temporary patch 0003 because one patch in thread [1] has
been committed (see commit b7658c24c7 in HEAD).

Attach the new patches.
Only changed patches 0001, 0004.

Regards,
Wang wei

Attachment
On Fri, Jun 17, 2022 at 12:47 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patches.
> Only changed patches 0001, 0004.
>

Few more comments on the previous version of patch:
===========================================
1.
+/*
+ * Count the number of registered (not necessarily running) apply background
+ * worker for a subscription.
+ */

/worker/workers

2.
+static void
+apply_bgworker_setup_dsm(ApplyBgworkerState *wstate)
+{
...
...
+ int64 queue_size = 160000000; /* 16 MB for now */

I think it would be better to use define for this rather than a
hard-coded value.
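
For example (name and value purely illustrative):

```
/* Illustrative only: replace the magic number with a named constant. */
#define APPLY_BGWORKER_QUEUE_SIZE	(16 * 1024 * 1024)	/* 16MB */
```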

3.
+/*
+ * Status for apply background worker.
+ */
+typedef enum ApplyBgworkerStatus
+{
+ APPLY_BGWORKER_ATTACHED = 0,
+ APPLY_BGWORKER_READY,
+ APPLY_BGWORKER_BUSY,
+ APPLY_BGWORKER_FINISHED,
+ APPLY_BGWORKER_EXIT
+} ApplyBgworkerStatus;

It would be better if you can add comments to explain each of these states.
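
For example, something like the below, where the per-state comments are only
my guesses at the intended meanings and should be corrected as appropriate:

```
typedef enum ApplyBgworkerStatus
{
	APPLY_BGWORKER_ATTACHED = 0,	/* attached to the shared memory queue */
	APPLY_BGWORKER_READY,			/* idle, ready to take a new transaction */
	APPLY_BGWORKER_BUSY,			/* applying a streamed transaction */
	APPLY_BGWORKER_FINISHED,		/* finished the assigned transaction */
	APPLY_BGWORKER_EXIT				/* worker has exited (e.g. on error) */
} ApplyBgworkerStatus;
```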

4.
+ /* Set up one message queue per worker, plus one. */
+ mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
+    (Size) queue_size);
+ shm_toc_insert(toc, APPLY_BGWORKER_KEY_MQ, mq);
+ shm_mq_set_sender(mq, MyProc);


I don't understand the meaning of 'plus one' in the above comment as
the patch seems to be setting up just one queue here?

5.
+
+ /* Attach the queues. */
+ wstate->mq_handle = shm_mq_attach(mq, seg, NULL);

Similar to above. If there is only one queue then the comment should
say queue instead of queues.

6.
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u", subid);
+ else
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication background apply worker for subscription %u ", subid);

No need for extra space after %u in the above code.

7.
+ launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid,
+ MySubscription->oid,
+ MySubscription->name,
+ MyLogicalRepWorker->userid,
+ InvalidOid,
+ dsm_segment_handle(wstate->dsm_seg));
+
+ if (launched)
+ {
+ /* Wait for worker to attach. */
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_ATTACHED);

In logicalrep_worker_launch(), we already seem to be waiting for
workers to attach via WaitForReplicationWorkerAttach(), so it is not
clear to me why we need to wait again? If there is a genuine reason
then it is better to add some comments to explain it. I think in some
way, we need to know if the worker is successfully attached and we may
not get that via WaitForReplicationWorkerAttach, so there needs to be
some way to know that but this doesn't sound like a very good idea. If
that understanding is correct then can we think of a better way?

8. I think we can simplify apply_bgworker_find_or_start by having
separate APIs for find and start. Most of the places need to use find
API except for the first stream. If we do that then I think you don't
need to make a hash entry unless we established ApplyBgworkerState
which currently looks odd as you need to remove the entry if we fail
to allocate the state.
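
Roughly something like this (names and call sites are illustrative only):

```
/* Sketch of the suggested split; not actual patch code. */
static ApplyBgworkerState *apply_bgworker_find(TransactionId xid);
static ApplyBgworkerState *apply_bgworker_start(TransactionId xid);

/* in apply_handle_stream_start(), roughly: */
if (first_segment)
	stream_apply_worker = apply_bgworker_start(stream_xid);
else
	stream_apply_worker = apply_bgworker_find(stream_xid);
```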

9.
+ /*
+ * TO IMPROVE: Do we need to display the apply background worker's
+ * information in pg_stat_replication ?
+ */
+ UpdateWorkerStats(last_received, send_time, false);

In this do you mean to say pg_stat_subscription? If so, then to decide
whether we need to update stats here we should see what additional
information we can update here which is not possible via the main
apply worker?

10.
ApplyBgworkerMain
{
...
+ /* Load the subscription into persistent memory context. */
+ ApplyContext = AllocSetContextCreate(TopMemoryContext,
...

This comment seems to be copied from ApplyWorkerMain but doesn't apply here.

-- 
With Regards,
Amit Kapila.



On Fri, Jun 17, 2022 at 12:47 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Jun 15, 2022 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Few questions/comments on 0001
> > ===========================
> Thanks for your comments.
>
> > 1.
> > In the commit message, I see: "We also need to allow stream_stop to
> > complete by the apply background worker to avoid deadlocks because
> > T-1's current stream of changes can update rows in conflicting order
> > with T-2's next stream of changes."
> >
> > Thinking about this, won't the T-1 and T-2 deadlock on the publisher
> > node as well if the above statement is true?
> Yes, I think so.
> I think if the table's unique indexes/constraints on the publisher and the
> subscriber are consistent, the deadlock will occur on the publisher side.
> If they are inconsistent, the deadlock may occur only on the subscriber. But
> since we added the check for these (see patch 0004), it seems okay not to
> handle this at STREAM_STOP.
>
> BTW, I made the following improvements to the code (#a, #c are improved in 0004
> patch, #b, #d and #e are improved in 0001 patch.) :
> a.
> I added some comments in the function apply_handle_stream_stop to explain why
> we do not need to allow stream_stop to complete by the apply background worker.
>

I have improved the comments in this and other related sections of the
patch. See attached.

>
>
> > 3.
> > +
> > +  <para>
> > +   Setting streaming mode to <literal>apply</literal> could export invalid LSN
> > +   as finish LSN of failed transaction. Changing the streaming mode and making
> > +   the same conflict writes the finish LSN of the failed transaction in the
> > +   server log if required.
> > +  </para>
> >
> > How will the user identify that this is an invalid LSN value and she
> > shouldn't use it to SKIP the transaction? Can we change the second
> > sentence to: "User should change the streaming mode to 'on' if they
> > would instead wish to see the finish LSN on error. Users can use
> > finish LSN to SKIP applying the transaction." I think we can give
> > reference to docs where the SKIP feature is explained.
> Improved the sentence as suggested.
>

You haven't answered the first part of the comment: "How will the user
identify that this is an invalid LSN value and she shouldn't use it to
SKIP the transaction?". Have you checked what value it displays? For
example, in one of the cases in apply_error_callback, as shown in the
code below, we don't even display the finish LSN if it is invalid.
else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
errcontext("processing remote data for replication origin \"%s\"
during \"%s\" in transaction %u",
   errarg->origin_name,
   logicalrep_message_type(errarg->command),
   errarg->remote_xid);

-- 
With Regards,
Amit Kapila.

Attachment
Here are some review comments for the v11-0001 patch.

(I will review the remaining patches 0002-0005 and post any comments later)

======

1. General

I still feel that 'apply' seems like a meaningless enum value for this
feature because from a user point-of-view every replicated change gets
"applied". IMO something like 'streaming = parallel' or 'streaming =
background' (etc) might have more meaning for a user.

======

2. Commit message

We also need to allow stream_stop to complete by the
apply background worker to avoid deadlocks because T-1's current stream of
changes can update rows in conflicting order with T-2's next stream of changes.

Did this mean to say?
"allow stream_stop to complete by" -> "allow stream_stop to be performed by"

~~~

3. Commit message

This patch also extends the SUBSCRIPTION 'streaming' option so that the user
can control whether to apply the streaming transaction in an apply background
worker or spill the change to disk. User can set the streaming option to
'on/off', 'apply'. For now, 'apply' means the streaming will be applied via a
apply background worker if available. 'on' means the streaming transaction will
be spilled to disk.

3a.
"option" -> "parameter" (2x)

3b.
"User can" -> "The user can"

3c.
I think this part should also mention that the stream parameter
default is unchanged...

======

4. doc/src/sgml/config.sgml

+       <para>
+        Maximum number of apply background workers per subscription. This
+        parameter controls the amount of parallelism of the streaming of
+        in-progress transactions if we set subscription option
+        <literal>streaming</literal> to <literal>apply</literal>.
+       </para>

"if we set subscription option <literal>streaming</literal> to
<literal>apply</literal>." -> "when subscription parameter
 <literal>streaming = apply</literal>."

======

5. doc/src/sgml/config.sgml

+  <para>
+   Setting streaming mode to <literal>apply</literal> could export invalid LSN
+   as finish LSN of failed transaction. Changing the streaming mode and making
+   the same conflict writes the finish LSN of the failed transaction in the
+   server log if required.
+  </para>

This text made no sense to me. Can you reword it?

IIUC it means something like this:
When the streaming mode is 'apply', the finish LSN of failed
transactions may not be logged. In that case, it may be necessary to
change the streaming mode and cause the same conflicts again so the
finish LSN of the failed transaction will be written to the server
log.

======

6. doc/src/sgml/protocol.sgml

Since there are protocol changes made here, shouldn’t there also be
some corresponding LOGICALREP_PROTO_XXX constants and special checking
added in the worker.c?

======

7. doc/src/sgml/ref/create_subscription.sgml

+          for this subscription.  The default value is <literal>off</literal>,
+          all transactions are fully decoded on the publisher and only then
+          sent to the subscriber as a whole.
+         </para>

SUGGESTION
The default value is off, meaning all transactions are fully decoded
on the publisher and only then sent to the subscriber as a whole.

~~~

8. doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>on</literal>, the changes of transaction are
+          written to temporary files and then applied at once after the
+          transaction is committed on the publisher.
+         </para>

SUGGESTION
If set to on, the incoming changes are written to a temporary file and
then applied only after the transaction is committed on the publisher.

~~~

9.  doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>apply</literal> incoming
+          changes are directly applied via one of the background workers, if
+          available. If no background worker is free to handle streaming
+          transaction then the changes are written to a file and applied after
+          the transaction is committed. Note that if an error happens when
+          applying changes in a background worker, it might not report the
+          finish LSN of the remote transaction in the server log.
          </para>

SUGGESTION
If set to apply, the incoming changes are directly applied via one of
the apply background workers, if available. If no background worker is
free to handle streaming transactions then the changes are written to
a file and applied after the transaction is committed. Note that if an
error happens when applying changes in a background worker, the finish
LSN of the remote transaction might not be reported in the server log.

======

10. src/backend/access/transam/xact.c

@@ -1741,6 +1742,13 @@ RecordTransactionAbort(bool isSubXact)
  elog(PANIC, "cannot abort transaction %u, it was already committed",
  xid);

+ /*
+ * Are we using the replication origins feature?  Or, in other words,
+ * are we replaying remote actions?
+ */
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+   replorigin_session_origin != DoNotReplicateId);
+
  /* Fetch the data we need for the abort record */
  nrels = smgrGetPendingDeletes(false, &rels);
  nchildren = xactGetCommittedChildren(&children);
@@ -1765,6 +1773,11 @@ RecordTransactionAbort(bool isSubXact)
     MyXactFlags, InvalidTransactionId,
     NULL);

+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+    XactLastRecEnd);
+

I did not see any reason why the code assigning the 'replorigin' and
the code checking the 'replorigin' are separated like they are. I
thought these 2 new code fragments should be kept together. Perhaps it
was decided this assignment must be outside the critical section? But
if that’s the case maybe a comment explaining so would be good.

~~~

11. src/backend/access/transam/xact.c

+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+

The positioning of that comment is unusual. Maybe better before the check?

======

12. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+ /*
+ * If no parameter given, assume "true" is meant.
+ */
+ if (def->arg == NULL)
+ return SUBSTREAM_ON;

SUGGESTION for comment
If the streaming parameter is given but no parameter value is
specified, then assume "true" is meant.

~~~

13. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+ /*
+ * Allow 0, 1, "true", "false", "on", "off" or "apply".
+ */

IMO these should be in a order consistent with the code.

SUGGESTION
Allow 0, 1, "false", "true", "off", "on", or "apply".

======

14. src/backend/replication/logical/Makefile

- worker.o
+ worker.o \
+ applybgwroker.o

typo "applybgwroker" -> "applybgworker"

======

15. .../replication/logical/applybgwroker.c

+/*-------------------------------------------------------------------------
+ * applybgwroker.c
+ *     Support routines for applying xact by apply background worker
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *   src/backend/replication/logical/applybgwroker.c

15a.
Typo in filename: "applybgwroker" -> "applybgworker"

15b.
Typo in file header comment: "applybgwroker" -> "applybgworker"

~~~

16. .../replication/logical/applybgwroker.c

+/*
+ * entry for a hash table we use to map from xid to our apply background worker
+ * state.
+ */
+typedef struct ApplyBgworkerEntry

Comment should start uppercase.

~~~

17. .../replication/logical/applybgwroker.c

+/*
+ * Fields to record the share informations between main apply worker and apply
+ * background worker.
+ */

SUGGESTION
Information shared between main apply worker and apply background worker.

~~~

18.  .../replication/logical/applybgwroker.c

+/* apply background worker setup */
+static ApplyBgworkerState *apply_bgworker_setup(void);
+static void apply_bgworker_setup_dsm(ApplyBgworkerState *wstate);

IMO there was not really any need for this comment – these are just
function forward declares.

~~~

19.   .../replication/logical/applybgwroker.c - find_or_start_apply_bgworker

+ if (found)
+ {
+ entry->wstate->pstate->status = APPLY_BGWORKER_BUSY;
+ return entry->wstate;
+ }
+ else if (!start)
+ return NULL;

I felt this might be more readable without the else:

if (found)
{
entry->wstate->pstate->status = APPLY_BGWORKER_BUSY;
return entry->wstate;
}
Assert(!found);
if (!start)
return NULL;

~~~

20. .../replication/logical/applybgwroker.c - find_or_start_apply_bgworker

+ /*
+ * Now, we try to get a apply background worker. If there is at least one
+ * worker in the idle list, then take one. Otherwise, we try to start a
+ * new apply background worker.
+ */

20a.
"a apply" -> "an apply"

20b.
IMO it's better to call this the free list (not the idle list)

~~~

21. .../replication/logical/applybgwroker.c - find_or_start_apply_bgworker

+ /*
+ * If the apply background worker cannot be launched, remove entry
+ * in hash table.
+ */

"remove entry in hash table" -> "remove the entry from the hash table"

~~~

22. .../replication/logical/applybgwroker.c - apply_bgworker_free

+/*
+ * Add the worker to the free list and remove the entry from hash table.
+ */

"from hash table" -> "from the hash table"

~~~

23. .../replication/logical/applybgwroker.c - apply_bgworker_free

+ elog(DEBUG1, "adding finished apply worker #%u for xid %u to the idle list",
+ wstate->pstate->n, wstate->pstate->stream_xid);

IMO it's better to call this the free list (not the idle list)

~~~

24. .../replication/logical/applybgwroker.c - LogicalApplyBgwLoop

+/* Apply Background Worker main loop */
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *pst)

Why is the name inconsistent with other function names in the file?
Should it be apply_bgworker_loop?

~~~

25. .../replication/logical/applybgwroker.c - LogicalApplyBgwLoop

+ /*
+ * Push apply error context callback. Fields will be filled during
+ * applying a change.
+ */

"during" -> "when"

~~~

26. .../replication/logical/applybgwroker.c - LogicalApplyBgwLoop

+ /*
+ * We use first byte of message for additional communication between
+ * main Logical replication worker and apply bgworkers, so if it
+ * differs from 'w', then process it first.
+ */

"bgworkers" -> "background workers"

~~~

27. .../replication/logical/applybgwroker.c - ApplyBgwShutdown

For consistency should it be called apply_bgworker_shutdown?

~~~

28. .../replication/logical/applybgwroker.c - LogicalApplyBgwMain

For consistency should it be called apply_bgworker_main?

~~~

29. .../replication/logical/applybgwroker.c - apply_bgworker_check_status

+ errdetail("Cannot handle streamed replication transaction by apply "
+    "bgworkers until all tables are synchronized")));

"bgworkers" -> "background workers"

======

30. src/backend/replication/logical/decode.c

@@ -651,9 +651,10 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
  {
  for (i = 0; i < parsed->nsubxacts; i++)
  {
- ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr,
+ commit_time);
  }
- ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr, commit_time);

ReorderBufferForget was declared with 'abort_time' param. So it makes
these calls a bit confusing looking to be passing 'commit_time'

Maybe better to do like below and pass 'forget_time' (inside that
'if') along with an explanatory comment:

TimestampTz forget_time = commit_time;
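
Or, spelled out a bit more (only a sketch; I am not wedded to the exact
comment wording):

{
    /*
     * These transactions are only being forgotten (skipped), not aborted,
     * so use the commit timestamp of this record as the forget time.
     */
    TimestampTz forget_time = commit_time;

    for (i = 0; i < parsed->nsubxacts; i++)
        ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr,
                            forget_time);
    ReorderBufferForget(ctx->reorder, xid, buf->origptr, forget_time);
}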

======

31. src/backend/replication/logical/launcher.c - logicalrep_worker_find

+ /* We only need main apply worker or table sync worker here */

"need" -> "are interested in the"

~~~

32. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

+ if (!is_subworker)
+ snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyWorkerMain");
+ else
+ snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyBgworkerMain");

IMO better to reverse this and express the condition as 'if (is_subworker)'

~~~

33. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

+ else if (!is_subworker)
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u", subid);
+ else
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication background apply worker for subscription %u ", subid);

33a.
Ditto. IMO better to reverse this and express the condition as 'if
(is_subworker)'

33b.
"background apply worker" -> "apply background worker"

~~~

34. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

IMO this code logic should be rewritten to be simpler to have a common
LWLockRelease. This also makes the code more like
logicalrep_worker_detach, which seems like a good thing.

SUGGESTION
logicalrep_worker_stop(Oid subid, Oid relid)
{
LogicalRepWorker *worker;

LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

worker = logicalrep_worker_find(subid, relid, false);

if (worker)
    logicalrep_worker_stop_internal(worker);

LWLockRelease(LogicalRepWorkerLock);
}

~~~

35. src/backend/replication/logical/launcher.c -
logicalrep_apply_background_worker_count

+/*
+ * Count the number of registered (not necessarily running) apply background
+ * worker for a subscription.
+ */

"worker" -> "workers"

~~~

36. src/backend/replication/logical/launcher.c -
logicalrep_apply_background_worker_count

+ int res = 0;
+

A better variable name here would be 'count', or even 'n'.

======

36. src/backend/replication/logical/origin.c

+ * However, If must_acquire is false, we allow process to get the slot which is
+ * already acquired by other process.

SUGGESTION
However, if the function parameter 'must_acquire' is false, we allow
the process to use the same slot already acquired by another process.

~~~

37. src/backend/replication/logical/origin.c

+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("could not find correct replication state slot for
replication origin with OID %u for apply background worker",
+ node),
+ errhint("There is no replication state slot set by its main apply worker.")));

37a.
Somehow, I felt the errmsg and the errhint could be clearer. Maybe like this?

" apply background worker could not find replication state slot for
replication origin with OID %u",

"There is no replication state slot set by the main apply worker."

37b.
Also, I think that generally the 'errhint' gives some advice or some
action that the user can take to fix the problem. But is this errhint
actually saying anything useful for the user? Perhaps you meant to say
'errdetail' here?

======

38. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

+ /*
+ * If the version of the publisher is lower than the version of the
+ * subscriber, it may not support sending these two fields, so only take
+ * these fields when include_abort_lsn is true.
+ */
+ if (include_abort_lsn)
+ {
+ abort_data->abort_lsn = pq_getmsgint64(in);
+ abort_data->abort_time = pq_getmsgint64(in);
+ }
+ else
+ {
+ abort_data->abort_lsn = InvalidXLogRecPtr;
+ abort_data->abort_time = 0;
+ }

This comment is documenting a decision that was made elsewhere.

But it somehow feels wrong to me that the decision to read or not read
the abort time/lsn is made by the caller of this function. IMO it
might make more sense if the server version was simply passed as a
param and then this function can be in control of its own destiny and
make the decision does it need to read those extra fields or not. An
extra member flag can be added to LogicalRepStreamAbortData to
indicate if abort_data read these values or not.
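
For example, roughly like this (the flag name 'abort_info_present' and the
exact version check are only placeholders to illustrate the idea; the
xid/subxid reads are assumed to stay as they are in the patch):

void
logicalrep_read_stream_abort(StringInfo in,
                             LogicalRepStreamAbortData *abort_data,
                             int server_version)
{
    abort_data->xid = pq_getmsgint(in, 4);
    abort_data->subxid = pq_getmsgint(in, 4);

    /* Older publishers do not send the abort LSN and abort time. */
    abort_data->abort_info_present = (server_version >= 150000);

    if (abort_data->abort_info_present)
    {
        abort_data->abort_lsn = pq_getmsgint64(in);
        abort_data->abort_time = pq_getmsgint64(in);
    }
    else
    {
        abort_data->abort_lsn = InvalidXLogRecPtr;
        abort_data->abort_time = 0;
    }
}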

======

39. src/backend/replication/logical/worker.c

  * Streamed transactions (large transactions exceeding a memory limit on the
- * upstream) are not applied immediately, but instead, the data is written
- * to temporary files and then applied at once when the final commit arrives.
+ * upstream) are applied via one of two approaches.

"via" -> "using"

~~~

40.  src/backend/replication/logical/worker.c

+ * Assign a new apply background worker (if available) as soon as the xact's
+ * first stream is received and the main apply worker will send changes to this
+ * new worker via shared memory. We keep this worker assigned till the
+ * transaction commit is received and also wait for the worker to finish at
+ * commit. This preserves commit ordering and avoids writing to and reading
+ * from file in most cases. We still need to spill if there is no worker
+ * available. We also need to allow stream_stop to complete by the background
+ * worker to avoid deadlocks because T-1's current stream of changes can update
+ * rows in conflicting order with T-2's next stream of changes.

40a.
"and the main apply -> ". The main apply"

40b.
"and avoids writing to and reading from file in most cases." -> "and
avoids file I/O in most cases."

40c.
"We still need to spill if" -> "We still need to spill to a file if"

40d.
"We also need to allow stream_stop to complete by the background
worker" -> "We also need to allow stream_stop to be performed by the
background worker"

~~~

41.  src/backend/replication/logical/worker.c

-static ApplyErrorCallbackArg apply_error_callback_arg =
+ApplyErrorCallbackArg apply_error_callback_arg =
 {
  .command = 0,
  .rel = NULL,
@@ -242,7 +246,7 @@ static ApplyErrorCallbackArg apply_error_callback_arg =
  .origin_name = NULL,
 };

Maybe it is still a good idea to at least keep the old comment here:
/* Struct for saving and restoring apply errcontext information */

~~

42.  src/backend/replication/logical/worker.c

+/* check if we are applying the transaction in apply background worker */
+#define apply_bgworker_active() (in_streamed_transaction &&
stream_apply_worker != NULL)

42a.
Uppercase comment.

42b.
"in apply background worker" -> "in apply background worker"

~~~

43.  src/backend/replication/logical/worker.c  - handle_streamed_transaction

@@ -426,41 +437,76 @@ end_replication_step(void)
 }

 /*
- * Handle streamed transactions.
+ * Handle streamed transactions for both main apply worker and apply background
+ * worker.
  *
- * If in streaming mode (receiving a block of streamed transaction), we
- * simply redirect it to a file for the proper toplevel transaction.
+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, we simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_APPLY mode, we send the changes to background
+ * apply worker (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes will
+ * also be applied in main apply worker).
  *
- * Returns true for streamed transactions, false otherwise (regular mode).
+ * For non-streamed transactions, returns false;
+ * For streamed transactions, returns true if in main apply worker (except we
+ * apply streamed transaction in "apply" mode and address
+ * LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes), false otherwise.
  */

Maybe it is accurate (I don’t know), but this header comment seems
excessively complicated with so many quirks about when to return
true/false. Can it be reworded into plainer language?

~~~

44.  src/backend/replication/logical/worker.c - handle_streamed_transaction

Because there are so many returns for each of these conditions,
consider refactoring the logic to change all the if/else to just be
"if"; then you can comment each separate case better. I think it may
be clearer.

SUGGESTION

/* This is the apply background worker */
if (am_apply_bgworker())
{
...
return false;
}

/* This is the main apply, but there is an apply background worker */
if (apply_bgworker_active())
{
...
return true;
}

/* This is the main apply, and there is no apply background worker */
...
return true;

~~~

45.  src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /*
+ * This is the main apply worker. Check if we are processing this
+ * transaction in a apply background worker.
+ */
+ if (wstate)

I think the part that says "This is the main apply worker" should be
at the top of the 'else'

~~~

46.  src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /*
+ * This is the main apply worker and the transaction has been
+ * serialized to file, replay all the spooled operations.
+ */

SUGGESTION
The transaction has been serialized to file. Replay all the spooled operations.

~~~

47.  src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /* unlink the files with serialized changes and subxact info. */
+ stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);

Start comment with capital letter.

~~~

48.  src/backend/replication/logical/worker.c - apply_handle_stream_start

+ /* If we are in a apply background worker, begin the transaction */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();

The "if we are" part of the comment is not needed because the fact the
code is inside am_apply_bgworker() makes this obvious anyway/

~~~

49.  src/backend/replication/logical/worker.c - apply_handle_stream_start

+ /* open the spool file for this transaction */
+ stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+

Start the comment uppercase.

+ /* if this is not the first segment, open existing subxact file */
+ if (!first_segment)
+ subxact_info_read(MyLogicalRepWorker->subid, stream_xid);

Start the comment uppercase.

~~~

50.  src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ include_abort_lsn = MyParallelState->server_version >= 150000;
+
+ logicalrep_read_stream_abort(s, &abort_data, include_abort_lsn);

Here is where I felt maybe just the server version could be passed so
that logicalrep_read_stream_abort could decide itself what message
parts needed to be read. Basically it seems strange that the message
contains parts which might not be read. I felt it is better to always
read the whole message, and then later you can choose what parts you
are interested in.

~~~

51.  src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /*
+ * This is the main apply worker. Check if we are processing this
+ * transaction in a apply background worker.
+ */

+ /*
+ * We are in main apply worker and the transaction has been serialized
+ * to file.
+ */

51a.
I thought the "This is the main apply worker" and "We are in main
apply worker" should just be be a comment top of this "else"

51b.
"a apply worker" -> "an apply worker"

51c.
There seems to be a missing comment to say that this logic is telling
the bgworker to abort and then waiting for it to do so.

~~~

52. src/backend/replication/logical/worker.c - apply_handle_stream_commit

I did not really understand why the patch relocates this function to
another place in the file. Can't it be left in the same place?

~~~

53. src/backend/replication/logical/worker.c - apply_handle_stream_commit

+ /*
+ * This is the main apply worker. Check if we are processing this
+ * transaction in an apply background worker.
+ */

I thought the top of the else should just say "This is the main apply worker."

Then the if (wstate) part should say "Check if we are processing this
transaction in an apply background worker, and if so tell it to
commit the message".

~~~

54. src/backend/replication/logical/worker.c - apply_handle_stream_commit

+ /*
+ * This is the main apply worker and the transaction has been
+ * serialized to file, replay all the spooled operations.
+ */

SUGGESTION
The transaction has been serialized to file, so replay all the spooled
operations.

~~~

55. src/backend/replication/logical/worker.c - apply_handle_stream_commit

+ /* unlink the files with serialized changes and subxact info */
+ stream_cleanup_files(MyLogicalRepWorker->subid, xid);

Uppercase comment.

======

56. src/backend/utils/misc/guc.c

@@ -3220,6 +3220,18 @@ static struct config_int ConfigureNamesInt[] =
  NULL, NULL, NULL
  },

+ {
+ {"max_apply_bgworkers_per_subscription",
+ PGC_SIGHUP,
+ REPLICATION_SUBSCRIBERS,
+ gettext_noop("Maximum number of apply backgrand workers per subscription."),
+ NULL,
+ },
+ &max_apply_bgworkers_per_subscription,
+ 3, 0, MAX_BACKENDS,
+ NULL, NULL, NULL
+ },
+

"backgrand" -> "background"

======

57. src/include/catalog/pg_subscription.h

@@ -109,7 +110,7 @@ typedef struct Subscription
  bool enabled; /* Indicates if the subscription is enabled */
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
- bool stream; /* Allow streaming in-progress transactions. */
+ char stream; /* Allow streaming in-progress transactions. */
  char twophasestate; /* Allow streaming two-phase transactions */
  bool disableonerr; /* Indicates if the subscription should be
  * automatically disabled if a worker error

I felt probably this 'stream' comment should be the same as for 'substream'.

======

58. src/include/replication/worker_internal.h

+/*
+ * Shared information among apply workers.
+ */
+typedef struct ApplyBgworkerShared

SUGGESTION (maybe you can do better than this)
Struct for sharing information between apply main and apply background workers.

~~~

59. src/include/replication/worker_internal.h

+ /* Status for apply background worker. */
+ ApplyBgworkerStatus status;

"Status for" -> "Status of"

~~~

60. src/include/replication/worker_internal.h

+extern PGDLLIMPORT MemoryContext ApplyMessageContext;
+
+extern PGDLLIMPORT ApplyErrorCallbackArg apply_error_callback_arg;
+
+extern PGDLLIMPORT bool MySubscriptionValid;
+
+extern PGDLLIMPORT volatile ApplyBgworkerShared *MyParallelState;
+extern PGDLLIMPORT List *subxactlist;
+

I did not recognise the significance of why the last 2 externs are
grouped together but the others are not.

~~~

61. src/include/replication/worker_internal.h

+/* prototype needed because of stream_commit */
+extern void apply_dispatch(StringInfo s);

61a.
I was unsure if this comment is useful to anyone...

61b.
If you decide to keep it, please use uppercase.

~~~

62. src/include/replication/worker_internal.h

+/* apply background worker setup and interactions */
+extern ApplyBgworkerState *apply_bgworker_find_or_start(TransactionId xid,
+ bool start);

Uppercase comment.

======

63.

I also did a quick check of all the new debug logging added. Here is
everything from patch v11-0001.

apply_bgworker_free:
+ elog(DEBUG1, "adding finished apply worker #%u for xid %u to the idle list",
+ wstate->pstate->n, wstate->pstate->stream_xid);

LogicalApplyBgwLoop:
+ elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk,"
+ "waiting on shm_mq_receive", pst->n);

+ elog(DEBUG1, "[Apply BGW #%u] exiting", pst->n);

ApplyBgworkerMain:
+ elog(DEBUG1, "[Apply BGW #%u] started", pst->n);

apply_bgworker_setup:
+ elog(DEBUG1, "setting up apply worker #%u",
list_length(ApplyWorkersList) + 1);

apply_bgworker_set_status:
+ elog(DEBUG1, "[Apply BGW #%u] set status to %d", MyParallelState->n, status);

apply_bgworker_subxact_info_add:
+ elog(DEBUG1, "[Apply BGW #%u] defining savepoint %s",
+ MyParallelState->n, spname);

apply_handle_stream_prepare:
+ elog(DEBUG1, "received prepare for streamed transaction %u",
+ prepare_data.xid);

apply_handle_stream_start:
+ elog(DEBUG1, "starting streaming of xid %u", stream_xid);

apply_handle_stream_stop:
+ elog(DEBUG1, "stopped streaming of xid %u, %u changes streamed",
stream_xid, nchanges);

apply_handle_stream_abort:
+ elog(DEBUG1, "[Apply BGW #%u] aborting current transaction xid=%u, subxid=%u",
+ MyParallelState->n, GetCurrentTransactionIdIfAny(),
+ GetCurrentSubTransactionId());

+ elog(DEBUG1, "[Apply BGW #%u] rolling back to savepoint %s",
+ MyParallelState->n, spname);

apply_handle_stream_commit:
+ elog(DEBUG1, "received commit for streamed transaction %u", xid);


Observations:

63a.
Every newly introduced message is at level DEBUG1 (not DEBUG). AFAIK
this is OK, because the messages are all protocol related and every
other existing debug message in the current replication worker.c is
also at the same DEBUG1 level.

63b.
The prefix "[Apply BGW #%u]" is used to indicate the bgworker is
executing the code, but it does not seem to be used 100% consistently
- e.g. there are some apply_bgworker_XXX functions not using this
prefix. Is that OK or a mistake?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Jun 21, 2022 at 7:11 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for the v11-0001 patch.
>
> (I will review the remaining patches 0002-0005 and post any comments later)
>
> ======
>
> 1. General
>
> I still feel that 'apply' seems like a meaningless enum value for this
> feature because from a user point-of-view every replicated change gets
> "applied". IMO something like 'streaming = parallel' or 'streaming =
> background' (etc) might have more meaning for a user.
>

+1. I would prefer 'streaming = parallel' as that suits better here
because we allow the streams (sets of changes) of a transaction to be
applied in parallel to other transactions, or in parallel to a stream
of changes from another streaming transaction.

> ======
>
> 10. src/backend/access/transam/xact.c
>
> @@ -1741,6 +1742,13 @@ RecordTransactionAbort(bool isSubXact)
>   elog(PANIC, "cannot abort transaction %u, it was already committed",
>   xid);
>
> + /*
> + * Are we using the replication origins feature?  Or, in other words,
> + * are we replaying remote actions?
> + */
> + replorigin = (replorigin_session_origin != InvalidRepOriginId &&
> +   replorigin_session_origin != DoNotReplicateId);
> +
>   /* Fetch the data we need for the abort record */
>   nrels = smgrGetPendingDeletes(false, &rels);
>   nchildren = xactGetCommittedChildren(&children);
> @@ -1765,6 +1773,11 @@ RecordTransactionAbort(bool isSubXact)
>      MyXactFlags, InvalidTransactionId,
>      NULL);
>
> + if (replorigin)
> + /* Move LSNs forward for this replication origin */
> + replorigin_session_advance(replorigin_session_origin_lsn,
> +    XactLastRecEnd);
> +
>
> I did not see any reason why the code assigning the 'replorigin' and
> the code checking the 'replorigin' are separated like they are. I
> thought these 2 new code fragments should be kept together. Perhaps it
> was decided this assignment must be outside the critical section? But
> if that’s the case maybe a comment explaining so would be good.
>

I also don't see any particular reason for this apart from being
similar to RecordTransactionCommit(). I think it should be fine either
way.

> ~~~
>
> 11. src/backend/access/transam/xact.c
>
> + if (replorigin)
> + /* Move LSNs forward for this replication origin */
> + replorigin_session_advance(replorigin_session_origin_lsn,
> +
>
> The positioning of that comment is unusual. Maybe better before the check?
>

This again seems to be due to a similar code in
RecordTransactionCommit(). I would suggest let's keep the code
consistent.

--
With Regards,
Amit Kapila.



FYI - the latest patch set v12* on this thread no longer applies.

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
v12-0003-A-temporary-patch-that-includes-patch-in-another.patch
error: patch failed: src/backend/replication/logical/relation.c:307
error: src/backend/replication/logical/relation.c: patch does not apply
error: patch failed: src/backend/replication/logical/worker.c:2358
error: src/backend/replication/logical/worker.c: patch does not apply
error: patch failed: src/test/subscription/t/013_partition.pl:868
error: src/test/subscription/t/013_partition.pl: patch does not apply
[postgres@CentOS7-x64 oss_postgres_misc]$

~~

I know the v12-0003 was meant as just a temporary patch for something
that may now already be pushed, but it cannot simply be skipped either
because then v12-0004 will also fail.

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
v12-0004-Add-some-checks-before-using-apply-background-wo.patch
error: patch failed: src/backend/replication/logical/relation.c:433
error: src/backend/replication/logical/relation.c: patch does not apply
error: patch failed: src/backend/replication/logical/worker.c:2403
error: src/backend/replication/logical/worker.c: patch does not apply
[postgres@CentOS7-x64 oss_postgres_misc]$

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Here are some review comments for v12-0002

======

1. Commit message

"streaming" option -> "streaming" parameter

~~~

2. General (every file in this patch)

"streaming" option -> "streaming" parameter

~~~

3. .../subscription/t/022_twophase_cascade.pl

For every test file in this patch the new function is passed $is_apply
= 0/1 to indicate whether to use the 'on' or 'apply' parameter value. But
in this test file the parameter is passed as $streaming_mode = 'on'/'apply'.

I was wondering if (for the sake of consistency) it might be better to
use the same parameter kind for all of the test files. Actually, I
don't care if you choose to do nothing and leave this as-is; I am just
posting this review comment in case it was not a deliberate decision
to implement them differently.

e.g.
+ my ($node_publisher, $node_subscriber, $appname, $is_apply) = @_;

versus
+ my ($node_A, $node_B, $node_C, $appname_B, $appname_C, $streaming_mode) =
+   @_;

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From: "wangw.fnst@fujitsu.com"
On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have improved the comments in this and other related sections of the
> patch. See attached.
Thanks for your comments and patch!
Improved the comments as you suggested.

> > > 3.
> > > +
> > > +  <para>
> > > +   Setting streaming mode to <literal>apply</literal> could export invalid
> LSN
> > > +   as finish LSN of failed transaction. Changing the streaming mode and
> making
> > > +   the same conflict writes the finish LSN of the failed transaction in the
> > > +   server log if required.
> > > +  </para>
> > >
> > > How will the user identify that this is an invalid LSN value and she
> > > shouldn't use it to SKIP the transaction? Can we change the second
> > > sentence to: "User should change the streaming mode to 'on' if they
> > > would instead wish to see the finish LSN on error. Users can use
> > > finish LSN to SKIP applying the transaction." I think we can give
> > > reference to docs where the SKIP feature is explained.
> > Improved the sentence as suggested.
> >
> 
> You haven't answered first part of the comment: "How will the user
> identify that this is an invalid LSN value and she shouldn't use it to
> SKIP the transaction?". Have you checked what value it displays? For
> example, in one of the case in apply_error_callback as shown in below
> code, we don't even display finish LSN if it is invalid.
> else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" in transaction %u",
>    errarg->origin_name,
>    logicalrep_message_type(errarg->command),
>    errarg->remote_xid);
I am sorry that I missed something in my previous reply.
The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
Here is an example :
```
2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:  processing remote data for replication origin
"pg_16389" during "INSERT" for replication target relation "public.tab" in transaction 727 finished at 0/0
 
```
So I tried to improve the sentence in the pg-docs by changing it from
```
Setting streaming mode to <literal>apply</literal> could export invalid LSN as
finish LSN of failed transaction.
```
to 
```
Setting streaming mode to <literal>apply</literal> could export invalid LSN
(0/0) as finish LSN of failed transaction.
```

I also improved the patches as you suggested in [1]:
> 1.
> +/*
> + * Count the number of registered (not necessarily running) apply background
> + * worker for a subscription.
> + */
> 
> /worker/workers
Improved as suggested.

> 2.
> +static void
> +apply_bgworker_setup_dsm(ApplyBgworkerState *wstate)
> +{
> ...
> ...
> + int64 queue_size = 160000000; /* 16 MB for now */
> 
> I think it would be better to use define for this rather than a
> hard-coded value.
Improved as suggested.
Added a macro like this:
```
/* queue size of DSM, 16 MB for now. */
#define DSM_QUEUE_SIZE    160000000
```

> 3.
> +/*
> + * Status for apply background worker.
> + */
> +typedef enum ApplyBgworkerStatus
> +{
> + APPLY_BGWORKER_ATTACHED = 0,
> + APPLY_BGWORKER_READY,
> + APPLY_BGWORKER_BUSY,
> + APPLY_BGWORKER_FINISHED,
> + APPLY_BGWORKER_EXIT
> +} ApplyBgworkerStatus;
> 
> It would be better if you can add comments to explain each of these states.
Improved as suggested.
Added the comments like below:
```
APPLY_BGWORKER_BUSY = 0,            /* assigned to a transaction */
APPLY_BGWORKER_FINISHED,        /* transaction is completed */
APPLY_BGWORKER_EXIT                /* exit */
```
In addition, after improving point #7 as you suggested, I removed
"APPLY_BGWORKER_ATTACHED". And I removed "APPLY_BGWORKER_READY" in v12.

> 4.
> + /* Set up one message queue per worker, plus one. */
> + mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
> +    (Size) queue_size);
> + shm_toc_insert(toc, APPLY_BGWORKER_KEY_MQ, mq);
> + shm_mq_set_sender(mq, MyProc);
> 
> 
> I don't understand the meaning of 'plus one' in the above comment as
> the patch seems to be setting up just one queue here?
Yes, you are right. Improved as below:
```
/* Set up message queue for the worker. */
```

> 5.
> +
> + /* Attach the queues. */
> + wstate->mq_handle = shm_mq_attach(mq, seg, NULL);
> 
> Similar to above. If there is only one queue then the comment should
> say queue instead of queues.
Improved as suggested.

> 6.
>   snprintf(bgw.bgw_name, BGW_MAXLEN,
>   "logical replication worker for subscription %u", subid);
> + else
> + snprintf(bgw.bgw_name, BGW_MAXLEN,
> + "logical replication background apply worker for subscription %u ", subid);
> 
> No need for extra space after %u in the above code.
Improved as suggested.

> 7.
> + launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> + MySubscription->oid,
> + MySubscription->name,
> + MyLogicalRepWorker->userid,
> + InvalidOid,
> + dsm_segment_handle(wstate->dsm_seg));
> +
> + if (launched)
> + {
> + /* Wait for worker to attach. */
> + apply_bgworker_wait_for(wstate, APPLY_BGWORKER_ATTACHED);
> 
> In logicalrep_worker_launch(), we already seem to be waiting for
> workers to attach via WaitForReplicationWorkerAttach(), so it is not
> clear to me why we need to wait again? If there is a genuine reason
> then it is better to add some comments to explain it. I think in some
> way, we need to know if the worker is successfully attached and we may
> not get that via WaitForReplicationWorkerAttach, so there needs to be
> some way to know that but this doesn't sound like a very good idea. If
> that understanding is correct then can we think of a better way?
Improved the related logic.
The reason we waited again here in the previous version was to wait for the
apply bgworker to attach to the memory queue, which the function
WaitForReplicationWorkerAttach could not do.
Now, to improve this, we invoke the function logicalrep_worker_attach after
attaching to the memory queue instead of before.
Also, to make sure the worker has not died due to an error or some other
reason, I modified the function logicalrep_worker_launch and the function
WaitForReplicationWorkerAttach. And then, we can judge whether the worker
started successfully or died according to the return value of the function
logicalrep_worker_launch.
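
In other words, the new startup order in the apply background worker is roughly
the following (just a simplified sketch of the ordering described above; the
variable names are placeholders, not the patch's exact code):
```
mqh = shm_mq_attach(mq, seg, NULL);     /* attach to the shared message queue first */
logicalrep_worker_attach(worker_slot);  /* only then attach the worker slot */
```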

> 8. I think we can simplify apply_bgworker_find_or_start by having
> separate APIs for find and start. Most of the places need to use find
> API except for the first stream. If we do that then I think you don't
> need to make a hash entry unless we established ApplyBgworkerState
> which currently looks odd as you need to remove the entry if we fail
> to allocate the state.
Improved as suggested.

> 9.
> + /*
> + * TO IMPROVE: Do we need to display the apply background worker's
> + * information in pg_stat_replication ?
> + */
> + UpdateWorkerStats(last_received, send_time, false);
> 
> In this do you mean to say pg_stat_subscription? If so, then to decide
> whether we need to update stats here we should see what additional
> information we can update here which is not possible via the main
> apply worker?
Yes, it should be pg_stat_subscription. I think we do not need to update these
statistics here.
The messages received in the function LogicalApplyBgwLoop in the apply bgworker
have already been handled in the function LogicalRepApplyLoop in the apply
worker, so these statistics have already been updated there. (see function
LogicalRepApplyLoop)

> 10.
> ApplyBgworkerMain
> {
> ...
> + /* Load the subscription into persistent memory context. */
> + ApplyContext = AllocSetContextCreate(TopMemoryContext,
> ...
> 
> This comment seems to be copied from ApplyWorkerMain but doesn't apply
> here.
Yes, you are right. Improved as below:
```
/* Init the memory context for the apply background worker to work in. */
```

In addition, I also tried to improve the patches in the following ways:
a.
In the function apply_handle_stream_abort, when invoking the function
set_apply_error_context_xact, I forgot to change the second input parameter.
So I changed "InvalidXLogRecPtr" to "abort_lsn".
b.
Improved the function name from "canstartapplybgworker" to
"apply_bgworker_can_start".
c.
Detach the dsm segment if we fail to launch an apply bgworker. (see function
apply_bgworker_setup)

BTW, I deleted the temporary patch 0003 (v12) and rebased the patches because
of the commits 26b3455afa and ac0e2d387a in HEAD.
And now, I am improving the patches as suggested by Peter-san in [3]. I will
send new patches soon.

Attach the new patches.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BQQHGb0afmM_Cf2qu%3DUJoCnvs3VcZ%2B1xTiySx205fU1w%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB6275208A2F8ED832710F65E09EA49%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[3] - https://www.postgresql.org/message-id/CAHut%2BPtu_eWOVWAKrwkUFdTAh_r-RZsbDFkFmKwEAmxws%3DSh5w%40mail.gmail.com

Regards,
Wang wei

Attachment
On Thu, Jun 23, 2022 at 12:51 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have improved the comments in this and other related sections of the
> > patch. See attached.
> Thanks for your comments and patch!
> Improved the comments as you suggested.
>
> > > > 3.
> > > > +
> > > > +  <para>
> > > > +   Setting streaming mode to <literal>apply</literal> could export invalid
> > LSN
> > > > +   as finish LSN of failed transaction. Changing the streaming mode and
> > making
> > > > +   the same conflict writes the finish LSN of the failed transaction in the
> > > > +   server log if required.
> > > > +  </para>
> > > >
> > > > How will the user identify that this is an invalid LSN value and she
> > > > shouldn't use it to SKIP the transaction? Can we change the second
> > > > sentence to: "User should change the streaming mode to 'on' if they
> > > > would instead wish to see the finish LSN on error. Users can use
> > > > finish LSN to SKIP applying the transaction." I think we can give
> > > > reference to docs where the SKIP feature is explained.
> > > Improved the sentence as suggested.
> > >
> >
> > You haven't answered first part of the comment: "How will the user
> > identify that this is an invalid LSN value and she shouldn't use it to
> > SKIP the transaction?". Have you checked what value it displays? For
> > example, in one of the case in apply_error_callback as shown in below
> > code, we don't even display finish LSN if it is invalid.
> > else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > errcontext("processing remote data for replication origin \"%s\"
> > during \"%s\" in transaction %u",
> >    errarg->origin_name,
> >    logicalrep_message_type(errarg->command),
> >    errarg->remote_xid);
> I am sorry that I missed something in my previous reply.
> The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
> Here is an example :
> ```
> 2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:  processing remote data for replication
origin"pg_16389" during "INSERT" for replication target relation "public.tab" in transaction 727 finished at 0/0
 
> ```
>

I don't think it is a good idea to display invalid values. We can mask
this as we are doing in other cases in function apply_error_callback.
The ideal way is that we provide a view/system table for users to
check these errors, but that is a matter for another patch. So users
probably need to check the logs to see if the error is from a background
apply worker to decide whether or not to switch the streaming mode.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From: "wangw.fnst@fujitsu.com"
On Thu, Jun 23, 2022 at 16:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 23, 2022 at 12:51 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > I have improved the comments in this and other related sections of the
> > > patch. See attached.
> > Thanks for your comments and patch!
> > Improved the comments as you suggested.
> >
> > > > > 3.
> > > > > +
> > > > > +  <para>
> > > > > +   Setting streaming mode to <literal>apply</literal> could export invalid
> > > LSN
> > > > > +   as finish LSN of failed transaction. Changing the streaming mode and
> > > making
> > > > > +   the same conflict writes the finish LSN of the failed transaction in the
> > > > > +   server log if required.
> > > > > +  </para>
> > > > >
> > > > > How will the user identify that this is an invalid LSN value and she
> > > > > shouldn't use it to SKIP the transaction? Can we change the second
> > > > > sentence to: "User should change the streaming mode to 'on' if they
> > > > > would instead wish to see the finish LSN on error. Users can use
> > > > > finish LSN to SKIP applying the transaction." I think we can give
> > > > > reference to docs where the SKIP feature is explained.
> > > > Improved the sentence as suggested.
> > > >
> > >
> > > You haven't answered first part of the comment: "How will the user
> > > identify that this is an invalid LSN value and she shouldn't use it to
> > > SKIP the transaction?". Have you checked what value it displays? For
> > > example, in one of the case in apply_error_callback as shown in below
> > > code, we don't even display finish LSN if it is invalid.
> > > else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > > errcontext("processing remote data for replication origin \"%s\"
> > > during \"%s\" in transaction %u",
> > >    errarg->origin_name,
> > >    logicalrep_message_type(errarg->command),
> > >    errarg->remote_xid);
> > I am sorry that I missed something in my previous reply.
> > The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
> > Here is an example :
> > ```
> > 2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:
> processing remote data for replication origin "pg_16389" during "INSERT" for
> replication target relation "public.tab" in transaction 727 finished at 0/0
> > ```
> >
> 
> I don't think it is a good idea to display invalid values. We can mask
> this as we are doing in other cases in function apply_error_callback.
> The ideal way is that we provide a view/system table for users to
> check these errors but that is a matter of another patch. So users
> probably need to check Logs to see if the error is from a background
> apply worker to decide whether or not to switch streaming mode.

Thanks for your comments.
I improved it as you suggested: I now mask the LSN if it is the invalid LSN (0/0).
Also, I improved the related pg-doc as following:
```
   When the streaming mode is <literal>parallel</literal>, the finish LSN of
   failed transactions may not be logged. In that case, it may be necessary to
   change the streaming mode to <literal>on</literal> and cause the same
   conflicts again so the finish LSN of the failed transaction will be written
   to the server log. For the usage of finish LSN, please refer to <link
   linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
   SKIP</command></link>.
```
After improving this (masking the invalid LSN), I found that this improvement
and the parallel apply patch do not seem to be strongly related. Would it be
better to make this improvement and commit it as a separate patch?


I also improved the patches as suggested by Peter-san in [1] and [2].
Thanks to Shi Yu for improving the patches by addressing the comments in [2].

Attach the new patches.

[1] - https://www.postgresql.org/message-id/CAHut%2BPtu_eWOVWAKrwkUFdTAh_r-RZsbDFkFmKwEAmxws%3DSh5w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut%2BPsDzRu6PD1uSRkftRXef-KwrOoYrcq7Cm0v4otisi5M%2Bg%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From: "wangw.fnst@fujitsu.com"
On Mon, Jun 21, 2022 at 9:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for the v11-0001 patch.
> 
> (I will review the remaining patches 0002-0005 and post any comments later)
> 

Thanks for your comments.

> 6. doc/src/sgml/protocol.sgml
> 
> Since there are protocol changes made here, shouldn’t there also be
> some corresponding LOGICALREP_PROTO_XXX constants and special checking
> added in the worker.c?

I think it is okay not to add a new macro, because we just expanded the
existing option ("streaming"). And we added a version check in the function
apply_handle_stream_abort.

> 8. doc/src/sgml/ref/create_subscription.sgml
> 
> +         <para>
> +          If set to <literal>on</literal>, the changes of transaction are
> +          written to temporary files and then applied at once after the
> +          transaction is committed on the publisher.
> +         </para>
> 
> SUGGESTION
> If set to on, the incoming changes are written to a temporary file and
> then applied only after the transaction is committed on the publisher.

In "on" mode, there may be more than one temporary file for one streaming
transaction. (see the invocation of function BufFileCreateFileSet in function
stream_open_file and function subxact_info_write)
So I think the existing description might be better.
If you feel this sentence is not clear, I will try to improve it later.

> 10. src/backend/access/transam/xact.c
> 
> @@ -1741,6 +1742,13 @@ RecordTransactionAbort(bool isSubXact)
>   elog(PANIC, "cannot abort transaction %u, it was already committed",
>   xid);
> 
> + /*
> + * Are we using the replication origins feature?  Or, in other words,
> + * are we replaying remote actions?
> + */
> + replorigin = (replorigin_session_origin != InvalidRepOriginId &&
> +   replorigin_session_origin != DoNotReplicateId);
> +
>   /* Fetch the data we need for the abort record */
>   nrels = smgrGetPendingDeletes(false, &rels);
>   nchildren = xactGetCommittedChildren(&children);
> @@ -1765,6 +1773,11 @@ RecordTransactionAbort(bool isSubXact)
>      MyXactFlags, InvalidTransactionId,
>      NULL);
> 
> + if (replorigin)
> + /* Move LSNs forward for this replication origin */
> + replorigin_session_advance(replorigin_session_origin_lsn,
> +    XactLastRecEnd);
> +
> 
> I did not see any reason why the code assigning the 'replorigin' and
> the code checking the 'replorigin' are separated like they are. I
> thought these 2 new code fragments should be kept together. Perhaps it
> was decided this assignment must be outside the critical section? But
> if that’s the case maybe a comment explaining so would be good.
> 
> ~~~
> 
> 11. src/backend/access/transam/xact.c
> 
> + if (replorigin)
> + /* Move LSNs forward for this replication origin */
> + replorigin_session_advance(replorigin_session_origin_lsn,
> +
> 
> The positioning of that comment is unusual. Maybe better before the check?

As Amit-san said in [1], this is just for consistency with the code in the
function RecordTransactionCommit.

> 12. src/backend/commands/subscriptioncmds.c - defGetStreamingMode
> 
> + /*
> + * If no parameter given, assume "true" is meant.
> + */
> + if (def->arg == NULL)
> + return SUBSTREAM_ON;
> 
> SUGGESTION for comment
> If the streaming parameter is given but no parameter value is
> specified, then assume "true" is meant.

I think it might be better to be consistent with the function defGetBoolean
here.

> 24. .../replication/logical/applybgwroker.c - LogicalApplyBgwLoop
> 
> +/* Apply Background Worker main loop */
> +static void
> +LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared
> *pst)
> 
> Why is the name incosistent with other function names in the file?
> Should it be apply_bgworker_loop?

I think it would be better for this function name to be consistent with the
function LogicalRepApplyLoop.

> 28. .../replication/logical/applybgwroker.c - LogicalApplyBgwMain
> 
> For consistency should it be called apply_bgworker_main?

I think it would be better for this function name to be consistent with the
function ApplyWorkerMain.

> 30. src/backend/replication/logical/decode.c
> 
> @@ -651,9 +651,10 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
>   {
>   for (i = 0; i < parsed->nsubxacts; i++)
>   {
> - ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
> + ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr,
> + commit_time);
>   }
> - ReorderBufferForget(ctx->reorder, xid, buf->origptr);
> + ReorderBufferForget(ctx->reorder, xid, buf->origptr, commit_time);
> 
> ReorderBufferForget was declared with 'abort_time' param. So it makes
> these calls a bit confusing looking to be passing 'commit_time'
> 
> Maybe better to do like below and pass 'forget_time' (inside that
> 'if') along with an explanatory comment:
> 
> TimestampTz forget_time = commit_time;

I did not change this. I am just not sure how much this will help.

> 36. src/backend/replication/logical/launcher.c -
> logicalrep_apply_background_worker_count
> 
> + int res = 0;
> +
> 
> A better variable name here would be 'count', or even 'n'.

I think it would be better for this variable name to be consistent with the
function logicalrep_sync_worker_count.

> 38. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort
> 
> + /*
> + * If the version of the publisher is lower than the version of the
> + * subscriber, it may not support sending these two fields, so only take
> + * these fields when include_abort_lsn is true.
> + */
> + if (include_abort_lsn)
> + {
> + abort_data->abort_lsn = pq_getmsgint64(in);
> + abort_data->abort_time = pq_getmsgint64(in);
> + }
> + else
> + {
> + abort_data->abort_lsn = InvalidXLogRecPtr;
> + abort_data->abort_time = 0;
> + }
> 
> This comment is documenting a decision that was made elsewhere.
> 
> But it somehow feels wrong to me that the decision to read or not read
> the abort time/lsn is made by the caller of this function. IMO it
> might make more sense if the server version was simply passed as a
> param and then this function can be in control of its own destiny and
> make the decision does it need to read those extra fields or not. An
> extra member flag can be added to LogicalRepStreamAbortData to
> indicate if abort_data read these values or not.

I understand what you mean, but I am not sure it is appropriate to introduce
version information into proto.c just for the STREAM_ABORT message, and doing
so might complicate that file. Also, I think it might not be a good idea to add
a flag to LogicalRepStreamAbortData (there is no similar flag in the other
LogicalRep.*Data structures).
So I just introduced a flag that decides whether these fields should be read
from the STREAM_ABORT message.

> 41.  src/backend/replication/logical/worker.c
> 
> -static ApplyErrorCallbackArg apply_error_callback_arg =
> +ApplyErrorCallbackArg apply_error_callback_arg =
>  {
>   .command = 0,
>   .rel = NULL,
> @@ -242,7 +246,7 @@ static ApplyErrorCallbackArg apply_error_callback_arg =
>   .origin_name = NULL,
>  };
> 
> Maybe it is still a good idea to at least keep the old comment here:
> /* Struct for saving and restoring apply errcontext information */

I think the old comment was for the structure ApplyErrorCallbackArg, not for
the variable apply_error_callback_arg, so I did not add a new comment here for
the variable apply_error_callback_arg.

> 42.  src/backend/replication/logical/worker.c
> 
> +/* check if we are applying the transaction in apply background worker */
> +#define apply_bgworker_active() (in_streamed_transaction &&
> stream_apply_worker != NULL)
> 
> 42a.
> Uppercase comment.
> 
> 42b.
> "in apply background worker" -> "in apply background worker"

=> 42a.
Improved as suggested.
=> 42b.
Sorry, I am not sure what you mean.

> 43.  src/backend/replication/logical/worker.c  - handle_streamed_transaction
> 
> @@ -426,41 +437,76 @@ end_replication_step(void)
>  }
> 
>  /*
> - * Handle streamed transactions.
> + * Handle streamed transactions for both main apply worker and apply
> background
> + * worker.
>   *
> - * If in streaming mode (receiving a block of streamed transaction), we
> - * simply redirect it to a file for the proper toplevel transaction.
> + * In streaming case (receiving a block of streamed transaction), for
> + * SUBSTREAM_ON mode, we simply redirect it to a file for the proper toplevel
> + * transaction, and for SUBSTREAM_APPLY mode, we send the changes to
> background
> + * apply worker (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE
> changes will
> + * also be applied in main apply worker).
>   *
> - * Returns true for streamed transactions, false otherwise (regular mode).
> + * For non-streamed transactions, returns false;
> + * For streamed transactions, returns true if in main apply worker (except we
> + * apply streamed transaction in "apply" mode and address
> + * LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes), false
> otherwise.
>   */
> 
> Maybe it is accurate (I don’t know), but this header comment seems
> excessively complicated with so many quirks about when to return
> true/false. Can it be reworded into plainer language?

Improved the comments like below:
```
 * For non-streamed transactions, returns false;
 * For streamed transactions, returns true if in main apply worker, false
 * otherwise.
 *
 * But there are two exceptions: If we apply streamed transaction in main apply
 * worker with parallel mode, it will return false when we address
 * LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes.
```

> 46.  src/backend/replication/logical/worker.c - apply_handle_stream_prepare
> 
> + /*
> + * This is the main apply worker and the transaction has been
> + * serialized to file, replay all the spooled operations.
> + */
> 
> SUGGESTION
> The transaction has been serialized to file. Replay all the spooled operations.

Both #46 and #54 seem to try to improve the same comment. Personally I prefer
the improvement in #54, so I improved this as suggested in #54.

> 50.  src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> + /* Check whether the publisher sends abort_lsn and abort_time. */
> + if (am_apply_bgworker())
> + include_abort_lsn = MyParallelState->server_version >= 150000;
> +
> + logicalrep_read_stream_abort(s, &abort_data, include_abort_lsn);
> 
> Here is where I felt maybe just the server version could be passed so
> the logicalrep_read_stream_abort could decide itself what message
> parts needed to be read. Basically it seems strange that the message
> contain parts which might not be read. I felt it is better to always
> read the whole message then later you can choose what parts you are
> interested in.

Please refer to the reply to #38.
In addition, we cannot always read these two new fields from the STREAM_ABORT
message: if the subscriber's version is higher than the publisher's version,
the publisher does not send them, so the subscriber would try to read data
beyond the end of the message. I think that would not be correct behaviour.

> 63.
> 
> I also did a quick check of all the new debug logging added. Here is
> everything from patch v11-0001.
> 
> apply_bgworker_free:
> + elog(DEBUG1, "adding finished apply worker #%u for xid %u to the idle list",
> + wstate->pstate->n, wstate->pstate->stream_xid);
> 
> LogicalApplyBgwLoop:
> + elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk,"
> + "waiting on shm_mq_receive", pst->n);
> 
> + elog(DEBUG1, "[Apply BGW #%u] exiting", pst->n);
> 
> ApplyBgworkerMain:
> + elog(DEBUG1, "[Apply BGW #%u] started", pst->n);
> 
> apply_bgworker_setup:
> + elog(DEBUG1, "setting up apply worker #%u",
> list_length(ApplyWorkersList) + 1);
> 
> apply_bgworker_set_status:
> + elog(DEBUG1, "[Apply BGW #%u] set status to %d", MyParallelState->n,
> status);
> 
> apply_bgworker_subxact_info_add:
> + elog(DEBUG1, "[Apply BGW #%u] defining savepoint %s",
> + MyParallelState->n, spname);
> 
> apply_handle_stream_prepare:
> + elog(DEBUG1, "received prepare for streamed transaction %u",
> + prepare_data.xid);
> 
> apply_handle_stream_start:
> + elog(DEBUG1, "starting streaming of xid %u", stream_xid);
> 
> apply_handle_stream_stop:
> + elog(DEBUG1, "stopped streaming of xid %u, %u changes streamed",
> stream_xid, nchanges);
> 
> apply_handle_stream_abort:
> + elog(DEBUG1, "[Apply BGW #%u] aborting current transaction xid=%u,
> subxid=%u",
> + MyParallelState->n, GetCurrentTransactionIdIfAny(),
> + GetCurrentSubTransactionId());
> 
> + elog(DEBUG1, "[Apply BGW #%u] rolling back to savepoint %s",
> + MyParallelState->n, spname);
> 
> apply_handle_stream_commit:
> + elog(DEBUG1, "received commit for streamed transaction %u", xid);
> 
> 
> Observations:
> 
> 63a.
> Every newly introduced message is at level DEBUG1 (not DEBUG). AFAIK
> this is OK, because the messages are all protocol related, and every
> other existing debug message in the current replication worker.c is
> also at the same DEBUG1 level.
> 
> 63b.
> The prefix "[Apply BGW #%u]" is used to indicate the bgworker is
> executing the code, but it does not seem to be used 100% consistently
> - e.g. there are some apply_bgworker_XXX functions not using this
> prefix. Is that OK or a mistake?

Thanks for your check. I confirmed this point in v13. There are 5 functions
that do not use the prefix "[Apply BGW #%u]":
```
apply_bgworker_free
apply_bgworker_setup
apply_bgworker_send_data
apply_bgworker_wait_for
apply_bgworker_check_status
```
These 5 functions do not use this prefix because their logs are only output by
the main apply worker. So I think it is okay.


The rest of the comments are improved as suggested.
The new patches were attached in [2].

[1] - https://www.postgresql.org/message-id/CAA4eK1J9_jcLNVqmxt_d28uGi6hAV31wjYdgmg1p8BGuEctNpw%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB62758DBE8FA12BA72A43AC819EB89%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From: "wangw.fnst@fujitsu.com"
On Thu, Jun 23, 2022 at 9:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for v12-0002

Thanks for your comments.

> 3. .../subscription/t/022_twophase_cascade.pl
> 
> For every test file in this patch the new function is passed $is_apply
> = 0/1 to indicate to use 'on' or 'apply' parameter value. But in this
> test file the parameter is passed as $streaming_mode = 'on'/'apply'.
> 
> I was wondering if (for the sake of consistency) it might be better to
> use the same parameter kind for all of the test files. Actually, I
> don't care if you choose to do nothing and leave this as-is; I am just
> posting this review comment in case it was not a deliberate decision
> to implement them differently.
> 
> e.g.
> + my ($node_publisher, $node_subscriber, $appname, $is_apply) = @_;
> 
> versus
> + my ($node_A, $node_B, $node_C, $appname_B, $appname_C,
> $streaming_mode) =
> +   @_;

This is because in 022_twophase_cascade.pl the ALTER SUBSCRIPTION that changes
the streaming mode is inside test_streaming(), so it is more convenient to pass
the streaming mode we want ('on' or 'apply') and use it directly in the ALTER
SUBSCRIPTION command.
In the other files we only check the log in apply mode, so it is sufficient to
pass 'is_apply' (whose value is 0 or 1).
Because of these differences, I did not change it.

The rest of the comments are improved as suggested.
The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62758DBE8FA12BA72A43AC819EB89%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

On Tue, Jun 28, 2022 at 8:51 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Thu, Jun 23, 2022 at 16:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jun 23, 2022 at 12:51 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > > I have improved the comments in this and other related sections of the
> > > > patch. See attached.
> > > Thanks for your comments and patch!
> > > Improved the comments as you suggested.
> > >
> > > > > > 3.
> > > > > > +
> > > > > > +  <para>
> > > > > > +   Setting streaming mode to <literal>apply</literal> could export invalid
> > > > LSN
> > > > > > +   as finish LSN of failed transaction. Changing the streaming mode and
> > > > making
> > > > > > +   the same conflict writes the finish LSN of the failed transaction in the
> > > > > > +   server log if required.
> > > > > > +  </para>
> > > > > >
> > > > > > How will the user identify that this is an invalid LSN value and she
> > > > > > shouldn't use it to SKIP the transaction? Can we change the second
> > > > > > sentence to: "User should change the streaming mode to 'on' if they
> > > > > > would instead wish to see the finish LSN on error. Users can use
> > > > > > finish LSN to SKIP applying the transaction." I think we can give
> > > > > > reference to docs where the SKIP feature is explained.
> > > > > Improved the sentence as suggested.
> > > > >
> > > >
> > > > You haven't answered first part of the comment: "How will the user
> > > > identify that this is an invalid LSN value and she shouldn't use it to
> > > > SKIP the transaction?". Have you checked what value it displays? For
> > > > example, in one of the case in apply_error_callback as shown in below
> > > > code, we don't even display finish LSN if it is invalid.
> > > > else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > > > errcontext("processing remote data for replication origin \"%s\"
> > > > during \"%s\" in transaction %u",
> > > >    errarg->origin_name,
> > > >    logicalrep_message_type(errarg->command),
> > > >    errarg->remote_xid);
> > > I am sorry that I missed something in my previous reply.
> > > The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
> > > Here is an example :
> > > ```
> > > 2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:
> > processing remote data for replication origin "pg_16389" during "INSERT" for
> > replication target relation "public.tab" in transaction 727 finished at 0/0
> > > ```
> > >
> >
> > I don't think it is a good idea to display invalid values. We can mask
> > this as we are doing in other cases in function apply_error_callback.
> > The ideal way is that we provide a view/system table for users to
> > check these errors but that is a matter of another patch. So users
> > probably need to check Logs to see if the error is from a background
> > apply worker to decide whether or not to switch streaming mode.
>
> Thanks for your comments.
> I improved it as you suggested. I mask the LSN if it is invalid LSN(0/0).
> Also, I improved the related pg-doc as following:
> ```
>    When the streaming mode is <literal>parallel</literal>, the finish LSN of
>    failed transactions may not be logged. In that case, it may be necessary to
>    change the streaming mode to <literal>on</literal> and cause the same
>    conflicts again so the finish LSN of the failed transaction will be written
>    to the server log. For the usage of finish LSN, please refer to <link
>    linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
>    SKIP</command></link>.
> ```
> After improving this (mask invalid LSN), I found that this improvement and
> parallel apply patch do not seem to have a strong correlation. Would it be
> better to improve and commit in another separate patch?
>

Is there any other case where we can hit this code path (mask the
invalid LSN) without this patch?

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From: "wangw.fnst@fujitsu.com"
On Tues, Jun 28, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jun 28, 2022 at 8:51 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Thu, Jun 23, 2022 at 16:44 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > On Thu, Jun 23, 2022 at 12:51 PM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > > I have improved the comments in this and other related sections of the
> > > > > patch. See attached.
> > > > Thanks for your comments and patch!
> > > > Improved the comments as you suggested.
> > > >
> > > > > > > 3.
> > > > > > > +
> > > > > > > +  <para>
> > > > > > > +   Setting streaming mode to <literal>apply</literal> could export
> invalid
> > > > > LSN
> > > > > > > +   as finish LSN of failed transaction. Changing the streaming mode
> and
> > > > > making
> > > > > > > +   the same conflict writes the finish LSN of the failed transaction in
> the
> > > > > > > +   server log if required.
> > > > > > > +  </para>
> > > > > > >
> > > > > > > How will the user identify that this is an invalid LSN value and she
> > > > > > > shouldn't use it to SKIP the transaction? Can we change the second
> > > > > > > sentence to: "User should change the streaming mode to 'on' if they
> > > > > > > would instead wish to see the finish LSN on error. Users can use
> > > > > > > finish LSN to SKIP applying the transaction." I think we can give
> > > > > > > reference to docs where the SKIP feature is explained.
> > > > > > Improved the sentence as suggested.
> > > > > >
> > > > >
> > > > > You haven't answered first part of the comment: "How will the user
> > > > > identify that this is an invalid LSN value and she shouldn't use it to
> > > > > SKIP the transaction?". Have you checked what value it displays? For
> > > > > example, in one of the case in apply_error_callback as shown in below
> > > > > code, we don't even display finish LSN if it is invalid.
> > > > > else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > > > > errcontext("processing remote data for replication origin \"%s\"
> > > > > during \"%s\" in transaction %u",
> > > > >    errarg->origin_name,
> > > > >    logicalrep_message_type(errarg->command),
> > > > >    errarg->remote_xid);
> > > > I am sorry that I missed something in my previous reply.
> > > > The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
> > > > Here is an example :
> > > > ```
> > > > 2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:
> > > processing remote data for replication origin "pg_16389" during "INSERT" for
> > > replication target relation "public.tab" in transaction 727 finished at 0/0
> > > > ```
> > > >
> > >
> > > I don't think it is a good idea to display invalid values. We can mask
> > > this as we are doing in other cases in function apply_error_callback.
> > > The ideal way is that we provide a view/system table for users to
> > > check these errors but that is a matter of another patch. So users
> > > probably need to check Logs to see if the error is from a background
> > > apply worker to decide whether or not to switch streaming mode.
> >
> > Thanks for your comments.
> > I improved it as you suggested. I mask the LSN if it is invalid LSN(0/0).
> > Also, I improved the related pg-doc as following:
> > ```
> >    When the streaming mode is <literal>parallel</literal>, the finish LSN of
> >    failed transactions may not be logged. In that case, it may be necessary to
> >    change the streaming mode to <literal>on</literal> and cause the same
> >    conflicts again so the finish LSN of the failed transaction will be written
> >    to the server log. For the usage of finish LSN, please refer to <link
> >    linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
> >    SKIP</command></link>.
> > ```
> > After improving this (mask invalid LSN), I found that this improvement and
> > parallel apply patch do not seem to have a strong correlation. Would it be
> > better to improve and commit in another separate patch?
> >
> 
> Is there any other case where we can hit this code path (mask the
> invalid LSN) without this patch?

I realized that there is no normal case that could hit this code path in HEAD.
To hit this code path, apply_error_callback_arg.rel must be set to a valid
relation while the finish LSN is still InvalidXLogRecPtr.
But in HEAD, we only set apply_error_callback_arg.rel to a valid relation after
setting the finish LSN to a valid LSN.
So it seems fine to change this along with the parallel apply patch.
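
To be concrete, the idea is roughly as below (a simplified sketch of the
relevant branch only, not the exact patch hunk): when the relation is known but
the finish LSN is still invalid, fall back to a message that omits the LSN
instead of printing "finished at 0/0".
```
if (errarg->rel && XLogRecPtrIsInvalid(errarg->finish_lsn))
	errcontext("processing remote data for replication origin \"%s\" during \"%s\" for replication target relation \"%s.%s\" in transaction %u",
			   errarg->origin_name,
			   logicalrep_message_type(errarg->command),
			   errarg->rel->remoterel.nspname,
			   errarg->rel->remoterel.relname,
			   errarg->remote_xid);
else if (errarg->rel)
	errcontext("processing remote data for replication origin \"%s\" during \"%s\" for replication target relation \"%s.%s\" in transaction %u finished at %X/%X",
			   errarg->origin_name,
			   logicalrep_message_type(errarg->command),
			   errarg->rel->remoterel.nspname,
			   errarg->rel->remoterel.relname,
			   errarg->remote_xid,
			   LSN_FORMAT_ARGS(errarg->finish_lsn));
```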

Regards,
Wang wei


Below are some review comments for patches v14-0001, and v14-0002:

========
v14-0001
========

1.1 Commit message

For now, 'parallel' means the streaming will be applied
via a apply background worker if available. 'on' means the streaming
transaction will be spilled to disk.  By the way, we do not change the default
behaviour.

SUGGESTION (minor tweaks)
The parameter value 'parallel' means the streaming will be applied via
an apply background worker, if available. The parameter value 'on'
means the streaming transaction will be spilled to disk.  The default
value is 'off' (same as current behaviour).

======

1.2 doc/src/sgml/protocol.sgml - Protocol constants

Previously I wrote that since there are protocol changes here,
shouldn’t there also be some corresponding LOGICALREP_PROTO_XXX
constants and special checking added in the worker.c?

But you said [1 comment #6] you think it is OK because...

IMO, I still disagree with the reply. The fact is that the protocol
*has* been changed, so IIUC that is precisely the reason for having
those protocol constants.

e.g I am guessing you might assign the new one somewhere here:
--
    server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
    options.proto.logical.proto_version =
        server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
        server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
        LOGICALREP_PROTO_VERSION_NUM;
--

And then later you would refer to this new protocol version (instead
of the server version) when calling to the apply_handle_stream_abort
function.
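
For example, something like the below (the new constant name and value here
are just illustrative, not a proposal for the final names):
--
#define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4

    server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
    options.proto.logical.proto_version =
        server_version >= 160000 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
        server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
        server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
        LOGICALREP_PROTO_VERSION_NUM;
--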

======

1.3 doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>on</literal>, the changes of transaction are
+          written to temporary files and then applied at once after the
+          transaction is committed on the publisher.
+         </para>

Previously I suggested changing some text but it was rejected [1
comment #8] because you said there may be *multiple*  files, not just
one. That is fair enough, but there were some other changes to that
suggested text unrelated to the number of files.

SUGGESTION #2
If set to on, the incoming changes are written to temporary files and
then applied only after the transaction is committed on the publisher.

~~~

1.4

+         <para>
+          If set to <literal>parallel</literal>, incoming changes are directly
+          applied via one of the apply background workers, if available. If no
+          background worker is free to handle streaming transaction then the
+          changes are written to a file and applied after the transaction is
+          committed. Note that if an error happens when applying changes in a
+          background worker, the finish LSN of the remote transaction might
+          not be reported in the server log.
          </para>

Should this also say "written to temporary files" instead of "written
to a file"?

======

1.5 src/backend/commands/subscriptioncmds.c

+ /*
+ * If no parameter given, assume "true" is meant.
+ */

Previously I suggested an update for this comment, but it was rejected
[1 comment #12] saying you wanted consistency with defGetBoolean.

Sure, that is one point of view. Another one is that "two wrongs don't
make a right". IIUC that comment as it currently stands is incorrect
because in this case there *is* a parameter given - it is just the
parameter *value* that is missing. Maybe see what other people think?

======

1.6. src/backend/replication/logical/Makefile

It seems to me like these files were intended to be listed in
alphabetical order, so you should move this new file accordingly.

======

1.7 .../replication/logical/applybgworker.c

+/* queue size of DSM, 16 MB for now. */
+#define DSM_QUEUE_SIZE 160000000

The comment should start uppercase.

~~~

1.8 .../replication/logical/applybgworker.c - apply_bgworker_can_start

Maybe this is just my opinion but it sounds a bit strange to over-use
"we" in all the comments.

1.8.a
+/*
+ * Confirm if we can try to start a new apply background worker.
+ */
+static bool
+apply_bgworker_can_start(TransactionId xid)

SUGGESTION
Check if starting a new apply background worker is allowed.

1.8.b
+ /*
+ * We don't start new background worker if we are not in streaming parallel
+ * mode.
+ */

SUGGESTION
Don't start a new background worker if not in streaming parallel mode.

1.8.c
+ /*
+ * We don't start new background worker if user has set skiplsn as it's
+ * possible that user want to skip the streaming transaction. For
+ * streaming transaction, we need to spill the transaction to disk so that
+ * we can get the last LSN of the transaction to judge whether to skip
+ * before starting to apply the change.
+ */

SUGGESTION
Don't start a new background worker if...

~~~

1.9 .../replication/logical/applybgworker.c - apply_bgworker_start

+/*
+ * Try to start worker inside ApplyWorkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_start(TransactionId xid)

The comment seems not quite right.

SUGGESTION
Try to start an apply background worker and, if successful, cache it
in ApplyWorkersHash keyed by the specified xid.

~~~

1.10 .../replication/logical/applybgworker.c - apply_bgworker_find

+ /*
+ * Find entry for requested transaction.
+ */
+ entry = hash_search(ApplyWorkersHash, &xid, HASH_FIND, &found);
+ if (found)
+ {
+ entry->wstate->pstate->status = APPLY_BGWORKER_BUSY;
+ return entry->wstate;
+ }
+ else
+ return NULL;
+}

IMO it is an unexpected side-effect for the function called "find" to
be also modifying the thing that it found. IMO this setting BUSY
should either be done by the caller, or else this function name should
be renamed to make it obvious that this is doing more than just
"finding" something.

~~~

1.11 .../replication/logical/applybgworker.c - LogicalApplyBgwLoop

+ /*
+ * Push apply error context callback. Fields will be filled applying
+ * applying a change.
+ */

Typo: "applying applying"

~~~

1.12 .../replication/logical/applybgworker.c - apply_bgworker_setup

+ if (launched)
+ ApplyWorkersList = lappend(ApplyWorkersList, wstate);
+ else
+ {
+ shm_mq_detach(wstate->mq_handle);
+ dsm_detach(wstate->dsm_seg);
+ pfree(wstate);
+
+ wstate->mq_handle = NULL;
+ wstate->dsm_seg = NULL;
+ wstate = NULL;
+ }

I am not sure what those first 2 NULL assignments are trying to
achieve. Nothing AFAICT. In any case, it looks like a bug to dereference
'wstate' after you have already pfree-d it on the line above.
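
e.g. (illustrative only) something like the below looks safer to me:
--
else
{
    shm_mq_detach(wstate->mq_handle);
    dsm_detach(wstate->dsm_seg);
    pfree(wstate);
    wstate = NULL;    /* don't touch the freed memory again */
}
--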

~~~

1.13 .../replication/logical/applybgworker.c - apply_bgworker_check_status

+ * Exit if any relation is not in the READY state and if any worker is handling
+ * the streaming transaction at the same time. Because for streaming
+ * transactions that is being applied in apply background worker, we cannot
+ * decide whether to apply the change for a relation that is not in the READY
+ * state (see should_apply_changes_for_rel) as we won't know remote_final_lsn
+ * by that time.
+ */
+void
+apply_bgworker_check_status(void)

Somehow, I felt that this "Exit if..." comment really belonged at the
appropriate place in the function body, instead of in the function
header.

======

1.14 src/backend/replication/logical/launcher.c - WaitForReplicationWorkerAttach

@@ -151,8 +153,10 @@ get_subscription_list(void)
  *
  * This is only needed for cleaning up the shared memory in case the worker
  * fails to attach.
+ *
+ * Returns false if the attach fails. Otherwise return true.
  */
-static void
+static bool
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,

Comment should say either "Return" or "returns"; not one of each.

~~~

1.15. src/backend/replication/logical/launcher.c -
WaitForReplicationWorkerAttach

+ return worker->in_use ? true : false;

Same as just:
return worker->in_use;

~~~

1.16. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

+ bool is_subworker = (subworker_dsm != DSM_HANDLE_INVALID);
+
+ /* We don't support table sync in subworker */
+ Assert(!(is_subworker && OidIsValid(relid)));

I'm not sure the comment is good. It sounds like it is something that
might be possible but is just currently "not supported". In fact, I
thought this is really just a sanity check, because the combination of
those params is just plain wrong, isn't it? Maybe a better comment is
just:
/* Sanity check */

======

1.17 src/backend/replication/logical/proto.c

+ /*
+ * If the version of the publisher is lower than the version of the
+ * subscriber, it may not support sending these two fields. So these
+ * fields are only taken if they are included.
+ */
+ if (include_abort_lsn)

1.17a
I thought that the comment about "versions of publishers lower than
version of subscribers..." is bogus. Perhaps you have in mind just
thinking about versions prior to PG16 but that is not what the comment
is saying. E.g. sometime in the future, the publisher may be PG18 and
the subscriber might be PG25. So that might work fine (even though the
publisher is a lower version), but this comment will be completely
misleading. BTW this is another reason I think code needs to be using
protocol versions (not server versions). [See other comment #1.2]

1.17b.
Anyway, I felt that any comment describing the meaning of the
'include_abort_lsn' param would be better in the function header
comment, instead of in the function body.

======

1.18 src/backend/replication/logical/worker.c - file header comment

+ * 1) Separate background workers
+ *
+ * Assign a new apply background worker (if available) as soon as the xact's...

Somehow this long comment never mentions that this mode is
selected by the user using 'streaming=parallel'. I thought it should
probably say that somewhere here.

~~~

1.19. src/backend/replication/logical/worker.c -

ApplyErrorCallbackArg apply_error_callback_arg =
{
.command = 0,
.rel = NULL,
.remote_attnum = -1,
.remote_xid = InvalidTransactionId,
.finish_lsn = InvalidXLogRecPtr,
.origin_name = NULL,
};

I still thought that the above initialization deserves some sort of
comment, even if you don't want to use the comment text previously
suggested [1 comment #41]

~~~

1.20 src/backend/replication/logical/worker.c -

@@ -251,27 +258,38 @@ static MemoryContext LogicalStreamingContext = NULL;
 WalReceiverConn *LogRepWorkerWalRcvConn = NULL;

 Subscription *MySubscription = NULL;
-static bool MySubscriptionValid = false;
+bool MySubscriptionValid = false;

 bool in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;

 /* fields valid only when processing streamed transaction */
-static bool in_streamed_transaction = false;
+bool in_streamed_transaction = false;

The tab alignment here looks wrong. IMO it's not worth trying to align
these at all. I think the tabs are leftover from before when the vars
used to be static.

~~~

1.21 src/backend/replication/logical/worker.c - apply_bgworker_active

+/* Check if we are applying the transaction in apply background worker */
+#define apply_bgworker_active() (in_streamed_transaction &&
stream_apply_worker != NULL)

Sorry [1 comment #42b], I had meant to write "in apply background
worker" -> "in an apply background worker".

~~~

1.22 src/backend/replication/logical/worker.c - skip_xact_finish_lsn

 /*
  * We enable skipping all data modification changes (INSERT, UPDATE, etc.) for
  * the subscription if the remote transaction's finish LSN matches
the subskiplsn.
  * Once we start skipping changes, we don't stop it until we skip all
changes of
  * the transaction even if pg_subscription is updated and
MySubscription->skiplsn
- * gets changed or reset during that. Also, in streaming transaction cases, we
- * don't skip receiving and spooling the changes since we decide whether or not
+ * gets changed or reset during that. Also, in streaming transaction
cases (streaming = on),
+ * we don't skip receiving and spooling the changes since we decide
whether or not
  * to skip applying the changes when starting to apply changes. The
subskiplsn is
  * cleared after successfully skipping the transaction or applying non-empty
  * transaction. The latter prevents the mistakenly specified subskiplsn from
- * being left.
+ * being left. Note that we cannot skip the streaming transaction in parallel
+ * mode, because we cannot get the finish LSN before applying the changes.
  */

"in parallel mode, because" -> "in 'streaming = parallel' mode, because"

~~~

1.23 src/backend/replication/logical/worker.c - handle_streamed_transaction

1.23a
 /*
- * Handle streamed transactions.
+ * Handle streamed transactions for both main apply worker and apply background
+ * worker.

SUGGESTION
Handle streamed transactions for both the main apply worker and the
apply background workers.

1.23b
+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, we simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_PARALLEL mode, we send the changes to
+ * background apply worker (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE
+ * changes will also be applied in main apply worker).

"background apply worker" -> "apply background workers"

Also, I think you don't need to say "we" everywhere:
"we simply redirect it" -> "simply redirect it"
"we send the changes" -> "send the changes"

1.23c.
+ * But there are two exceptions: If we apply streamed transaction in main apply
+ * worker with parallel mode, it will return false when we address
+ * LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes.

SUGGESTION
Exception: When parallel mode is applying streamed transaction in the
main apply worker, (e.g. when addressing
LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes), then return false.

~~~

1.24 src/backend/replication/logical/worker.c - handle_streamed_transaction

1.24a.
  /* not in streaming mode */
- if (!in_streamed_transaction)
+ if (!(in_streamed_transaction || am_apply_bgworker()))
  return false;
Uppercase comment

1.24b
+ /* define a savepoint for a subxact if needed. */
+ apply_bgworker_subxact_info_add(current_xid);

Uppercase comment

~~~

1.25 src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /*
+ * This is the main apply worker, and there is an apply background
+ * worker. So we apply the changes of this transaction in an apply
+ * background worker. Pass the data to the worker.
+ */

SUGGESTION (to be more consistent with the next comment)
This is the main apply worker, but there is an apply background
worker, so apply the changes of this transaction in that background
worker. Pass the data to the worker.

~~~

1.26 src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /*
+ * This is the main apply worker, but there is no apply background
+ * worker. So we write to temporary files and apply when the final
+ * commit arrives.

SUGGESTION
This is the main apply worker, but there is no apply background
worker, so write to temporary files and apply when the final commit
arrives.

~~~

1.27 src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /*
+ * Check if we are processing this transaction in an apply background
+ * worker.
+ */

SUGGESTION:
Check if we are processing this transaction in an apply background
worker and if so, send the changes to that worker.

~~~

1.28 src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ if (wstate)
+ {
+ apply_bgworker_send_data(wstate, s->len, s->data);
+
+ /*
+ * Wait for apply background worker to finish. This is required to
+ * maintain commit order which avoids failures due to transaction
+ * dependencies and deadlocks.
+ */
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
+ apply_bgworker_free(wstate);

I think maybe the comment can be changed slightly, and then it can
move up one line to the top of this code block (above the 3
statements). I think it will become more readable.

SUGGESTION
After sending the data to the apply background worker, wait for that
worker to finish. This is necessary to maintain commit order which
avoids failures due to transaction dependencies and deadlocks.

~~~

1.29 src/backend/replication/logical/worker.c - apply_handle_stream_start

+ /*
+ * If no worker is available for the first stream start, we start to
+ * serialize all the changes of the transaction.
+ */
+ else
+ {

1.29a.
I felt that this comment should be INSIDE the else { block to be more readable.

1.29b.
The comment can also be simplified a bit
SUGGESTION:
Since no apply background worker is available for the first stream
start, serialize all the changes of the transaction.

~~~

1.30 src/backend/replication/logical/worker.c - apply_handle_stream_start

+ /* if this is not the first segment, open existing subxact file */
+ if (!first_segment)
+ subxact_info_read(MyLogicalRepWorker->subid, stream_xid);

Uppercase comment

~~~

1.31. src/backend/replication/logical/worker.c - apply_handle_stream_stop

+ if (apply_bgworker_active())
+ {
+ char action = LOGICAL_REP_MSG_STREAM_STOP;

Are all the tabs before the variable needed?

~~~

1.32. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ include_abort_lsn = MyParallelState->server_version >= 150000;

Previously I already reported about this [1 comment #50]

I just do not trust this code to do the correct thing. E.g. what if
streaming=parallel but all bgworkers are exhausted? Then IIUC
am_apply_bgworker() will not be true. But then, with PG15 servers on
both pub/sub, the publisher will WRITE something that the subscriber
will never READ. Won't the stream IO get out of step and everything
fall apart?

Perhaps the include_abort_lsn assignment should be unconditionally
set, and I think this should be a protocol version check instead of a
server version check shouldn’t it (see my earlier comment 1.2)

~~~

1.32 src/backend/replication/logical/worker.c - apply_handle_stream_abort

BTW, I think PG16devel is now stamped on HEAD, so perhaps all of your
150000 checks should now be changed to say 160000?

~~~

1.33 src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /*
+ * We are in main apply worker and the transaction has been serialized
+ * to file.
+ */
+ else
+ serialize_stream_abort(xid, subxid);

I think this will be more readable if written like:

else
{
/* put comment here... */
serialize_stream_abort(xid, subxid);
}

~~~

1.34 src/backend/replication/logical/worker.c - apply_dispatch

-
 /*
  * Logical replication protocol message dispatcher.
  */
-static void
+void
 apply_dispatch(StringInfo s)

Maybe removing the whitespace is not really needed as part of this patch?

======

1.35 src/include/catalog/pg_subscription.h

+/* Disallow streaming in-progress transactions */
+#define SUBSTREAM_OFF 'f'
+
+/*
+ * Streaming transactions are written to a temporary file and applied only
+ * after the transaction is committed on upstream.
+ */
+#define SUBSTREAM_ON 't'
+
+/* Streaming transactions are applied immediately via a background worker */
+#define SUBSTREAM_PARALLEL 'p'
+

1.35a
Should all these "Streaming transactions" be called "Streaming
in-progress transactions"?

1.35b.
Either align the values or don’t. Currently, they seem half-aligned.

1.35c.
SUGGESTION (modify the 1st comment to be more consistent with the others)
Streaming in-progress transactions are disallowed.

======

1.36 src/include/replication/worker_internal.h

 extern int logicalrep_sync_worker_count(Oid subid);
+extern int logicalrep_apply_background_worker_count(Oid subid);

Just wondering if this should be called
"logicalrep_apply_bgworker_count(Oid subid);" for consistency with the
other function naming.

========
v14-0002
========

2.1 Commit message

Change all TAP tests using the SUBSCRIPTION "streaming" option, so they
now test both 'on' and 'parallel' values.

"option" -> "parameter"


------
[1]
https://www.postgresql.org/message-id/OS3PR01MB6275DCCDF35B3BBD52CA02CC9EB89%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Jul 1, 2022 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> ======
>
> 1.2 doc/src/sgml/protocol.sgml - Protocol constants
>
> Previously I wrote that since there are protocol changes here,
> shouldn’t there also be some corresponding LOGICALREP_PROTO_XXX
> constants and special checking added in the worker.c?
>
> But you said [1 comment #6] you think it is OK because...
>
> IMO, I still disagree with the reply. The fact is that the protocol
> *has* been changed, so IIUC that is precisely the reason for having
> those protocol constants.
>
> e.g I am guessing you might assign the new one somewhere here:
> --
>     server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
>     options.proto.logical.proto_version =
>         server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
>         server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
>         LOGICALREP_PROTO_VERSION_NUM;
> --
>
> And then later you would refer to this new protocol version (instead
> of the server version) when calling to the apply_handle_stream_abort
> function.
>
> ======
>

One point related to this that occurred to me is how it will behave if
the publisher is of version >= 16 whereas the subscriber is of version
<= 15. Won't the publisher in that case send the new fields while the
subscriber won't read them, which may cause some problems?
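
To make the point concrete, the publisher side would need something like the
below so that an older subscriber never receives bytes it will not read (this
is only a sketch and the signature is hypothetical):
```
void
logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
                              TransactionId subxid, XLogRecPtr abort_lsn,
                              TimestampTz abort_time, bool write_abort_info)
{
    pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_ABORT);

    pq_sendint32(out, xid);
    pq_sendint32(out, subxid);

    /* Send the abort details only if the subscriber can consume them. */
    if (write_abort_info)
    {
        pq_sendint64(out, abort_lsn);
        pq_sendint64(out, abort_time);
    }
}
```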

> ======
>
> 1.5 src/backend/commands/subscriptioncmds.c
>
> + /*
> + * If no parameter given, assume "true" is meant.
> + */
>
> Previously I suggested an update for this comment, but it was rejected
> [1 comment #12] saying you wanted consistency with defGetBoolean.
>
> Sure, that is one point of view. Another one is that "two wrongs don't
> make a right". IIUC that comment as it currently stands is incorrect
> because in this case there *is* a parameter given - it is just the
> parameter *value* that is missing.
>

You have a point, but if we see this function in the vicinity then the
proposed comment also makes sense.

--
With Regards,
Amit Kapila.



Below are some review comments for patch v14-0003:

========
v14-0003
========

3.1 Commit message

If any of the following checks are violated, an error will be reported.
1. The unique columns between publisher and subscriber are difference.
2. There is any non-immutable function present in expression in
subscriber's relation. Check from the following 4 items:
   a. The function in triggers;
   b. Column default value expressions and domain constraints;
   c. Constraint expressions.
   d. The foreign keys.

SUGGESTION (rewording to match the docs and the code).

Add some checks before using apply background worker to apply changes.
streaming=parallel mode has two requirements:
1) The unique columns must be the same between publisher and subscriber
2) There cannot be any non-immutable functions in the subscriber-side
replicated table. Look for functions in the following places:
* a. Trigger functions
* b. Column default value expressions and domain constraints
* c. Constraint expressions
* d. Foreign keys

======

3.2 doc/src/sgml/ref/create_subscription.sgml

+          To run in this mode, there are following two requirements. The first
+          is that the unique column should be the same between publisher and
+          subscriber; the second is that there should not be any non-immutable
+          function in subscriber-side replicated table.

SUGGESTION
Parallel mode has two requirements: 1) the unique columns must be the
same between publisher and subscriber; 2) there cannot be any
non-immutable functions in the subscriber-side replicated table.

======

3.3 .../replication/logical/applybgworker.c - apply_bgworker_relation_check

+ * Check if changes on this logical replication relation can be applied by
+ * apply background worker.

SUGGESTION
Check if changes on this relation can be applied by an apply background worker.


~~~

3.4

+ * Although we maintains the commit order by allowing only one process to
+ * commit at a time, our access order to the relation has changed.

SUGGESTION
Although the commit order is maintained by allowing only one process to
commit at a time, the access order to the relation has changed.

~~~

3.5

+ /* Check only we are in apply bgworker. */
+ if (!am_apply_bgworker())
+ return;

SUGGESTION
/* Skip check if not an apply background worker. */

~~~

3.6

+ /*
+ * If it is a partitioned table, we do not check it, we will check its
+ * partition later.
+ */

This comment is lacking useful details:

/* Partition table checks are done later in (?????) */

~~~

3.7

+ if (!rel->sameunique)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot replicate relation with different unique index"),
+ errhint("Please change the streaming option to 'on' instead of
'parallel'.")));

Maybe the first message should change slightly so it is worded
consistently with the other one.

SUGGESTION
errmsg("cannot replicate relation. Unique indexes must be the same"),

======

3.8 src/backend/replication/logical/proto.c

-#define LOGICALREP_IS_REPLICA_IDENTITY 1
+#define LOGICALREP_IS_REPLICA_IDENTITY 0x0001
+#define LOGICALREP_IS_UNIQUE 0x0002

I think these constants should be named differently to reflect that they
are just attribute flags. They should use a similar bitset style to
the other nearby constants.

SUGGESTION
#define ATTR_IS_REPLICA_IDENTITY (1 << 0)
#define ATTR_IS_UNIQUE (1 << 1)

~~~

3.9 src/backend/replication/logical/proto.c - logicalrep_write_attrs

This big slab of new code to get the BMS looks very similar to
RelationGetIdentityKeyBitmap. So perhaps this code should be
encapsulated in another function like that one (in relcache.c?) and
then just called from logicalrep_write_attrs.
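
e.g. a hypothetical helper (the name is only a suggestion), by analogy with the
existing one:
--
extern Bitmapset *RelationGetUniqueKeyBitmap(Relation rel);
--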

======

3.10 src/backend/replication/logical/relation.c -
logicalrep_relmap_reset_volatility_cb

+/*
+ * Reset the flag volatility of all existing entry in the relation map cache.
+ */
+static void
+logicalrep_relmap_reset_volatility_cb(Datum arg, int cacheid, uint32 hashvalue)

SUGGESTION
Reset the volatility flag of all entries in the relation map cache.

~~~

3.11 src/backend/replication/logical/relation.c -
logicalrep_rel_mark_safe_in_apply_bgworker

+/*
+ * Check if unique index/constraint matches and mark sameunique and volatility
+ * flag.
+ *
+ * Don't throw any error here just mark the relation entry as not sameunique or
+ * FUNCTION_NONIMMUTABLE as we only check these in apply background worker.
+ */
+static void
+logicalrep_rel_mark_safe_in_apply_bgworker(LogicalRepRelMapEntry *entry)

SUGGESTION
Check if unique index/constraint matches and assign 'sameunique' flag.
Check if there are any non-immutable functions and assign the
'volatility' flag. Note: Don't throw any error here - these flags will
be checked in the apply background worker.

~~~

3.12 src/backend/replication/logical/relation.c -
logicalrep_rel_mark_safe_in_apply_bgworker

I did not really understand why you used an enum for one flag
(volatility) but not the other one (sameunique); shouldn’t both of
these be tri-values: unknown/yes/no?

E.g. there is a quick exit from this function for FUNCTION_UNKNOWN,
but there is no equivalent quick exit for sameunique. It seems
inconsistent.

~~~

3.13 src/backend/replication/logical/relation.c -
logicalrep_rel_mark_safe_in_apply_bgworker

+ /*
+ * Check whether there is any non-immutable function in the local table.
+ *
+ * a. The function in triggers;
+ * b. Column default value expressions and domain constraints;
+ * c. Constraint expressions;
+ * d. Foreign keys.
+ */

SUGGESTION
* Check if there is any non-immutable function in the local table.
* Look for functions in the following places:
* a. trigger functions
* b. Column default value expressions and domain constraints
* c. Constraint expressions
* d. Foreign keys

~~~

3.14 src/backend/replication/logical/relation.c -
logicalrep_rel_mark_safe_in_apply_bgworker

There are lots of places setting FUNCTION_NONIMMUTABLE, so I think
this code might be tidier if you just have a single return at the end
of this function and 'goto' it.

e.g.
if (...)
goto function_not_immutable;

...

return;

function_not_immutable:
entry->volatility = FUNCTION_NONIMMUTABLE;
======

3.15 src/backend/replication/logical/worker.c - apply_handle_stream_stop

+ /*
+ * Unlike stream_commit, we don't need to wait here for stream_stop to
+ * finish. Allowing the other transaction to be applied before stream_stop
+ * is finished can only lead to failures if the unique index/constraint is
+ * different between publisher and subscriber. But for such cases, we don't
+ * allow streamed transactions to be applied in parallel. See
+ * apply_bgworker_relation_check.
+ */

"can only lead to failures" -> "can lead to failures"

~~~

3.16 src/backend/replication/logical/worker.c - apply_handle_tuple_routing

@@ -2534,13 +2548,14 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
  }
  MemoryContextSwitchTo(oldctx);

+ part_entry = logicalrep_partition_open(relmapentry, partrel,
+    attrmap);
+
+ apply_bgworker_relation_check(part_entry);
+
  /* Check if we can do the update or delete on the leaf partition. */
  if (operation == CMD_UPDATE || operation == CMD_DELETE)
- {
- part_entry = logicalrep_partition_open(relmapentry, partrel,
-    attrmap);
  check_relation_updatable(part_entry);
- }

Perhaps the apply_bgworker_relation_check(part_entry); should be done
AFTER the CMD_UPDATE/CMD_DELETE check because then it will not change
the existing errors for those cases.

======

3.17 src/backend/utils/cache/typcache.c

+/*
+ * GetDomainConstraints --- get DomainConstraintState list of
specified domain type
+ */
+List *
+GetDomainConstraints(Oid type_id)

This is an unusual-looking function header comment, with the function
name and the "---".

======

3.18 src/include/replication/logicalrelation.h

+/*
+ * States to determine volatility of the function in expressions in one
+ * relation.
+ */
+typedef enum RelFuncVolatility
+{
+ FUNCTION_UNKNOWN = 0, /* initializing  */
+ FUNCTION_IMMUTABLE, /* all functions are immutable function */
+ FUNCTION_NONIMMUTABLE /* at least one non-immutable function */
+} RelFuncVolatility;
+

I think the comments can be improved, and also the values can be more
self-explanatory. e.g.

typedef enum RelFuncVolatility
{
FUNCTION_UNKNOWN_IMMUTABLE, /* unknown */
FUNCTION_ALL_IMMUTABLE, /* all functions are immutable */
FUNCTION_NOT_ALL_IMMUTABLE /* not all functions are immutable */
} RelFuncVolatility;

~~~

3.18

RelFuncVolatility should be added to typedefs.list

~~~

3.19

@@ -31,6 +42,11 @@ typedef struct LogicalRepRelMapEntry
  Relation localrel; /* relcache entry (NULL when closed) */
  AttrMap    *attrmap; /* map of local attributes to remote ones */
  bool updatable; /* Can apply updates/deletes? */
+ bool sameunique; /* Are all unique columns of the local
+    relation contained by the unique columns in
+    remote? */

(This is similar to review comment 3.12)

I felt it was inconsistent for this to be a boolean but for the
'volatility' member to be an enum. AFAIK these 2 flags are similar
kinds – e.g. essentially tri-state flags unknown/true/false so I
thought they should be treated the same.  E.g. both enums?

~~~

3.20

+ RelFuncVolatility volatility; /* all functions in localrel are
+    immutable function? */

SUGGESTION
/* Indicator of local relation function volatility */

======

3.21 .../subscription/t/022_twophase_cascade.pl

+ if ($streaming_mode eq 'parallel')
+ {
+ $node_C->safe_psql(
+ 'postgres', "
+ ALTER TABLE test_tab ALTER c DROP DEFAULT");
+ }
+

Indentation of the ALTER does not seem right.

======

3.22 .../subscription/t/032_streaming_apply.pl

3.22.a
+# Setup structure on publisher

"structure"?

3.22.b
+# Setup structure on subscriber

"structure"?

~~~

3.23

+# Check that a background worker starts if "streaming" option is specified as
+# "parallel".  We have to look for the DEBUG1 log messages about that, so
+# temporarily bump up the log verbosity.
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->reload;
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1,
5000) s(i)"
+);
+
+$node_subscriber->wait_for_log(qr/\[Apply BGW #\d+\] started/, 0);
+$node_subscriber->append_conf('postgresql.conf',
+ "log_min_messages = warning");
+$node_subscriber->reload;

I didn't really think it was necessary to bump this log level just to
verify that the bgworker has started, because this test is anyway going
to ensure that the ERROR "cannot replicate relation with different
unique index" happens, and that already implicitly ensures the
bgworker was used.

~~~

3.24

+# Then we check the unique index on partition table.
+$node_subscriber->safe_psql(
+ 'postgres', qq{
+CREATE TRIGGER insert_trig
+BEFORE INSERT ON test_tab_partition
+FOR EACH ROW EXECUTE PROCEDURE trigger_func();
+ALTER TABLE test_tab_partition ENABLE REPLICA TRIGGER insert_trig;
+});

Looks like the wrong comment. I think it should say something like
"Check the trigger on the partition table."

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Below are some review comments for patch v14-0004:

========
v14-0004
========

4.0 General.

This comment is an after-thought but as I write this mail I am
wondering if most of this 0004 patch is even necessary at all? Instead
of introducing a new column and all the baggage that goes with it,
can't the same functionality be achieved just by toggling the
streaming mode 'substream' value from 'p' (parallel) to 't' (on)
whenever an error occurs causing a retry? Anyway, if you do change it
this way then most of the following comments can be disregarded.
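
e.g. a very rough (untested) sketch of the sort of thing I mean - the function
name is made up, and it is modelled loosely on other places that update
pg_subscription:
--
static void
retry_apply_in_on_mode(void)
{
    Relation    rel;
    HeapTuple   tup;
    Datum       values[Natts_pg_subscription];
    bool        nulls[Natts_pg_subscription];
    bool        replaces[Natts_pg_subscription];

    StartTransactionCommand();

    rel = table_open(SubscriptionRelationId, RowExclusiveLock);
    tup = SearchSysCacheCopy1(SUBSCRIPTIONOID,
                              ObjectIdGetDatum(MySubscription->oid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for subscription %u",
             MySubscription->oid);

    memset(values, 0, sizeof(values));
    memset(nulls, false, sizeof(nulls));
    memset(replaces, false, sizeof(replaces));

    /* Fall back from 'parallel' to 'on' for the next attempt. */
    values[Anum_pg_subscription_substream - 1] = CharGetDatum(SUBSTREAM_ON);
    replaces[Anum_pg_subscription_substream - 1] = true;

    tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
                            replaces);
    CatalogTupleUpdate(rel, &tup->t_self, tup);
    heap_freetuple(tup);

    table_close(rel, RowExclusiveLock);

    CommitTransactionCommand();
}
--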


======

4.1 Commit message

Patch needs an explanatory commit message. Currently, there is nothing.

======

4.2 doc/src/sgml/catalogs.sgml

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subretry</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will not try to apply streaming transaction
+       in <literal>parallel</literal> mode. See
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para></entry>
+     </row>

I think it is overkill to mention anything about streaming=parallel
here because IIUC it has nothing to do with this field at all. I
thought you really only need to say something brief like:

SUGGESTION:
True if the previous apply change failed and a retry was required.

======

4.3 doc/src/sgml/ref/create_subscription.sgml

@@ -244,6 +244,10 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
           is that the unique column should be the same between publisher and
           subscriber; the second is that there should not be any non-immutable
           function in subscriber-side replicated table.
+          When applying a streaming transaction, if either requirement is not
+          met, the background worker will exit with an error. And when retrying
+          later, we will try to apply this transaction in <literal>on</literal>
+          mode.
          </para>

I did not think it is good to say "we" in the docs.

SUGGESTION
When applying a streaming transaction, if either requirement is not
met, the background worker will exit with an error. Parallel mode is
disregarded when retrying; instead the transaction will be applied
using <literal>streaming = on</literal>.

======

4.4 .../replication/logical/applybgworker.c

+ /*
+ * We don't start new background worker if retry was set as it's possible
+ * that the last time we tried to apply a transaction in background worker
+ * and the check failed (see function apply_bgworker_relation_check). So
+ * we will try to apply this transaction in apply worker.
+ */

SUGGESTION (simplified, and remove "we")
Don't use apply background workers for retries, because it is possible
that the last time we tried to apply a transaction using an apply
background worker the checks failed (see function
apply_bgworker_relation_check).

~~~

4.5

+ elog(DEBUG1, "retry to apply an streaming transaction in apply "
+ "background worker");

IMO the log message is too confusing

SUGGESTION
"apply background workers are not used for retries"

======

4.6 src/backend/replication/logical/worker.c

4.6.a - apply_handle_commit

+ /* Set the flag that we will not retry later. */
+ set_subscription_retry(false);

But the comment is wrong, isn't it? Shouldn't it just say that we are
not *currently* retrying? And can't this anyway be made redundant if
the catalog column has a DEFAULT value of false?

4.6.b - apply_handle_prepare
Ditto

4.6.c - apply_handle_commit_prepared
Ditto

4.6.d - apply_handle_rollback_prepared
Ditto

4.6.e - apply_handle_stream_prepare
Ditto

4.6.f - apply_handle_stream_abort
Ditto

4.6.g - apply_handle_stream_commit
Ditto

~~~

4.7 src/backend/replication/logical/worker.c

4.7.a - start_table_sync

@@ -3894,6 +3917,9 @@ start_table_sync(XLogRecPtr *origin_startpos,
char **myslotname)
  }
  PG_CATCH();
  {
+ /* Set the flag that we will retry later. */
+ set_subscription_retry(true);

Maybe this should say more like "Flag that the next apply will be the
result of a retry"

4.7.b - start_apply
Ditto

~~~

4.8 src/backend/replication/logical/worker.c - set_subscription_retry

+
+/*
+ * Set subretry of pg_subscription catalog.
+ *
+ * If retry is true, subscriber is about to exit with an error. Otherwise, it
+ * means that the changes was applied successfully.
+ */
+static void
+set_subscription_retry(bool retry)

"changes" -> "change" ?

~~~

4.8 src/backend/replication/logical/worker.c - set_subscription_retry

Isn't this flag only ever used when streaming=parallel? But it does
not seem to be checking that anywhere before potentially executing all
this code, which maybe will never be used.

======

4.9 src/include/catalog/pg_subscription.h

@@ -76,6 +76,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
BKI_SHARED_RELATION BKI_ROW
  bool subdisableonerr; /* True if a worker error should cause the
  * subscription to be disabled */

+ bool subretry; /* True if the previous apply change failed. */

I was wondering if you can give this column a DEFAULT value of false,
because then perhaps most of the patch code from worker.c may be able
to be eliminated.
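
For illustration, a minimal sketch of what that could amount to in practice
(just a sketch; the Anum_pg_subscription_subretry attribute number is whatever
the patch's catalog change generates, and the surrounding pattern follows how
CreateSubscription() fills in the other columns):

```c
/* In CreateSubscription(), alongside the other column assignments, seed the
 * new column to false so worker.c never has to "reset" it for the
 * non-retry case. */
values[Anum_pg_subscription_subretry - 1] = BoolGetDatum(false);
```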

======

4.10 .../subscription/t/032_streaming_apply.pl

I felt that the test cases all seem to blend together. IMO it will be
more readable if the main test parts are visually separated,

e.g. using a comment like:
# =================================================


------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Tue, Jun 28, 2022 11:22 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> I also improved patches as suggested by Peter-san in [1] and [2].
> Thanks for Shi Yu to improve the patches by addressing the comments in [2].
> 
> Attach the new patches.
> 

Thanks for updating the patch.

Here are some comments.

0001 patch
==============
1.
+    /* Check If there are free worker slot(s) */
+    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

I think "Check If" should be "Check if".

0003 patch
==============
1.
Should we call apply_bgworker_relation_check() in apply_handle_truncate()?

0004 patch
==============
1.
@@ -3932,6 +3958,9 @@ start_apply(XLogRecPtr origin_startpos)
     }
     PG_CATCH();
     {
+        /* Set the flag that we will retry later. */
+        set_subscription_retry(true);
+
         if (MySubscription->disableonerr)
             DisableSubscriptionAndExit();
        else

I think we need to emit the error and recover from the error state before
setting the retry flag, like what we do in DisableSubscriptionAndExit().
Otherwise if an error is detected when setting the retry flag, we won't get the
error message reported by the apply worker.
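
To make the suggested ordering concrete, a rough sketch (function names as in
the existing worker.c; whether DisableSubscriptionAndExit() would then skip its
own error-reporting steps is glossed over here):

```c
PG_CATCH();
{
    /*
     * Report the original error and recover from the error state first,
     * like DisableSubscriptionAndExit() does, so that a failure while
     * updating the catalog cannot hide the apply worker's error message.
     */
    HOLD_INTERRUPTS();
    EmitErrorReport();
    AbortOutOfAnyTransaction();
    FlushErrorState();
    RESUME_INTERRUPTS();

    /* Only now record that the next apply will be a retry. */
    set_subscription_retry(true);

    if (MySubscription->disableonerr)
        DisableSubscriptionAndExit();

    /* ... existing exit handling follows ... */
}
PG_END_CATCH();
```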

Regards,
Shi yu

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jul 1, 2022 at 14:43 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Below are some review comments for patches v14-0001, and v14-0002:

Thanks for your comments.

> 1.10 .../replication/logical/applybgworker.c - apply_bgworker_find
> 
> + /*
> + * Find entry for requested transaction.
> + */
> + entry = hash_search(ApplyWorkersHash, &xid, HASH_FIND, &found);
> + if (found)
> + {
> + entry->wstate->pstate->status = APPLY_BGWORKER_BUSY;
> + return entry->wstate;
> + }
> + else
> + return NULL;
> +}
> 
> IMO it is an unexpected side-effect for the function called "find" to
> be also modifying the thing that it found. IMO this setting BUSY
> should either be done by the caller, or else this function name should
> be renamed to make it obvious that this is doing more than just
> "finding" something.

Since we set the state to BUSY in the function apply_bgworker_start, and the
state is not modified (set to FINISHED) until the transaction completes, I
think we do not need to set the state to BUSY again in the function
apply_bgworker_find while applying the transaction.
So I removed that assignment and added an Assert there instead.
I also added an Assert in the function apply_bgworker_start.

> 1.16. src/backend/replication/logical/launcher.c - logicalrep_worker_launch
> 
> + bool is_subworker = (subworker_dsm != DSM_HANDLE_INVALID);
> +
> + /* We don't support table sync in subworker */
> + Assert(!(is_subworker && OidIsValid(relid)));
> 
> I'm not sure the comment is good. It sounds like it is something that
> might be possible but is just current "not supported". In fact, I
> thought this is really just a sanity check because the combination of
> those params is just plain wrong isn't it? Maybe a better comment is
> just:
> /* Sanity check */

Improved this comment as follows:
```
/* Sanity check : we don't support table sync in subworker. */
```

> 1.22 src/backend/replication/logical/worker.c - skip_xact_finish_lsn
> 
>  /*
>   * We enable skipping all data modification changes (INSERT, UPDATE, etc.) for
>   * the subscription if the remote transaction's finish LSN matches
> the subskiplsn.
>   * Once we start skipping changes, we don't stop it until we skip all
> changes of
>   * the transaction even if pg_subscription is updated and
> MySubscription->skiplsn
> - * gets changed or reset during that. Also, in streaming transaction cases, we
> - * don't skip receiving and spooling the changes since we decide whether or not
> + * gets changed or reset during that. Also, in streaming transaction
> cases (streaming = on),
> + * we don't skip receiving and spooling the changes since we decide
> whether or not
>   * to skip applying the changes when starting to apply changes. The
> subskiplsn is
>   * cleared after successfully skipping the transaction or applying non-empty
>   * transaction. The latter prevents the mistakenly specified subskiplsn from
> - * being left.
> + * being left. Note that we cannot skip the streaming transaction in parallel
> + * mode, because we cannot get the finish LSN before applying the changes.
>   */
> 
> "in parallel mode, because" -> "in 'streaming = parallel' mode, because"

Not sure about this.

> 1.28 src/backend/replication/logical/worker.c - apply_handle_stream_prepare
> 
> + if (wstate)
> + {
> + apply_bgworker_send_data(wstate, s->len, s->data);
> +
> + /*
> + * Wait for apply background worker to finish. This is required to
> + * maintain commit order which avoids failures due to transaction
> + * dependencies and deadlocks.
> + */
> + apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
> + apply_bgworker_free(wstate);
> 
> I think maybe the comment can be changed slightly, and then it can
> move up one line to the top of this code block (above the 3
> statements). I think it will become more readable.
> 
> SUGGESTION
> After sending the data to the apply background worker, wait for that
> worker to finish. This is necessary to maintain commit order which
> avoids failures due to transaction dependencies and deadlocks.

I think it might be better to add a new comment before invoking the function
apply_bgworker_send_data, and I improved the comment wording as you suggested.
I changed this in the functions apply_handle_stream_prepare,
apply_handle_stream_abort and apply_handle_stream_commit. What do you think
about changing it like this:
```
/* Send STREAM PREPARE message to the apply background worker. */
apply_bgworker_send_data(wstate, s->len, s->data);

/*
 * After sending the data to the apply background worker, wait for
 * that worker to finish. This is necessary to maintain commit
 * order which avoids failures due to transaction dependencies and
 * deadlocks.
 */
apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
```

> 1.34 src/backend/replication/logical/worker.c - apply_dispatch
> 
> -
>  /*
>   * Logical replication protocol message dispatcher.
>   */
> -static void
> +void
>  apply_dispatch(StringInfo s)
> 
> Maybe removing the whitespace is not really needed as part of this patch?

Yes, this change is not strictly necessary for this patch.
But since it does not modify any comments or actual code, and only adjusts the
blank line between the function modified by this patch and the previous
function, I think it is okay to keep it in this patch.

> 2.1 Commit message
> 
> Change all TAP tests using the SUBSCRIPTION "streaming" option, so they
> now test both 'on' and 'parallel' values.
> 
> "option" -> "parameter"

Sorry I missed this point when I was merging the patches. I merged this change
in v15.

Attach the new patches.
Also improved the patches as suggested in [1], [2] and [3].

[1] - https://www.postgresql.org/message-id/CAA4eK1KgovaRcbSuzzWki1HVso6oLAdZ2aPr1nWxX1x%3DVDBQJg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut%2BPtRNAOwFtBp_TnDWdC7UpcTxPJzQnrm%3DNytN7cVBt5zRQ%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAHut%2BPvrw%2BtgCEYGxv%2BnKrqg-zbJdYEXee6o4irPAsYoXcuUcw%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jul 1, 2022 at 17:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Thanks for your comments.

> On Fri, Jul 1, 2022 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > ======
> >
> > 1.2 doc/src/sgml/protocol.sgml - Protocol constants
> >
> > Previously I wrote that since there are protocol changes here,
> > shouldn’t there also be some corresponding LOGICALREP_PROTO_XXX
> > constants and special checking added in the worker.c?
> >
> > But you said [1 comment #6] you think it is OK because...
> >
> > IMO, I still disagree with the reply. The fact is that the protocol
> > *has* been changed, so IIUC that is precisely the reason for having
> > those protocol constants.
> >
> > e.g I am guessing you might assign the new one somewhere here:
> > --
> >     server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> >     options.proto.logical.proto_version =
> >         server_version >= 150000 ?
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> >         server_version >= 140000 ?
> LOGICALREP_PROTO_STREAM_VERSION_NUM :
> >         LOGICALREP_PROTO_VERSION_NUM;
> > --
> >
> > And then later you would refer to this new protocol version (instead
> > of the server version) when calling to the apply_handle_stream_abort
> > function.
> >
> > ======
> >
> 
> One point related to this that occurred to me is how it will behave if
> the publisher is of version >=16 whereas the subscriber is of versions
> <=15? Won't in that case publisher sends the new fields but
> subscribers won't be reading those which may cause some problems.

Makes sense. Fixed this point.
As Peter-san suggested, I added a new protocol macro
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM.
This new macro marks the protocol version that supports the apply background
worker (i.e. the subscriber will read abort_lsn and abort_time). And the
publisher sends the abort_lsn and abort_time fields only if the subscriber
will read them. (see function logicalrep_write_stream_abort)

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Jul 4, 2022 at 12:12 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Below are some review comments for patch v14-0003:

Thanks for your comments.

> 3.1 Commit message
> 
> If any of the following checks are violated, an error will be reported.
> 1. The unique columns between publisher and subscriber are difference.
> 2. There is any non-immutable function present in expression in
> subscriber's relation. Check from the following 4 items:
>    a. The function in triggers;
>    b. Column default value expressions and domain constraints;
>    c. Constraint expressions.
>    d. The foreign keys.
> 
> SUGGESTION (rewording to match the docs and the code).
> 
> Add some checks before using apply background worker to apply changes.
> streaming=parallel mode has two requirements:
> 1) The unique columns must be the same between publisher and subscriber
> 2) There cannot be any non-immutable functions in the subscriber-side
> replicated table. Look for functions in the following places:
> * a. Trigger functions
> * b. Column default value expressions and domain constraints
> * c. Constraint expressions
> * d. Foreign keys
> 
> ======
> 
> 3.2 doc/src/sgml/ref/create_subscription.sgml
> 
> +          To run in this mode, there are following two requirements. The first
> +          is that the unique column should be the same between publisher and
> +          subscriber; the second is that there should not be any non-immutable
> +          function in subscriber-side replicated table.
> 
> SUGGESTION
> Parallel mode has two requirements: 1) the unique columns must be the
> same between publisher and subscriber; 2) there cannot be any
> non-immutable functions in the subscriber-side replicated table.

I did not write clearly enough before. So I made some slight modifications to
the first requirement you suggested. Like this:
```
1) The unique column in the relation on the subscriber-side should also be the
unique column on the publisher-side;
```

> 3.9 src/backend/replication/logical/proto.c - logicalrep_write_attrs
> 
> This big slab of new code to get the BMS looks very similar to
> RelationGetIdentityKeyBitmap. So perhaps this code should be
> encapsulated in another function like that one (in relcache.c?) and
> then just called from logicalrep_write_attrs

I think the file relcache.c should contain cache-build operations, and the code
I added doesn't have this operation. So I didn't change.

> 3.12 src/backend/replication/logical/relation.c -
> logicalrep_rel_mark_safe_in_apply_bgworker
> 
> I did not really understand why you used an enum for one flag
> (volatility) but not the other one (sameunique); shouldn’t both of
> these be tri-values: unknown/yes/no?
> 
> For E.g. there is a quick exit from this function if the
> FUNCTION_UNKNOWN, but there is no equivalent quick exit for the
> sameunique? It seems inconsistent.

After rethinking patch 0003, I think we only need one flag. So I merged flags
'volatility' and 'sameunique' into a new flag 'parallel'. It is a tri-state
flag. And I also made some related modifications.

> 3.14 src/backend/replication/logical/relation.c -
> logicalrep_rel_mark_safe_in_apply_bgworker
> 
> There are lots of places setting FUNCTION_NONIMMUTABLE, so I think
> this code might be tidier if you just have a single return at the end
> of this function and 'goto' it.
> 
> e.g.
> if (...)
> goto function_not_immutable;
> 
> ...
> 
> return;
> 
> function_not_immutable:
> entry->volatility = FUNCTION_NONIMMUTABLE;

Personally, I do not like to use the `goto` syntax if it is not necessary,
because the `goto` syntax will forcibly change the flow of code execution.

> 3.17 src/backend/utils/cache/typcache.c
> 
> +/*
> + * GetDomainConstraints --- get DomainConstraintState list of
> specified domain type
> + */
> +List *
> +GetDomainConstraints(Oid type_id)
> 
> This is an unusual-looking function header comment, with the function
> name and the "---".

Not sure about this. Please refer to function lookup_rowtype_tupdesc_internal.

> 3.19
> 
> @@ -31,6 +42,11 @@ typedef struct LogicalRepRelMapEntry
>   Relation localrel; /* relcache entry (NULL when closed) */
>   AttrMap    *attrmap; /* map of local attributes to remote ones */
>   bool updatable; /* Can apply updates/deletes? */
> + bool sameunique; /* Are all unique columns of the local
> +    relation contained by the unique columns in
> +    remote? */
> 
> (This is similar to review comment 3.12)
> 
> I felt it was inconsistent for this to be a boolean but for the
> 'volatility' member to be an enum. AFAIK these 2 flags are similar
> kinds – e.g. essentially tri-state flags unknown/true/false so I
> thought they should be treated the same.  E.g. both enums?

Please refer to the reply to #3.12.

> 3.22 .../subscription/t/032_streaming_apply.pl
> 
> 3.22.a
> +# Setup structure on publisher
> 
> "structure"?
> 
> 3.22.b
> +# Setup structure on subscriber
> 
> "structure"?

This just follows the wording used in other subscription tests.

> 3.23
> 
> +# Check that a background worker starts if "streaming" option is specified as
> +# "parallel".  We have to look for the DEBUG1 log messages about that, so
> +# temporarily bump up the log verbosity.
> +$node_subscriber->append_conf('postgresql.conf', "log_min_messages =
> debug1");
> +$node_subscriber->reload;
> +
> +$node_publisher->safe_psql('postgres',
> + "INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1,
> 5000) s(i)"
> +);
> +
> +$node_subscriber->wait_for_log(qr/\[Apply BGW #\d+\] started/, 0);
> +$node_subscriber->append_conf('postgresql.conf',
> + "log_min_messages = warning");
> +$node_subscriber->reload;
> 
> I didn't really think it was necessary to bump this log level, and to
> verify that the bgworker is started, because this test is anyway going
> to ensure that the ERROR "cannot replicate relation with different
> unique index" happens, so that is already implicitly ensuring the
> bgworker was used.

Since it takes almost no time, I think a more detailed confirmation is fine.

The rest of the comments are improved as suggested.
The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Jul 4, 2022 at 14:47 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Below are some review comments for patch v14-0004:

Thanks for your comments.

> 4.0 General.
> 
> This comment is an after-thought but as I write this mail I am
> wondering if most of this 0004 patch is even necessary at all? Instead
> of introducing a new column and all the baggage that goes with it,
> can't the same functionality be achieved just by toggling the
> streaming mode 'substream' value from 'p' (parallel) to 't' (on)
> whenever an error occurs causing a retry? Anyway, if you do change it
> this way then most of the following comments can be disregarded.

In the approach that you mentioned, after retrying, the transaction will always
be applied in "on" mode. This will change the user's setting.
That is to say, in most cases the user would need to manually reset the
"streaming" parameter to "parallel". I think that might not be very friendly.

> 4.6 src/backend/replication/logical/worker.c
> 
> 4.6.a - apply_handle_commit
> 
> + /* Set the flag that we will not retry later. */
> + set_subscription_retry(false);
> 
> But the comment is wrong, isn't it? Shouldn’t it just say that we are
> not *currently* retrying? And can’t this just anyway be redundant if
> only the catalog column has a DEFAULT value of false?
> 
> 4.6.b - apply_handle_prepare
> Ditto
> 
> 4.6.c - apply_handle_commit_prepared
> Ditto
> 
> 4.6.d - apply_handle_rollback_prepared
> Ditto
> 
> 4.6.e - apply_handle_stream_prepare
> Ditto
> 
> 4.6.f - apply_handle_stream_abort
> Ditto
> 
> 4.6.g - apply_handle_stream_commit
> Ditto

I set the default value of the field "subretry" to "false" as you suggested.
We still need to reset this field to false after retrying to apply a streaming
transaction in the main apply worker ("on" mode).
I agree the comment was not clear, so I changed it to:
```
Reset the retry flag.
```

> 4.7 src/backend/replication/logical/worker.c
> 
> 4.7.a - start_table_sync
> 
> @@ -3894,6 +3917,9 @@ start_table_sync(XLogRecPtr *origin_startpos,
> char **myslotname)
>   }
>   PG_CATCH();
>   {
> + /* Set the flag that we will retry later. */
> + set_subscription_retry(true);
> 
> Maybe this should say more like "Flag that the next apply will be the
> result of a retry"
> 
> 4.7.b - start_apply
> Ditto

Similar to the reply in #4.6, I changed it to `Set the retry flag.`.

> 4.8 src/backend/replication/logical/worker.c - set_subscription_retry
> 
> +
> +/*
> + * Set subretry of pg_subscription catalog.
> + *
> + * If retry is true, subscriber is about to exit with an error. Otherwise, it
> + * means that the changes was applied successfully.
> + */
> +static void
> +set_subscription_retry(bool retry)
> 
> "changes" -> "change" ?

I did not make it clear before.
I modified "changes" to "transaction".

> 4.8 src/backend/replication/logical/worker.c - set_subscription_retry
> 
> Isn't this flag only every used when streaming=parallel? But it does
> not seem ot be checking that anywhere before potentiall executing all
> this code when maybe will never be used.

Yes, currently this field is only checked by apply background worker.

> 4.9 src/include/catalog/pg_subscription.h
> 
> @@ -76,6 +76,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
>   bool subdisableonerr; /* True if a worker error should cause the
>   * subscription to be disabled */
> 
> + bool subretry; /* True if the previous apply change failed. */
> 
> I was wondering if you can give this column a DEFAULT value of false,
> because then perhaps most of the patch code from worker.c may be able
> to be eliminated.

Please refer to the reply to #4.6.

The rest of the comments are improved as suggested.
The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jul 7, 2022 at 11:44 AM I wrote:
> Attach the new patches.

I found a failure on CFbot [1], which after investigation I think is due to my
previous modification (see response to #1.10 in [2]).

For a streaming transaction, if the apply background worker fails while
applying the first chunk of streamed changes for that transaction, it sets its
status to APPLY_BGWORKER_EXIT.
When the main apply worker then looks up that apply background worker in the
function apply_bgworker_find while processing the second chunk of streamed
changes for the same transaction, it finds the status APPLY_BGWORKER_EXIT, so
the following assertion fails:
```
Assert(status == APPLY_BGWORKER_BUSY);
```

To fix this, before reaching the assertion I now try to detect the failure of
the apply background worker: if the status is APPLY_BGWORKER_EXIT, we exit with
an error.

I also made some other small improvements.

Attach the new patches.

[1] - https://cirrus-ci.com/task/6383178511286272?logs=test_world#L2636
[2] -
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

Attachment
Below are my review comments for the v16* patch set:

========
v16-0001
========

1.0 <general>

There are places (comments, docs, errmsgs, etc) in the patch referring
to "parallel mode". I think every one of those references should be
found and renamed to "parallel streaming mode" or "streaming=parallel"
or at the very least make sure that "streaming" is in the same
sentence. IMO it's too vague just saying "parallel" without also
saying the context is for the "streaming" parameter.

I have commented on some of those examples below, but please search
everything anyway (including the docs) to catch the ones I haven't
explicitly mentioned.

======

1.1 src/backend/commands/subscriptioncmds.c

+defGetStreamingMode(DefElem *def)
+{
+ /*
+ * If no value given, assume "true" is meant.
+ */

Please fix this comment to be identical to this pushed patch [1]

======

1.2 .../replication/logical/applybgworker.c - apply_bgworker_start

+ if (list_length(ApplyWorkersFreeList) > 0)
+ {
+ wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList);
+ ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList);
+ Assert(wstate->pstate->status == APPLY_BGWORKER_FINISHED);
+ }

The Assert that the entries in the free-list are FINISHED seems like
unnecessary checking. IIUC, code is already doing the Assert that
entries are FINISHED before allowing them into the free-list in the
first place.

~~~

1.3 .../replication/logical/applybgworker.c - apply_bgworker_find

+ if (found)
+ {
+ char status = entry->wstate->pstate->status;
+
+ /* If any workers (or the postmaster) have died, we have failed. */
+ if (status == APPLY_BGWORKER_EXIT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("background worker %u failed to apply transaction %u",
+ entry->wstate->pstate->n,
+ entry->wstate->pstate->stream_xid)));
+
+ Assert(status == APPLY_BGWORKER_BUSY);
+
+ return entry->wstate;
+ }

Why not remove that Assert but change the condition to be:

if (status != APPLY_BGWORKER_BUSY)
ereport(...)
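
i.e. fold the sanity check into the same error path, reusing the existing
message, something like:

```c
if (status != APPLY_BGWORKER_BUSY)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("background worker %u failed to apply transaction %u",
                    entry->wstate->pstate->n,
                    entry->wstate->pstate->stream_xid)));

return entry->wstate;
```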

======

1.4 src/backend/replication/logical/proto.c - logicalrep_write_stream_abort

@@ -1163,31 +1163,56 @@ logicalrep_read_stream_commit(StringInfo in,
LogicalRepCommitData *commit_data)
 /*
  * Write STREAM ABORT to the output stream. Note that xid and subxid will be
  * same for the top-level transaction abort.
+ *
+ * If write_abort_lsn is true, send the abort_lsn and abort_time fields.
+ * Otherwise not.
  */

"Otherwise not." -> ", otherwise don't."

~~~

1.5 src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

+ *
+ * If read_abort_lsn is true, try to read the abort_lsn and abort_time fields.
+ * Otherwise not.
  */
 void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
- TransactionId *subxid)
+logicalrep_read_stream_abort(StringInfo in,
+ LogicalRepStreamAbortData *abort_data,
+ bool read_abort_lsn)

"Otherwise not." -> ", otherwise don't."

======

1.6 src/backend/replication/logical/worker.c - file comment

+ * If streaming = parallel, We assign a new apply background worker (if
+ * available) as soon as the xact's first stream is received. The main apply

"We" -> "we" ... or maybe better just remove it completely.

~~~

1.7 src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /*
+ * After sending the data to the apply background worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ apply_bgworker_send_data(wstate, s->len, s->data);
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
+ apply_bgworker_free(wstate);

The comment should be changed how you had suggested [2], so that it
will be formatted the same way as a couple of other similar comments.

~~~

1.8 src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ read_abort_lsn = MyParallelState->server_version >= 160000;

This is handling decisions about read/write of the protocol bytes. I
feel like it would be better to check the server *protocol*
version (not the server postgres version) to make this decision – e.g.
this code should be using the new macro you introduced so it will end
up looking much like how the pgoutput_stream_abort code is doing it.
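
For example, something along these lines (only a sketch; it assumes the shared
field is renamed to 'proto_version' and carries the negotiated protocol version
rather than the publisher's server version):

```c
/* Decide whether the stream carries abort_lsn/abort_time based on the
 * negotiated protocol version, mirroring pgoutput_stream_abort(). */
if (am_apply_bgworker())
    read_abort_lsn = MyParallelState->proto_version >=
        LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM;
```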

~~~

1.9 src/backend/replication/logical/worker.c - store_flush_position

@@ -2636,6 +2999,10 @@ store_flush_position(XLogRecPtr remote_lsn)
 {
  FlushPosition *flushpos;

+ /* We only need to collect the LSN in main apply worker */
+ if (am_apply_bgworker())
+ return;
+

SUGGESTION
/* Skip if not the main apply worker */

======

1.10 src/backend/replication/pgoutput/pgoutput.c

@@ -1820,6 +1820,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
    XLogRecPtr abort_lsn)
 {
  ReorderBufferTXN *toptxn;
+ bool write_abort_lsn = false;
+ PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;

  /*
  * The abort should happen outside streaming block, even for streamed
@@ -1832,8 +1834,13 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,

  Assert(rbtxn_is_streamed(toptxn));

+ /* We only send abort_lsn and abort_time if the subscriber needs them. */
+ if (data->protocol_version >= LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)
+ write_abort_lsn = true;
+

IMO it's simpler to remove the declaration default assignment, and
instead this code can be written as:

write_abort_lsn = data->protocol_version >=
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM;

======

1.11 src/include/replication/logicalproto.h

+ *
+ * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
+ * with support for streaming large transactions in apply background worker.
+ * Introduced in PG16.

"in apply background worker" -> "using apply background workers"

~~~

1.12

+extern void logicalrep_read_stream_abort(StringInfo in,
+ LogicalRepStreamAbortData *abort_data,
+ bool include_abort_lsn);

I think the "include_abort_lsn" is now renamed to "include_abort_lsn".


========
v16-0002
========

No comments.


========
v16-0003
========

3.0 <general>

Same comment about "parallel mode" as in comment #1.0

======

3.1 doc/src/sgml/ref/create_subscription.sgml

+          the publisher-side; 2) there cannot be any non-immutable functions
+          in the subscriber-side replicated table.

The functions are not table data so maybe it's better to say
"functions in the ..." -> "functions used by the ...". If you change
this then there are equivalent comments and commit messages that
should change to match it.

======

3.2 .../replication/logical/applybgworker.c - apply_bgworker_relation_check

+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot replicate target relation \"%s.%s\" in parallel "
+ "mode", rel->remoterel.nspname, rel->remoterel.relname),
+ errdetail("The unique column on subscriber is not the unique "
+    "column on publisher or there is at least one "
+    "non-immutable function."),
+ errhint("Please change the streaming option to 'on' instead of
'parallel'.")));

3.2a
SUGGESTED errmsg
"cannot replicate target relation \"%s.%s\" using subscription
parameter streaming=parallel"

3.2b
SUGGESTED errhint
"Please change to use subscription parameter streaming=on"

3.3
The errcode seems the wrong one. Perhaps it should be
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE.

======

3.4 src/backend/replication/logical/proto.c - logicalrep_write_attrs

In [3] you wrote:
I think the file relcache.c should contain cache-build operations, and the code
I added doesn't have this operation. So I didn't change.

But I only gave relcache.c as an example. It can also be a new static
function in this same file, but anyway I still think this big slab of
code might be better if not done inline in logicalrep_write_attrs.

~~~

3.5 src/backend/replication/logical/proto.c - logicalrep_read_attrs

@@ -1012,11 +1062,14 @@ logicalrep_read_attrs(StringInfo in,
LogicalRepRelation *rel)
  {
  uint8 flags;

- /* Check for replica identity column */
+ /* Check for replica identity and unique column */
  flags = pq_getmsgbyte(in);
- if (flags & LOGICALREP_IS_REPLICA_IDENTITY)
+ if (flags & ATTR_IS_REPLICA_IDENTITY)
  attkeys = bms_add_member(attkeys, i);

+ if (flags & ATTR_IS_UNIQUE)
+ attunique = bms_add_member(attunique, i);

The code comment really applies to all 3 statements so maybe better
not to have the blank line here.

======

3.6 src/backend/replication/logical/relation.c - logicalrep_rel_mark_parallel

3.6.a
+ /* Fast path if we marked 'parallel' flag. */
+ if (entry->parallel != PARALLEL_APPLY_UNKNOWN)
+ return;

SUGGESTED
Fast path if 'parallel' flag is already known.

~

3.6.b
+ /* Initialize the flag. */
+ entry->parallel = PARALLEL_APPLY_SAFE;

I think it makes more sense if assigning SAFE is the very *last* thing
this function does – not the first thing.

~

3.6.c
+ /*
+ * First, we check if the unique column in the relation on the
+ * subscriber-side is also the unique column on the publisher-side.
+ */

"First, we check..." -> "First, check..."

~

3.6.d
+ /*
+ * Then, We check if there is any non-immutable function in the local
+ * table. Look for functions in the following places:


"Then, We check..." -> "Then, check"

~~~

3.7 src/backend/replication/logical/relation.c - logicalrep_rel_mark_parallel

From [3] you wrote:
Personally, I do not like to use the `goto` syntax if it is not necessary,
because the `goto` syntax will forcibly change the flow of code execution.

Yes, but OTOH readability is a major consideration too, and in this
function by simply saying goto parallel_unsafe; you can have 3 returns
instead of 7 returns, and it will take ~10 lines less code to do the
same functionality.
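
FWIW, the shape I have in mind is roughly the following (only a sketch; the
actual unique-column and immutable-function checks are collapsed into
placeholder conditions here):

```c
static void
logicalrep_rel_mark_parallel(LogicalRepRelMapEntry *entry)
{
    /* Fast path if the 'parallel' flag is already known. */
    if (entry->parallel != PARALLEL_APPLY_UNKNOWN)
        return;

    /* First, check the unique columns (placeholder condition). */
    if (!unique_columns_match)
        goto parallel_unsafe;

    /*
     * Then, check triggers, column defaults, constraints and foreign keys
     * for non-immutable functions (placeholder condition).
     */
    if (has_non_immutable_function)
        goto parallel_unsafe;

    entry->parallel = PARALLEL_APPLY_SAFE;
    return;

parallel_unsafe:
    entry->parallel = PARALLEL_APPLY_UNSAFE;
}
```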

======

3.8 src/include/replication/logicalrelation.h

+/*
+ * States to determine if changes on one relation can be applied by an apply
+ * background worker.
+ */
+typedef enum RelParallel
+{
+ PARALLEL_APPLY_UNKNOWN = 0, /* unknown  */
+ PARALLEL_APPLY_SAFE, /* Can apply changes in an apply background
+    worker */
+ PARALLEL_APPLY_UNSAFE /* Can not apply changes in an apply background
+    worker */
+} RelParallel;

3.8a
"can be applied by an apply background worker." -> "can be applied
using an apply background worker."

~

3.8b
The enum is described, and IMO the enum values are self-explanatory
now. So commenting them individually is not adding any useful
information. I think those comments can be removed.

~

3.8c
The RelParallel name does not have much meaning to it - there is
nothing really about that name that says it is related to validation
states. Maybe "ParallelSafety" or "ParalleApplySafety" or something
similar?

~~~

3.9 src/include/replication/logicalrelation.h

+ RelParallel parallel; /* Can apply changes in an apply
+    background worker? */

This comment is like #3.8c.

IMO the member name 'parallel' doesn't really have enough meaning.
What about something like 'parallel_apply', or 'parallel_ok', or
'parallel_safe', or something similar.

======

3.10 .../subscription/t/032_streaming_apply.pl

In [3] you wrote:
Since it takes almost no time, I think a more detailed confirmation is fine.

Yes, but I think a confirmation is a confirmation regardless - the
test will either pass/fail and this additional code won't change the
result. e.g. Maybe the extra code does not hurt much, but AFAIK having
a "detailed confirmation" doesn't really achieve anything useful
either. I previously suggested removing it simply because it means
less test code to maintain.

========
v16-0004
========

4.0 <general>

Same comment about "parallel mode" as in comment #1.0

======

4.1 Commit message

If the user sets the subscription_parameter "streaming" to "parallel", when
applying a streaming transaction, we will try to apply this transaction in
apply background worker. However, when the changes in this transaction cannot
be applied in apply background worker, the background worker will exit with an
error. In this case, we can retry applying this streaming transaction in "on"
mode. In this way, we may avoid blocking logical replication here.

So we introduce field "subretry" in catalog "pg_subscription". When the
subscriber exit with an error, we will try to set this flag to true, and when
the transaction is applied successfully, we will try to set this flag to false.

Then when we try to apply a streaming transaction in apply background worker,
we can see if this transaction has failed before based on the "subretry" field.

~

I reworded above to remove most of the "we" this and "we" that...

SUGGESTION
When the subscription parameter is set streaming=parallel, the logic
tries to apply the streaming transaction using an apply background
worker. If this fails the background worker exits with an error.

In this case, retry applying the streaming transaction using the
normal streaming=on mode. This is done to avoid getting caught in a
loop of the same retry errors.

A new flag field "subretry" has been introduced to catalog
"pg_subscription". If the subscriber exits with an error, this flag
will be set true, and whenever the transaction is applied
successfully, this flag is reset false. Now, when deciding how to
apply a streaming transaction, the logic can know if this transaction
has previously failed or not (by checking the "subretry" field).

======

4.2 doc/src/sgml/catalogs.sgml

+      <para>
+       True if the previous apply change failed and a retry was required.
+      </para></entry>

"was" required? "will be required"? It is a bit vague what tense to use...

SUGGESTION 1
True if the previous apply change failed, necessitating a retry

SUGGESTION 2
True if the previous apply change failed

======

4.3 doc/src/sgml/ref/create_subscription.sgml

+          <literal>parallel</literal> mode is disregarded when retrying;
+          instead the transaction will be applied using <literal>on</literal>
+          mode.

"on mode" etc sounds strange.

SUGGESTION
During the retry the streaming=parallel mode is ignored. The retried
transaction will be applied using streaming=on mode.

======

4.4 src/backend/replication/logical/worker.c - set_subscription_retry

+ if (MySubscription->retry == retry ||
+ am_apply_bgworker())
+ return;
+

Somehow I feel that this quick exit condition is not quite what it
seems. IIUC the purpose of this is really to avoid doing the tuple
updates if it is not necessary to do them. So if retry was already set
true then there is no need to update tuple to true again. So if retry
was already set false then there is no need to update the tuple to
false. But I just don't see how the (hypothetical) code below can work
as expected, because where is the code updating the value of
MySubscription->retry ???

set_subscription_retry(true);
set_subscription_retry(true);

I think at least there needs to be some detailed comments explaining
what this quick exit is really doing because my guess is that
currently it is not quite working as expected.

~~~

4.5

+ /* reset subretry */

Uppercase comment


------
[1] https://github.com/postgres/postgres/commit/8445f5a21d40b969673ca03918c74b4fbc882bf4
[2]
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[3]
https://www.postgresql.org/message-id/OS3PR01MB6275120502A4730AB9932FCA9E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jul 7, 2022 at 11:32 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote:
> Thanks for updating the patch.
> 
> Here are some comments.

Thanks for your comments.

> 0001 patch
> ==============
> 1.
> +    /* Check If there are free worker slot(s) */
> +    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 
> I think "Check If" should be "Check if".

Fixed.

> 0003 patch
> ==============
> 1.
> Should we call apply_bgworker_relation_check() in apply_handle_truncate()?

Because TRUNCATE blocks all other operations on the table, I think that when
two transactions on the publisher-side operate on the same table, at least one
of them will be blocked. So I think for this case the blocking will happen on
the publisher-side.

> 0004 patch
> ==============
> 1.
> @@ -3932,6 +3958,9 @@ start_apply(XLogRecPtr origin_startpos)
>      }
>      PG_CATCH();
>      {
> +        /* Set the flag that we will retry later. */
> +        set_subscription_retry(true);
> +
>          if (MySubscription->disableonerr)
>              DisableSubscriptionAndExit();
>          Else
> 
> I think we need to emit the error and recover from the error state before
> setting the retry flag, like what we do in DisableSubscriptionAndExit().
> Otherwise if an error is detected when setting the retry flag, we won't get the
> error message reported by the apply worker.

You are right.
I fixed this point as you suggested. (I moved the operations you mentioned from
the function DisableSubscriptionAndExit to before setting the retry flag.)
I also made a similar modification in the function start_table_sync.

Attach the new patches.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jul 13, 2022 at 13:49 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Below are my review comments for the v16* patch set:

Thanks for your comments.

> ========
> v16-0001
> ========
> 
> 1.0 <general>
> 
> There are places (comments, docs, errmsgs, etc) in the patch referring
> to "parallel mode". I think every one of those references should be
> found and renamed to "parallel streaming mode" or "streaming=parallel"
> or at the very least match sure that "streaming" is in the same
> sentence. IMO it's too vague just saying "parallel" without also
> saying the context is for the "streaming" parameter.
> 
> I have commented on some of those examples below, but please search
> everything anyway (including the docs) to catch the ones I haven't
> explicitly mentioned.

I checked all the places in the patch where the word "parallel" is used (case
insensitive), and I think it is clear that the description relates to streamed
transactions. So I am not so sure about this. Could you please give me some
examples? I will improve them later.

> 1.2 .../replication/logical/applybgworker.c - apply_bgworker_start
> 
> + if (list_length(ApplyWorkersFreeList) > 0)
> + {
> + wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList);
> + ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList);
> + Assert(wstate->pstate->status == APPLY_BGWORKER_FINISHED);
> + }
> 
> The Assert that the entries in the free-list are FINISHED seems like
> unnecessary checking. IIUC, code is already doing the Assert that
> entries are FINISHED before allowing them into the free-list in the
> first place.

Just for robustness.

> 1.3 .../replication/logical/applybgworker.c - apply_bgworker_find
> 
> + if (found)
> + {
> + char status = entry->wstate->pstate->status;
> +
> + /* If any workers (or the postmaster) have died, we have failed. */
> + if (status == APPLY_BGWORKER_EXIT)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("background worker %u failed to apply transaction %u",
> + entry->wstate->pstate->n,
> + entry->wstate->pstate->stream_xid)));
> +
> + Assert(status == APPLY_BGWORKER_BUSY);
> +
> + return entry->wstate;
> + }
> 
> Why not remove that Assert but change the condition to be:
> 
> if (status != APPLY_BGWORKER_BUSY)
> ereport(...)

When I check "APPLY_BGWORKER_EXIT", I use the function "ereport" to report the
error, because "APPLY_BGWORKER_EXIT" is a possible use case.
But for "APPLY_BGWORKER_BUSY", this use case should not happen here. So I think
it's fine to only check this for developers when the compile option
"--enable-cassert" is specified.

> ========
> v16-0003
> ========
> 
> 3.0 <general>
> 
> Same comment about "parallel mode" as in comment #1.0
> 
> ======

Please refer to the reply to #1.0.

> 3.5 src/backend/replication/logical/proto.c - logicalrep_read_attrs
> 
> @@ -1012,11 +1062,14 @@ logicalrep_read_attrs(StringInfo in,
> LogicalRepRelation *rel)
>   {
>   uint8 flags;
> 
> - /* Check for replica identity column */
> + /* Check for replica identity and unique column */
>   flags = pq_getmsgbyte(in);
> - if (flags & LOGICALREP_IS_REPLICA_IDENTITY)
> + if (flags & ATTR_IS_REPLICA_IDENTITY)
>   attkeys = bms_add_member(attkeys, i);
> 
> + if (flags & ATTR_IS_UNIQUE)
> + attunique = bms_add_member(attunique, i);
> 
> The code comment really applies to all 3 statements so maybe better
> not to have the blank line here.

I think it looks a bit messy without the blank line.
So I tried to improve it to the following:
```
        /* Check for replica identity column */
        flags = pq_getmsgbyte(in);
        if (flags & ATTR_IS_REPLICA_IDENTITY)
            attkeys = bms_add_member(attkeys, i);

        /* Check for unique column */
        if (flags & ATTR_IS_UNIQUE)
            attunique = bms_add_member(attunique, i);
```

> 3.6 src/backend/replication/logical/relation.c - logicalrep_rel_mark_parallel
> 
> 3.6.a
> + /* Fast path if we marked 'parallel' flag. */
> + if (entry->parallel != PARALLEL_APPLY_UNKNOWN)
> + return;
> 
> SUGGESTED
> Fast path if 'parallel' flag is already known.
> 
> ~
> 
> 3.6.b
> + /* Initialize the flag. */
> + entry->parallel = PARALLEL_APPLY_SAFE;
> 
> I think it makes more sense if assigning SAFE is the very *last* thing
> this function does – not the first thing.
> 
> ~
> 
> 3.6.c
> + /*
> + * First, we check if the unique column in the relation on the
> + * subscriber-side is also the unique column on the publisher-side.
> + */
> 
> "First, we check..." -> "First, check..."
> 
> ~
> 
> 3.6.d
> + /*
> + * Then, We check if there is any non-immutable function in the local
> + * table. Look for functions in the following places:
> 
> 
> "Then, We check..." -> "Then, check"

=>3.6.a
=>3.6.c
=>3.6.d
Improved as suggested.

=>3.6.b
Not sure about this.

> 3.7 src/backend/replication/logical/relation.c - logicalrep_rel_mark_parallel
> 
> From [3] you wrote:
> Personally, I do not like to use the `goto` syntax if it is not necessary,
> because the `goto` syntax will forcibly change the flow of code execution.
> 
> Yes, but OTOH readability is a major consideration too, and in this
> function by simply saying goto parallel_unsafe; you can have 3 returns
> instead of 7 returns, and it will take ~10 lines less code to do the
> same functionality.

I am still not sure about this; I will change it if more people think `goto`
is better here.

> 4.3 doc/src/sgml/ref/create_subscription.sgml
> 
> +          <literal>parallel</literal> mode is disregarded when retrying;
> +          instead the transaction will be applied using <literal>on</literal>
> +          mode.
> 
> "on mode" etc sounds strange.
> 
> SUGGESTION
> During the retry the streaming=parallel mode is ignored. The retried
> transaction will be applied using streaming=on mode.

Since it is part of the streaming option documentation, I think it is fine to
directly say "<literal>parallel</literal> mode".

> 4.4 src/backend/replication/logical/worker.c - set_subscription_retry
> 
> + if (MySubscription->retry == retry ||
> + am_apply_bgworker())
> + return;
> +
> 
> Somehow I feel that this quick exit condition is not quite what it
> seems. IIUC the purpose of this is really to avoid doing the tuple
> updates if it is not necessary to do them. So if retry was already set
> true then there is no need to update tuple to true again. So if retry
> was already set false then there is no need to update the tuple to
> false. But I just don't see how the (hypothetical) code below can work
> as expected, because where is the code updating the value of
> MySubscription->retry ???
> 
> set_subscription_retry(true);
> set_subscription_retry(true);
> 
> I think at least there needs to be some detailed comments explaining
> what this quick exit is really doing because my guess is that
> currently it is not quite working as expected.

The subscription cache is updated in maybe_reread_subscription(), which is
invoked for every transaction. And we reset the retry flag at transaction end,
so it should be fine. Also, I think the quick-exit check code is similar to
that in clear_subscription_skip_lsn().
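
Maybe that reasoning just needs to be visible at the call site; a sketch of the
quick exit with the explanation folded into a comment:

```c
/*
 * Quick exit if the catalog value already matches. MySubscription is
 * refreshed by maybe_reread_subscription() for every transaction, so the
 * cached value is current enough for this check. Apply background workers
 * never update the flag themselves.
 */
if (MySubscription->retry == retry || am_apply_bgworker())
    return;
```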

Attach the new patches.

[1] - https://www.postgresql.org/message-id/CAHut%2BPv0yWynWTmp4o34s0d98xVubys9fy%3Dp0YXsZ5_sUcNnMw%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> Attach the news patches.

The patches could not be applied cleanly because of a change in HEAD
(366283961a). Therefore, I rebased the patches on the current HEAD.

Attach the new patches.

Regards,
Wang wei

Attachment
On Fri, Jul 22, 2022 at 8:26 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > Attach the news patches.
>
> Not able to apply patches cleanly because the change in HEAD (366283961a).
> Therefore, I rebased the patch based on the changes in HEAD.
>
> Attach the new patches.
>

Few comments on 0001:
======================
1.
-       <structfield>substream</structfield> <type>bool</type>
+       <structfield>substream</structfield> <type>char</type>
       </para>
       <para>
-       If true, the subscription will allow streaming of in-progress
-       transactions
+       Controls how to handle the streaming of in-progress transactions:
+       <literal>f</literal> = disallow streaming of in-progress transactions,
+       <literal>t</literal> = spill the changes of in-progress transactions to
+       disk and apply at once after the transaction is committed on the
+       publisher,
+       <literal>p</literal> = apply changes directly using a background worker

Shouldn't the description of 'p' be something like: apply changes
directly using a background worker, if available, otherwise, it
behaves the same as 't'

2.
Note that if an error happens when
+          applying changes in a background worker, the finish LSN of the
+          remote transaction might not be reported in the server log.

Is there any case where the finish LSN can be reported when applying via a
background worker? If not, then we should use 'won't' instead of
'might not'.

3.
+#define PG_LOGICAL_APPLY_SHM_MAGIC 0x79fb2447 // TODO Consider change

It is better to change this, as the same magic number is already used by
PG_TEST_SHM_MQ_MAGIC.

4.
+ /* Ignore statistics fields that have been updated. */
+ s.cursor += IGNORE_SIZE_IN_MESSAGE;

Can we change the comment to: "Ignore statistics fields that have been
updated by the main apply worker."? Will it be better to name the
define as "SIZE_STATS_MESSAGE"?

5.
+/* Apply Background Worker main loop */
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *shared)
{
...
...

+ apply_dispatch(&s);
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+ MemoryContextReset(ApplyMessageContext);

We should not process the config file under ApplyMessageContext. You
should switch context before processing the config file. See other
similar usages in the code.
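
In other words, something like the following ordering (a sketch; names as in
the patch hunk above):

```c
apply_dispatch(&s);

/* Leave ApplyMessageContext before doing anything unrelated to the
 * message that was just handled. */
MemoryContextSwitchTo(oldctx);
MemoryContextReset(ApplyMessageContext);

if (ConfigReloadPending)
{
    ConfigReloadPending = false;
    ProcessConfigFile(PGC_SIGHUP);
}
```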

6.
+/* Apply Background Worker main loop */
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *shared)
{
...
...
+ MemoryContextSwitchTo(oldctx);
+ MemoryContextReset(ApplyMessageContext);
+ }
+
+ MemoryContextSwitchTo(TopMemoryContext);
+ MemoryContextReset(ApplyContext);
...
}

I don't see the need to reset ApplyContext here as we don't do
anything in that context here.

-- 
With Regards,
Amit Kapila.



On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > Attach the news patches.
>
> Not able to apply patches cleanly because the change in HEAD (366283961a).
> Therefore, I rebased the patch based on the changes in HEAD.
>
> Attach the new patches.

+    /* Check the foreign keys. */
+    fkeys = RelationGetFKeyList(entry->localrel);
+    if (fkeys)
+        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;

So if there is a foreign key on any of the tables which are part of a
subscription, then we do not allow changes for that subscription to be
applied in parallel?  I think this is a big limitation, because having a
foreign key on a table is very normal, right?  I agree that if we
allow them then there could be failures due to out-of-order apply,
right? But IMHO we should not impose the restriction; instead, let it fail
if there is ever such a conflict, because if there is a conflict the
transaction will be sent again.  Do we see that there could be wrong
or inconsistent results if we allow such things to be executed in
parallel?  If not, then IMHO we are restricting very normal cases just to
avoid some corner-case failures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > Attach the news patches.
> >
> > Not able to apply patches cleanly because the change in HEAD (366283961a).
> > Therefore, I rebased the patch based on the changes in HEAD.
> >
> > Attach the new patches.
>
> +    /* Check the foreign keys. */
> +    fkeys = RelationGetFKeyList(entry->localrel);
> +    if (fkeys)
> +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
>
> So if there is a foreign key on any of the tables which are parts of a
> subscription then we do not allow changes for that subscription to be
> applied in parallel?  I think this is a big limitation because having
> foreign key on the table is very normal right?  I agree that if we
> allow them then there could be failure due to out of order apply
> right? but IMHO we should not put the restriction instead let it fail
> if there is ever such conflict.  Because if there is a conflict the
> transaction will be sent again.  Do we see that there could be wrong
> or inconsistent results if we allow such things to be executed in
> parallel.  If not then IMHO just to avoid some corner case failure we
> are restricting very normal cases.

some more comments..
1.
+            /*
+             * If we have found a free worker or if we are already
applying this
+             * transaction in an apply background worker, then we
pass the data to
+             * that worker.
+             */
+            if (first_segment)
+                apply_bgworker_send_data(stream_apply_worker, s->len, s->data);

The comment says that if we have found a free worker, or we are already
applying in a worker, then we pass the changes to that worker; but actually,
as per the code here, we only pass them in the first_segment case?

I think what you are trying to say is that if it is first segment then send the

2.
+        /*
+         * This is the main apply worker. Check if there is any free apply
+         * background worker we can use to process this transaction.
+         */
+        if (first_segment)
+            stream_apply_worker = apply_bgworker_start(stream_xid);
+        else
+            stream_apply_worker = apply_bgworker_find(stream_xid);

So currently, whenever we get a new streamed transaction we try to
start a new background worker for it.  Why do we need to start/close
the background apply worker every time we get a new streamed
transaction?  I mean, we could keep the worker in a pool for the time being,
and if there is a new transaction looking for a worker then we can
find one from that pool.  Starting a worker is a costly operation, and since
we are using parallelism here we are expecting frequent streamed
transactions that need a parallel apply worker, so why not let a worker
wait for a certain amount of time, so that if the load is low it will stop
anyway, and if the load is high it will be reused for the next streamed
transaction.


3.
Why are we restricting parallel apply workers to only the streamed
transactions? Streaming depends upon the size of logical_decoding_work_mem,
so making streaming and parallel apply tightly coupled seems too restrictive
to me.  Do we see some obvious problems in applying other transactions in
parallel?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Here are some review comment for patch v19-0001:

======

1.1 Missing docs for protocol version

Since you bumped the logical replication protocol version for this
patch, now there is some missing documentation to describe this new
protocol version. e.g. See here [1]

======

1.2 doc/src/sgml/config.sgml

+       <para>
+        Maximum number of apply background workers per subscription. This
+        parameter controls the amount of parallelism of the streaming of
+        in-progress transactions when subscription parameter
+        <literal>streaming = parallel</literal>.
+       </para>

SUGGESTION
Maximum number of apply background workers per subscription. This
parameter controls the amount of parallelism for streaming of
in-progress transactions with subscription parameter
<literal>streaming = parallel</literal>.

======

1.3 src/sgml/protocol.sgml

@@ -6809,6 +6809,25 @@ psql "dbname=postgres replication=database" -c
"IDENTIFY_SYSTEM;"
        </listitem>
       </varlistentry>

+      <varlistentry>
+       <term>Int64 (XLogRecPtr)</term>
+       <listitem>
+        <para>
+         The LSN of the abort.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Int64 (TimestampTz)</term>
+       <listitem>
+        <para>
+         Abort timestamp of the transaction. The value is in number
+         of microseconds since PostgreSQL epoch (2000-01-01).
+        </para>
+       </listitem>
+      </varlistentry>

There are missing notes on these new fields. They both should say
something like "This field is available since protocol version 4."
(See similar examples on the same docs page)

======

1.4 src/backend/replication/logical/applybgworker.c - apply_bgworker_start

Previously [1] I wrote:
> The Assert that the entries in the free-list are FINISHED seems like
> unnecessary checking. IIUC, code is already doing the Assert that
> entries are FINISHED before allowing them into the free-list in the
> first place.

IMO this Assert just causes unnecessary doubts, but if you really want
to keep it then I think it belongs logically *above* the
list_delete_last.
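
For example, a rough sketch of the placement I mean (names are taken from
the patch; the exact FINISHED status value is from memory, so adjust as
needed):

if (ApplyWorkersFreeList != NIL)
{
    wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList);

    /* Sanity-check the entry before taking it off the free-list. */
    Assert(wstate->shared->status == APPLY_BGWORKER_FINISHED);

    ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList);
}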

~~~

1.5 src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+ server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
+ wstate->shared->server_version =
+ server_version >= 160000 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
+ server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+ server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+ LOGICALREP_PROTO_VERSION_NUM;

It makes no sense to assign a protocol version to a server_version.
Perhaps it is just a simple matter of a poorly named struct member.
e.g. maybe everything is OK if it said something like
wstate->shared->proto_version.
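
i.e. just a sketch of what I mean (only the member name changes; the rest
is the patch's existing code):

server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
wstate->shared->proto_version =
    server_version >= 160000 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
    server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
    server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
    LOGICALREP_PROTO_VERSION_NUM;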

~~~

1.6 src/backend/replication/logical/applybgworker.c - LogicalApplyBgwLoop

+/* Apply Background Worker main loop */
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *shared)

'shared' seems a very vague param name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.7 src/backend/replication/logical/applybgworker.c - ApplyBgworkerMain

+/*
+ * Apply Background Worker entry point
+ */
+void
+ApplyBgworkerMain(Datum main_arg)
+{
+ volatile ApplyBgworkerShared *shared;

'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.8 src/backend/replication/logical/applybgworker.c - apply_bgworker_setup_dsm

+static void
+apply_bgworker_setup_dsm(ApplyBgworkerState *wstate)
+{
+ shm_toc_estimator e;
+ Size segsize;
+ dsm_segment *seg;
+ shm_toc    *toc;
+ ApplyBgworkerShared *shared;
+ shm_mq    *mq;

'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.9 src/backend/replication/logical/applybgworker.c - apply_bgworker_setup_dsm

+ server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
+ shared->server_version =
+ server_version >= 160000 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
+ server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+ server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+ LOGICALREP_PROTO_VERSION_NUM;

Same as earlier review comment #1.5

======

1.10 src/backend/replication/logical/worker.c

@@ -22,8 +22,28 @@
  * STREAMED TRANSACTIONS
  * ---------------------
  * Streamed transactions (large transactions exceeding a memory limit on the
- * upstream) are not applied immediately, but instead, the data is written
- * to temporary files and then applied at once when the final commit arrives.
+ * upstream) are applied using one of two approaches.
+ *
+ * 1) Separate background workers

"two approaches." --> "two approaches:"

~~~

1.11 src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ read_abort_lsn = MyParallelShared->server_version >=
+ LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM;

IMO it makes no sense to compare a server version with a protocol
version. Same as review comment #1.5.

======

1.12 src/include/replication/worker_internal.h

+typedef struct ApplyBgworkerShared
+{
+ slock_t mutex;
+
+ /* Status of apply background worker. */
+ ApplyBgworkerStatus status;
+
+ /* server version of publisher. */
+ uint32 server_version;
+
+ TransactionId stream_xid;
+ uint32 n; /* id of apply background worker */
+} ApplyBgworkerShared;

AFAICT you only ever used 'server_version' for storing the *protocol*
version, so really this member should be called something like
'proto_version'. Please see earlier review comment #1.5 and others.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPvN7fwtUE%3DbidzrsOUXSt%2BJpnkJztZ-Jn5t86moofaZ6g%40mail.gmail.com
[2] https://www.postgresql.org/docs/devel/protocol-logical-replication.html

Kind Regards,
Peter Smith.
Fujitsu Australia.



Here are some review comments for patch v19-0003:

======

3.1 doc/src/sgml/ref/create_subscription.sgml

@@ -240,6 +240,10 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
           transaction is committed. Note that if an error happens when
           applying changes in a background worker, the finish LSN of the
           remote transaction might not be reported in the server log.
+          <literal>parallel</literal> mode has two requirements: 1) the unique
+          column in the relation on the subscriber-side should also be the
+          unique column on the publisher-side; 2) there cannot be any
+          non-immutable functions used by the subscriber-side replicated table.
          </para>

3.1a.
It looked a bit strange starting the sentence with the enum
"<literal>parallel</literal> mode". Maybe reword it to something like:

"This mode has two requirements: ..."
or
"There are two requirements for using <literal>parallel</literal> mode: ..."

3.1b.
Point 1) says "relation", but point 2) says "table". I think a
consistent term should be used.

======

3.2 <general>

For consistency, please search all this patch and replace every:

"... applied by an apply background worker" -> "... applied using an
apply background worker"

And also search/replace every:

"... in the apply background worker: -> "... using an apply background worker"

======

3.3 .../replication/logical/applybgworker.c

@@ -800,3 +800,47 @@ apply_bgworker_subxact_info_add(TransactionId current_xid)
  MemoryContextSwitchTo(oldctx);
  }
 }
+
+/*
+ * Check if changes on this relation can be applied by an apply background
+ * worker.
+ *
+ * Although the commit order is maintained only allowing one process to commit
+ * at a time, the access order to the relation has changed. This could cause
+ * unexpected problems if the unique column on the replicated table is
+ * inconsistent with the publisher-side or contains non-immutable functions
+ * when applying transactions in the apply background worker.
+ */
+void
+apply_bgworker_relation_check(LogicalRepRelMapEntry *rel)

"only allowing" -> "by only allowing" (I think you mean this, right?)

~~~

3.4

+ /*
+ * Return if changes on this relation can be applied by an apply background
+ * worker.
+ */
+ if (rel->parallel_apply == PARALLEL_APPLY_SAFE)
+ return;
+
+ /* We are in error mode and should give user correct error. */
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot replicate target relation \"%s.%s\" using "
+ "subscription parameter streaming=parallel",
+ rel->remoterel.nspname, rel->remoterel.relname),
+ errdetail("The unique column on subscriber is not the unique "
+    "column on publisher or there is at least one "
+    "non-immutable function."),
+ errhint("Please change to use subscription parameter "
+ "streaming=on.")));

3.4a.
Of course, the code should give the user the "correct error" if there
is an error (!), but having a comment explicitly saying so does not
serve any purpose.

3.4b.
The logic might be simplified if it was written differently like:

+ if (rel->parallel_apply != PARALLEL_APPLY_SAFE)
+ ereport(ERROR, ...

======

3.5 src/backend/replication/logical/proto.c

@@ -40,6 +41,68 @@ static void logicalrep_read_tuple(StringInfo in,
LogicalRepTupleData *tuple);
 static void logicalrep_write_namespace(StringInfo out, Oid nspid);
 static const char *logicalrep_read_namespace(StringInfo in);

+static Bitmapset *RelationGetUniqueKeyBitmap(Relation rel);
+
+/*
+ * RelationGetUniqueKeyBitmap -- get a bitmap of unique attribute numbers
+ *
+ * This is similar to RelationGetIdentityKeyBitmap(), but returns a bitmap of
+ * index attribute numbers for all unique indexes.
+ */
+static Bitmapset *
+RelationGetUniqueKeyBitmap(Relation rel)

Why is the forward declaration needed when the static function
immediately follows it?

======

3.6 src/backend/replication/logical/relation.c -
logicalrep_relmap_reset_parallel_cb

@@ -91,6 +98,26 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
  }
 }

+/*
+ * Relcache invalidation callback to reset parallel flag.
+ */
+static void
+logicalrep_relmap_reset_parallel_cb(Datum arg, int cacheid, uint32 hashvalue)

"reset parallel flag" -> "reset parallel_apply flag"

~~~

3.7 src/backend/replication/logical/relation.c -
logicalrep_rel_mark_parallel_apply

+ * There are two requirements for applying changes in an apply background
+ * worker: 1) The unique column in the relation on the subscriber-side should
+ * also be the unique column on the publisher-side; 2) There cannot be any
+ * non-immutable functions used by the subscriber-side.

This comment should exactly match the help text. See review comment #3.1

~~~

3.8

+ /* Initialize the flag. */
+ entry->parallel_apply = PARALLEL_APPLY_SAFE;

I previously suggested [1] (#3.6b) to move this. Consider that you
cannot logically flag the entry as "safe" until you are certain that
it is safe. And you cannot be sure of that until you've passed all the
checks this function is doing. Therefore IMO the assignment to
PARALLEL_APPLY_SAFE should be the last line of the function.
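
i.e. something of this shape (only a sketch; unique_columns_match and
has_unsafe_function stand in for the checks the patch already performs):

/* Assume unsafe until every check below has passed. */
entry->parallel_apply = PARALLEL_APPLY_UNSAFE;

if (!unique_columns_match)
    return;

if (has_unsafe_function)
    return;

/* All checks passed, so only now flag the entry as safe. */
entry->parallel_apply = PARALLEL_APPLY_SAFE;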

~~~

3.9

+ /*
+ * Then, check if there is any non-immutable function used by the local
+ * table. Look for functions in the following places:
+ * a. trigger functions;
+ * b. Column default value expressions and domain constraints;
+ * c. Constraint expressions;
+ * d. Foreign keys.
+ */

"used by the local table" -> "used by the subscriber-side relation"
(reworded so that it is consistent with the First comment)

~~~

3.10

I previously suggested [1] (#3.7) to use goto in this function to
avoid the excessive number of returns. IMO there is nothing inherently
evil about gotos, so long as they are used with care - sometimes they
are the best option. Anyway, I attached some BEFORE/AFTER example code
to this post - others can judge which way is preferable.

======

3.11 src/backend/utils/cache/typcache.c - GetDomainConstraints

@@ -2540,6 +2540,23 @@ compare_values_of_enum(TypeCacheEntry *tcache,
Oid arg1, Oid arg2)
  return 0;
 }

+/*
+ * GetDomainConstraints --- get DomainConstraintState list of
specified domain type
+ */
+List *
+GetDomainConstraints(Oid type_id)
+{
+ TypeCacheEntry *typentry;
+ List    *constraints = NIL;
+
+ typentry = lookup_type_cache(type_id, TYPECACHE_DOMAIN_CONSTR_INFO);
+
+ if(typentry->domainData != NULL)
+ constraints = typentry->domainData->constraints;
+
+ return constraints;
+}

This function can be simplified (if you want). e.g.

List *
GetDomainConstraints(Oid type_id)
{
TypeCacheEntry *typentry;

typentry = lookup_type_cache(type_id, TYPECACHE_DOMAIN_CONSTR_INFO);

return typentry->domainData ? typentry->domainData->constraints : NIL;
}

======

3.12 src/include/replication/logicalrelation.h

@@ -15,6 +15,19 @@
 #include "access/attmap.h"
 #include "replication/logicalproto.h"

+/*
+ * States to determine if changes on one relation can be applied using an
+ * apply background worker.
+ */
+typedef enum ParalleApplySafety
+{
+ PARALLEL_APPLY_UNKNOWN = 0, /* unknown  */
+ PARALLEL_APPLY_SAFE, /* Can apply changes in an apply background
+    worker */
+ PARALLEL_APPLY_UNSAFE /* Can not apply changes in an apply background
+    worker */
+} ParalleApplySafety;
+

3.12a
Typo in enum and typedef names:
"ParalleApplySafety" -> "ParallelApplySafety"

3.12b
I think the values are quite self-explanatory now. Commenting on each
of them separately is not really adding anything useful.
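
i.e. perhaps just:

typedef enum ParallelApplySafety
{
    PARALLEL_APPLY_UNKNOWN = 0,
    PARALLEL_APPLY_SAFE,
    PARALLEL_APPLY_UNSAFE
} ParallelApplySafety;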

3.12c.
New enum missing from typedefs.list?

======

3.13 typedefs.list

Should include the new typedef. See comment #3.12c.

------
[1]
https://www.postgresql.org/message-id/OS3PR01MB62758A6AAED27B3A848CEB7A9E8F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > Attach the news patches.
> >
> > Not able to apply patches cleanly because the change in HEAD (366283961a).
> > Therefore, I rebased the patch based on the changes in HEAD.
> >
> > Attach the new patches.
>
> +    /* Check the foreign keys. */
> +    fkeys = RelationGetFKeyList(entry->localrel);
> +    if (fkeys)
> +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
>
> So if there is a foreign key on any of the tables which are parts of a
> subscription then we do not allow changes for that subscription to be
> applied in parallel?
>

I think the above check will just prevent parallelism for an xact
operating on the corresponding relation, not for the relations of the
entire subscription. Your statement sounds like you are saying that it
will prevent parallelism for all the other tables in a subscription
that has a table with an FK.

>  I think this is a big limitation because having
> foreign key on the table is very normal right?  I agree that if we
> allow them then there could be failure due to out of order apply
> right?
>

What kind of failure do you have in mind, and how can it occur? The one
way it can fail is if the publisher doesn't have a corresponding
foreign key on the table, because then the publisher could have allowed
an insert into a table (an insert into the FK table without having the
corresponding key in the PK table) which may not be allowed on the
subscriber. However, I don't see any check that could prevent this,
because for that we would need to compare the FK list for a table on the
publisher with the corresponding one on the subscriber. I am not
really sure that the risk of such conflicts justifies blocking
parallelism for transactions operating on tables with FKs, because
those conflicts can occur even without parallelism; it is just a matter
of timing. But I could be missing something due to which the above
check is useful?

-- 
With Regards,
Amit Kapila.



On Wed, Jul 27, 2022 at 10:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > > Attach the news patches.
> > >
> > > Not able to apply patches cleanly because the change in HEAD (366283961a).
> > > Therefore, I rebased the patch based on the changes in HEAD.
> > >
> > > Attach the new patches.
> >
> > +    /* Check the foreign keys. */
> > +    fkeys = RelationGetFKeyList(entry->localrel);
> > +    if (fkeys)
> > +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
> >
> > So if there is a foreign key on any of the tables which are parts of a
> > subscription then we do not allow changes for that subscription to be
> > applied in parallel?
> >
>
> I think the above check will just prevent the parallelism for a xact
> operating on the corresponding relation not the relations of the
> entire subscription. Your statement sounds like you are saying that it
> will prevent parallelism for all the other tables in the subscription
> which has a table with FK.

Okay, got it. I thought we were disallowing parallelism for the entire
subscription.

> >  I think this is a big limitation because having
> > foreign key on the table is very normal right?  I agree that if we
> > allow them then there could be failure due to out of order apply
> > right?
> >
>
> What kind of failure do you have in mind and how it can occur? The one
> way it can fail is if the publisher doesn't have a corresponding
> foreign key on the table because then the publisher could have allowed
> an insert into a table (insert into FK table without having the
> corresponding key in PK table) which may not be allowed on the
> subscriber. However, I don't see any check that could prevent this
> because for this we need to compare the FK list for a table from the
> publisher with the corresponding one on the subscriber. I am not
> really sure if due to the risk of such conflicts we should block
> parallelism of transactions operating on tables with FK because those
> conflicts can occur even without parallelism, it is just a matter of
> timing. But, I could be missing something due to which the above check
> can be useful?

Actually, my question starts with this check[1][2]. From this it
appears that if the relation has a foreign key then we are
marking it parallel unsafe[2], and later in [1], while the worker is
applying changes for that relation, if it was marked parallel
unsafe then we throw an error.  So my question was why are we
putting this restriction?  Although this error only talks about
unique columns and non-immutable functions, it is also raised if the
target table has a foreign key.  So do we really need to restrict
this? I mean, why are we restricting this case?


[1]
+apply_bgworker_relation_check(LogicalRepRelMapEntry *rel)
+{
+ /* Skip check if not an apply background worker. */
+ if (!am_apply_bgworker())
+ return;
+
+ /*
+ * Partition table checks are done later in function
+ * apply_handle_tuple_routing.
+ */
+ if (rel->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ return;
+
+ /*
+ * Return if changes on this relation can be applied by an apply background
+ * worker.
+ */
+ if (rel->parallel_apply == PARALLEL_APPLY_SAFE)
+ return;
+
+ /* We are in error mode and should give user correct error. */
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot replicate target relation \"%s.%s\" using "
+ "subscription parameter streaming=parallel",
+ rel->remoterel.nspname, rel->remoterel.relname),
+ errdetail("The unique column on subscriber is not the unique "
+    "column on publisher or there is at least one "
+    "non-immutable function."),
+ errhint("Please change to use subscription parameter "
+ "streaming=on.")));
+}

[2]
> > +    /* Check the foreign keys. */
> > +    fkeys = RelationGetFKeyList(entry->localrel);
> > +    if (fkeys)
> > +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wednesday, July 27, 2022 1:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Wed, Jul 27, 2022 at 10:06 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > > > Attach the news patches.
> > > >
> > > > Not able to apply patches cleanly because the change in HEAD
> (366283961a).
> > > > Therefore, I rebased the patch based on the changes in HEAD.
> > > >
> > > > Attach the new patches.
> > >
> > > +    /* Check the foreign keys. */
> > > +    fkeys = RelationGetFKeyList(entry->localrel);
> > > +    if (fkeys)
> > > +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
> > >
> > > So if there is a foreign key on any of the tables which are parts of
> > > a subscription then we do not allow changes for that subscription to
> > > be applied in parallel?
> > >
> >
> > I think the above check will just prevent the parallelism for a xact
> > operating on the corresponding relation not the relations of the
> > entire subscription. Your statement sounds like you are saying that it
> > will prevent parallelism for all the other tables in the subscription
> > which has a table with FK.
> 
> Okay, got it. I thought we are disallowing parallelism for the entire subscription.
> 
> > >  I think this is a big limitation because having foreign key on the
> > > table is very normal right?  I agree that if we allow them then
> > > there could be failure due to out of order apply right?
> > >
> >
> > What kind of failure do you have in mind and how it can occur? The one
> > way it can fail is if the publisher doesn't have a corresponding
> > foreign key on the table because then the publisher could have allowed
> > an insert into a table (insert into FK table without having the
> > corresponding key in PK table) which may not be allowed on the
> > subscriber. However, I don't see any check that could prevent this
> > because for this we need to compare the FK list for a table from the
> > publisher with the corresponding one on the subscriber. I am not
> > really sure if due to the risk of such conflicts we should block
> > parallelism of transactions operating on tables with FK because those
> > conflicts can occur even without parallelism, it is just a matter of
> > timing. But, I could be missing something due to which the above check
> > can be useful?
> 
> Actually, my question starts with this check[1][2], from this it
> appears that if this relation is having a foreign key then we are
> marking it parallel unsafe[2] and later in [1] while the worker is
> applying changes for that relation and if it was marked parallel
> unsafe then we are throwing error.  So my question was why we are
> putting this restriction?  Although this error is only talking about
> unique and non-immutable functions this is also giving an error if the
> target table had a foreign key.  So my question was do we really need
> to restrict this? I mean why we are restricting this case?
> 

Hi,

I think the foreign key check is used to prevent the apply worker from waiting
indefinitely, which can be caused by a foreign key difference between publisher
and subscriber, like the following example:

-------------------------------------
Publisher:
-- both tables are published
CREATE TABLE PKTABLE ( ptest1 int);
CREATE TABLE FKTABLE ( ftest1 int);

-- initial data
INSERT INTO PKTABLE VALUES(1);

Subscriber:
CREATE TABLE PKTABLE ( ptest1 int PRIMARY KEY);
CREATE TABLE FKTABLE ( ftest1 int REFERENCES PKTABLE);

-- Execute the following transactions on publisher

Tx1:
INSERT ... -- make enough changes to start streaming mode
DELETE FROM PKTABLE;
    Tx2:
    INSERT INTO FKTABLE VALUES(1);
    COMMIT;
COMMIT;
-------------------------------------

The subscriber's apply worker will wait indefinitely, because the main apply
worker is waiting for the streaming transaction, which is being applied in
another apply bgworker, to finish.


BTW, I think the foreign key won't take effect in the subscriber's apply worker
by default, because we set session_replication_role to 'replica' in the apply
worker, which prevents the FK trigger function from being executed (only
triggers with the FIRES_ON_REPLICA flag are executed in this mode). A user can
only make the foreign key work by altering the trigger to enable it in replica
mode. So, ISTM, we won't hit this ERROR frequently.

And based on this, another comment about the patch is that it seems unnecessary
to directly check the FKs returned by RelationGetFKeyList. Checking the actual
FK trigger function seems enough.
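
For example, a rough sketch of what I mean (sketch only; it assumes
RI_FKey_trigger_type() and the trigger-enable flags are usable at this point
in the code):

/* Mark unsafe only if an RI trigger would actually fire during apply. */
if (entry->localrel->trigdesc != NULL)
{
    for (int i = 0; i < entry->localrel->trigdesc->numtriggers; i++)
    {
        Trigger *trig = &entry->localrel->trigdesc->triggers[i];

        /*
         * Under session_replication_role = 'replica' only ALWAYS and
         * REPLICA triggers fire, so only those can cause the wait.
         */
        if (RI_FKey_trigger_type(trig->tgfoid) != RI_TRIGGER_NONE &&
            (trig->tgenabled == TRIGGER_FIRES_ALWAYS ||
             trig->tgenabled == TRIGGER_FIRES_ON_REPLICA))
        {
            entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
            break;
        }
    }
}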

Best regards,
Hou zj

Here are some review comments for the patch v19-0004:

======

1. doc/src/sgml/ref/create_subscription.sgml

@@ -244,6 +244,11 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
           column in the relation on the subscriber-side should also be the
           unique column on the publisher-side; 2) there cannot be any
           non-immutable functions used by the subscriber-side replicated table.
+          When applying a streaming transaction, if either requirement is not
+          met, the background worker will exit with an error.
+          <literal>parallel</literal> mode is disregarded when retrying;
+          instead the transaction will be applied using <literal>on</literal>
+          mode.
          </para>

That last sentence starting with lowercase seems odd - that's why I
thought saying "The parallel mode..." might be better. IMO "on mode"
seems strange too. Hence my previous [1] (#4.3) suggestion for this:
SUGGESTION
The <literal>parallel</literal> mode is disregarded when retrying;
instead the transaction will be applied using <literal>streaming =
on</literal>.

======

2. src/backend/replication/logical/worker.c - start_table_sync

@@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos,
char **myslotname)
  }
  PG_CATCH();
  {
+ /*
+ * Emit the error message, and recover from the error state to an idle
+ * state
+ */
+ HOLD_INTERRUPTS();
+
+ EmitErrorReport();
+ AbortOutOfAnyTransaction();
+ FlushErrorState();
+
+ RESUME_INTERRUPTS();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ /* Set the retry flag. */
+ set_subscription_retry(true);
+
  if (MySubscription->disableonerr)
  DisableSubscriptionAndExit();
- else
- {
- /*
- * Report the worker failed during table synchronization. Abort
- * the current transaction so that the stats message is sent in an
- * idle state.
- */
- AbortOutOfAnyTransaction();
- pgstat_report_subscription_error(MySubscription->oid, false);

- PG_RE_THROW();
- }
+ proc_exit(0);
  }

But is it correct to set the 'retry' flag even if
MySubscription->disableonerr is true? Won't that mean even after the
user corrects the problem and then re-enables the subscription it
still won't let streaming=parallel work, because that retry flag
is set?

Also, something seems wrong to me here - IIUC the patch changed this
code because of the potential risk of an error within the
set_subscription_retry function, but now if such an error happens the
current code will bypass even getting to DisableSubscriptionAndExit,
so the subscription won't have a chance to get disabled as the user
might have wanted.
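
For example, maybe something of this shape would cover both concerns (only a
sketch rearranging the patch's own code; whether subretry should be recorded
at all in the disableonerr case is still for you to decide):

PG_CATCH();
{
    /* Emit the error and recover to an idle state first. */
    HOLD_INTERRUPTS();
    EmitErrorReport();
    AbortOutOfAnyTransaction();
    FlushErrorState();
    RESUME_INTERRUPTS();

    pgstat_report_subscription_error(MySubscription->oid, false);

    if (MySubscription->disableonerr)
        DisableSubscriptionAndExit();   /* disable first, so an error in
                                         * set_subscription_retry() cannot
                                         * skip it */
    else
        set_subscription_retry(true);   /* only mark retry while the
                                         * subscription stays enabled */

    proc_exit(0);
}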

~~~

3. src/backend/replication/logical/worker.c - start_apply

@@ -3940,20 +3971,27 @@ start_apply(XLogRecPtr origin_startpos)
  }
  PG_CATCH();
  {
+ /*
+ * Emit the error message, and recover from the error state to an idle
+ * state
+ */
+ HOLD_INTERRUPTS();
+
+ EmitErrorReport();
+ AbortOutOfAnyTransaction();
+ FlushErrorState();
+
+ RESUME_INTERRUPTS();
+
+ /* Report the worker failed while applying changes */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());
+
+ /* Set the retry flag. */
+ set_subscription_retry(true);
+
  if (MySubscription->disableonerr)
  DisableSubscriptionAndExit();
- else
- {
- /*
- * Report the worker failed while applying changes. Abort the
- * current transaction so that the stats message is sent in an
- * idle state.
- */
- AbortOutOfAnyTransaction();
- pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
-
- PG_RE_THROW();
- }
  }

(Same as previous review comment #2)

But is it correct to set the 'retry' flag even if
MySubscription->disableonerr is true? Won't that mean even after the
user corrects the problem and then re-enables the subscription it
still won't let streaming=parallel work, because that retry flag
is set?

Also, something seems wrong to me here - IIUC the patch changed this
code because of the potential risk of an error within the
set_subscription_retry function, but now if such an error happens the
current code will bypass even getting to DisableSubscriptionAndExit,
so the subscription won't have a chance to get disabled as the user
might have wanted.

~~~

4. src/backend/replication/logical/worker.c - DisableSubscriptionAndExit

 /*
- * After error recovery, disable the subscription in a new transaction
- * and exit cleanly.
+ * Disable the subscription in a new transaction.
  */
 static void
 DisableSubscriptionAndExit(void)
 {
- /*
- * Emit the error message, and recover from the error state to an idle
- * state
- */
- HOLD_INTERRUPTS();
-
- EmitErrorReport();
- AbortOutOfAnyTransaction();
- FlushErrorState();
-
- RESUME_INTERRUPTS();
-
- /* Report the worker failed during either table synchronization or apply */
- pgstat_report_subscription_error(MyLogicalRepWorker->subid,
- !am_tablesync_worker());
-
  /* Disable the subscription */
  StartTransactionCommand();
  DisableSubscription(MySubscription->oid);
@@ -4231,8 +4252,6 @@ DisableSubscriptionAndExit(void)
  ereport(LOG,
  errmsg("logical replication subscription \"%s\" has been disabled
due to an error",
     MySubscription->name));
-
- proc_exit(0);
 }

4a.
Hmm, I think it is a bad idea to remove the "exiting" code from the
function but still leave the function name saying "AndExit" as before.

4b.
Also, now the patch is unconditionally doing proc_exit(0) in the
calling code where previously it would do PG_RE_THROW. So it's a
subtle difference from the path the code used to take for worker
errors.

~~~

5. src/backend/replication/logical/worker.c - set_subscription_retry

@@ -4467,3 +4486,63 @@ reset_apply_error_context_info(void)
  apply_error_callback_arg.remote_attnum = -1;
  set_apply_error_context_xact(InvalidTransactionId, InvalidXLogRecPtr);
 }
+
+/*
+ * Set subretry of pg_subscription catalog.
+ *
+ * If retry is true, subscriber is about to exit with an error. Otherwise, it
+ * means that the transaction was applied successfully.
+ */
+static void
+set_subscription_retry(bool retry)
+{
+ Relation rel;
+ HeapTuple tup;
+ bool started_tx = false;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+
+ if (MySubscription->retry == retry ||
+ am_apply_bgworker())
+ return;

Currently, I think this new 'subretry' field is only used to decide
whether a retry can use an apply background worker or not, i.e. all
this logic is *only* needed when streaming=parallel. But AFAICT the
logic for setting/clearing the retry flag is executed *always*,
regardless of the streaming mode.

So, for all the times when the user did not ask for streaming=parallel,
doesn't this just cause unnecessary overhead for every transaction?
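
e.g. maybe just an early bail-out like this (sketch only; the parsed
streaming mode member/constant names are whatever the patch uses, shown here
as MySubscription->stream and LOGICALREP_STREAM_PARALLEL):

static void
set_subscription_retry(bool retry)
{
    /* The retry flag only matters when streaming=parallel is in use. */
    if (MySubscription->stream != LOGICALREP_STREAM_PARALLEL)
        return;

    if (MySubscription->retry == retry ||
        am_apply_bgworker())
        return;

    /* ... existing catalog update ... */
}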

------
[1]
https://www.postgresql.org/message-id/OS3PR01MB62758A6AAED27B3A848CEB7A9E8F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Tuesday, July 26, 2022 5:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jul 22, 2022 at 8:27 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > > Attach the news patches.
> > >
> > > Not able to apply patches cleanly because the change in HEAD
> (366283961a).
> > > Therefore, I rebased the patch based on the changes in HEAD.
> > >
> > > Attach the new patches.
> >
> > +    /* Check the foreign keys. */
> > +    fkeys = RelationGetFKeyList(entry->localrel);
> > +    if (fkeys)
> > +        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
> >
> > So if there is a foreign key on any of the tables which are parts of a
> > subscription then we do not allow changes for that subscription to be
> > applied in parallel?  I think this is a big limitation because having
> > foreign key on the table is very normal right?  I agree that if we
> > allow them then there could be failure due to out of order apply
> > right? but IMHO we should not put the restriction instead let it fail
> > if there is ever such conflict.  Because if there is a conflict the
> > transaction will be sent again.  Do we see that there could be wrong
> > or inconsistent results if we allow such things to be executed in
> > parallel.  If not then IMHO just to avoid some corner case failure we
> > are restricting very normal cases.
> 
> some more comments..
> 1.
> +            /*
> +             * If we have found a free worker or if we are already
> applying this
> +             * transaction in an apply background worker, then we
> pass the data to
> +             * that worker.
> +             */
> +            if (first_segment)
> +                apply_bgworker_send_data(stream_apply_worker, s->len,
> + s->data);
> 
> Comment says that if we have found a free worker or we are already applying in
> the worker then pass the changes to the worker but actually as per the code
> here we are only passing in case of first_segment?
> 
> I think what you are trying to say is that if it is first segment then send the
> 
> 2.
> +        /*
> +         * This is the main apply worker. Check if there is any free apply
> +         * background worker we can use to process this transaction.
> +         */
> +        if (first_segment)
> +            stream_apply_worker = apply_bgworker_start(stream_xid);
> +        else
> +            stream_apply_worker = apply_bgworker_find(stream_xid);
> 
> So currently, whenever we get a new streamed transaction we try to start a new
> background worker for that.  Why do we need to start/close the background
> apply worker every time we get a new streamed transaction.  I mean we can
> keep the worker in the pool for time being and if there is a new transaction
> looking for a worker then we can find from that.  Starting a worker is costly
> operation and since we are using parallelism for this mean we are expecting
> that there would be frequent streamed transaction needing parallel apply
> worker so why not to let it wait for a certain amount of time so that if load is low
> it will anyway stop and if the load is high it will be reused for next streamed
> transaction.

It seems the function name was a bit misleading. Currently, a started apply
bgworker won't exit after applying the transaction, and
apply_bgworker_start will first try to choose a free worker; it will start a
new worker only if no free worker is available.
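
To illustrate the lookup order (a rough sketch; apply_bgworker_setup is an
illustrative name, not necessarily the exact one in the patch):

/* apply_bgworker_start(): prefer reusing an idle worker over starting one. */
if (ApplyWorkersFreeList != NIL)
{
    wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList);
    ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList);
}
else
    wstate = apply_bgworker_setup();    /* start a new bgworker */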

> 3.
> Why are we restricting parallel apply workers only for the streamed
> transactions, because streaming depends upon the size of the logical decoding
> work mem so making steaming and parallel apply tightly coupled seems too
> restrictive to me.  Do we see some obvious problems in applying other
> transactions in parallel?

We thought there could be conflict failures and deadlocks if we parallel
apply normal transactions, which would need a transaction dependency
check [1]. But I will do some more research on this and share the result soon.

[1] https://www.postgresql.org/message-id/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com

Best regards,
Hou zj

Dear Wang-san,

Hi, I'm also interested in the patch and I have started to review it.
The following are comments about 0001.

1. terminology

In your patch a new worker, "apply background worker", has been introduced,
but I thought it might be confusing because PostgreSQL already has the worker "background worker".
Both the apply worker and the apply bgworker are categorized as bgworkers.
Do you have any reasons not to use "apply parallel worker" or "apply streaming worker"?
(Note that I'm not a native English speaker)

2. logicalrep_worker_stop()

```
-       /* No worker, nothing to do. */
-       if (!worker)
-       {
-               LWLockRelease(LogicalRepWorkerLock);
-               return;
-       }
+       if (worker)
+               logicalrep_worker_stop_internal(worker);
+
+       LWLockRelease(LogicalRepWorkerLock);
+}
```

I thought you could add a comment about the meaning of the if-statement, like "No main apply worker, nothing to do".

3. logicalrep_workers_find()

I thought you could add a description of the difference between this and logicalrep_worker_find() at the top of the
function.
IIUC logicalrep_workers_find() counts subworkers, but logicalrep_worker_find() does not consider that type of worker.

4. logicalrep_worker_detach()

```
static void
 logicalrep_worker_detach(void)
 {
+       /*
+        * If we are the main apply worker, stop all the apply background workers
+        * we started before.
+        *
```

I thought "we are" should be "This is", based on other comments.

5. applybgworker.c

```
+/* Apply background workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static List *ApplyWorkersFreeList = NIL;
+static List *ApplyWorkersList = NIL;
```

I thought they should be ApplyBgWorkersXXX, because they store information related only to apply bgworkers.

6. ApplyBgworkerShared

```
+       TransactionId   stream_xid;
+       uint32  n;      /* id of apply background worker */
+} ApplyBgworkerShared;
```

I thought the field "n" is too general; how about "proc_id" or "worker_id"?

7. apply_bgworker_wait_for()

```
+               /* If any workers (or the postmaster) have died, we have failed. */
+               if (status == APPLY_BGWORKER_EXIT)
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                        errmsg("background worker %u failed to apply transaction %u",
+                                                       wstate->shared->n, wstate->shared->stream_xid)))
```

7.a
I thought we should not mention PM death here, because in that case
the apply worker will exit at WaitLatch().

7.b
The error message should be "apply background worker %u...".

8. apply_bgworker_check_status()

```
+                                        errmsg("background worker %u exited unexpectedly",
+                                                       wstate->shared->n)));
```

The error message should be "apply background worker %u...".


9. apply_bgworker_send_data()

```
+       if (result != SHM_MQ_SUCCESS)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("could not send tuples to shared-memory queue")));
```

I thought the error message should be "could not send data to..."
because the sent data might not be tuples. For example, in the case of STREAM PREPARE, I think it does not contain tuples.

10. wait_event.h

```
        WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT,
+       WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE,
        WAIT_EVENT_LOGICAL_SYNC_DATA,
```

I thought the event should be WAIT_EVENT_LOGICAL_APPLY_BG_WORKER_STATE_CHANGE,
because this is used when the apply worker waits until the status of the bgworker changes.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Dear Wang,

I found further comments about the test code.

11. src/test/regress/sql/subscription.sql

```
-- fail - streaming must be boolean
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
streaming= foo);
 
```

The comment is no longer correct: should be "streaming must be boolean or 'parallel'"

12. src/test/regress/sql/subscription.sql

```
-- now it works
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
streaming= true);
 
```

I think we should test the case of streaming = 'parallel'.

13. 015_stream.pl

I could not find a test for TRUNCATE. IIUC the apply bgworker works well
even if it gets a LOGICAL_REP_MSG_TRUNCATE message from the main worker.
Can you add that case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Wed, Jul 27, 2022 at 1:27 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, July 27, 2022 1:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 27, 2022 at 10:06 AM Amit Kapila <amit.kapila16@gmail.com>
> > >
> > > What kind of failure do you have in mind and how it can occur? The one
> > > way it can fail is if the publisher doesn't have a corresponding
> > > foreign key on the table because then the publisher could have allowed
> > > an insert into a table (insert into FK table without having the
> > > corresponding key in PK table) which may not be allowed on the
> > > subscriber. However, I don't see any check that could prevent this
> > > because for this we need to compare the FK list for a table from the
> > > publisher with the corresponding one on the subscriber. I am not
> > > really sure if due to the risk of such conflicts we should block
> > > parallelism of transactions operating on tables with FK because those
> > > conflicts can occur even without parallelism, it is just a matter of
> > > timing. But, I could be missing something due to which the above check
> > > can be useful?
> >
> > Actually, my question starts with this check[1][2], from this it
> > appears that if this relation is having a foreign key then we are
> > marking it parallel unsafe[2] and later in [1] while the worker is
> > applying changes for that relation and if it was marked parallel
> > unsafe then we are throwing error.  So my question was why we are
> > putting this restriction?  Although this error is only talking about
> > unique and non-immutable functions this is also giving an error if the
> > target table had a foreign key.  So my question was do we really need
> > to restrict this? I mean why we are restricting this case?
> >
>
> Hi,
>
> I think the foreign key check is used to prevent the apply worker from waiting
> indefinitely which is caused by foreign key difference between publisher and
> subscriber, Like the following example:
>
> -------------------------------------
> Publisher:
> -- both table are published
> CREATE TABLE PKTABLE ( ptest1 int);
> CREATE TABLE FKTABLE ( ftest1 int);
>
> -- initial data
> INSERT INTO PKTABLE VALUES(1);
>
> Subcriber:
> CREATE TABLE PKTABLE ( ptest1 int PRIMARY KEY);
> CREATE TABLE FKTABLE ( ftest1 int REFERENCES PKTABLE);
>
> -- Execute the following transactions on publisher
>
> Tx1:
> INSERT ... -- make enough changes to start streaming mode
> DELETE FROM PKTABLE;
>         Tx2:
>         INSERT ITNO FKTABLE VALUES(1);
>         COMMIT;
> COMMIT;
> -------------------------------------
>
> The subcriber's apply worker will wait indefinitely, because the main apply worker is
> waiting for the streaming transaction to finish which is in another apply
> bgworker.
>

IIUC, here the problem will be that TX2 (the insert into the FK table)
performed by the apply worker will wait for a parallel worker doing streaming
transaction TX1, which has performed a delete from the PK table. This wait is
required because we can't decide whether the insert will be successful
until TX1 is either committed or rolled back. This is similar to the
problem related to primary/unique keys mentioned earlier [1]. If so,
then we should try to forbid this in some way to avoid subscribers
getting stuck.

Dilip, does this reason sound sufficient to you for such a check, or
do you still think we don't need any check for FKs?

>
> BTW, I think the foreign key won't take effect in subscriber's apply worker by
> default. Because we set session_replication_role to 'replica' in apply worker
> which prevent the FK trigger function to be executed(only the trigger with
> FIRES_ON_REPLICA flag will be executed in this mode). User can only alter the
> trigger to enable it on replica mode to make the foreign key work. So, ISTM, we
> won't hit this ERROR frequently.
>
> And based on this, another comment about the patch is that it seems unnecessary
> to directly check the FK returned by RelationGetFKeyList. Checking the actual FK
> trigger function seems enough.
>

That is correct. I think it would have been better if we could detect
that the publisher doesn't have the FK but the subscriber has it, as the
problem can occur only in that scenario. If that requires us to send more
information from the publisher, we can leave it for now (as this
doesn't seem to be a frequent scenario) and keep a simpler check based
on the subscriber schema.

I think we should add a test as you mentioned above so that if
tomorrow someone tries to remove the FK check, we have a way to know.
Also, please add comments and tests for the additional checks related to
constraints in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1JwahU_WuP3S%2B7POqta%3DPhm_3gxZeVmJuuoUq1NV%3DkrXA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



On Wednesday, July 27, 2022 4:22 PM houzj.fnst@fujitsu.com wrote:
> 
> On Tuesday, July 26, 2022 5:34 PM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> 
> > 3.
> > Why are we restricting parallel apply workers only for the streamed
> > transactions, because streaming depends upon the size of the logical
> > decoding work mem so making steaming and parallel apply tightly
> > coupled seems too restrictive to me.  Do we see some obvious problems
> > in applying other transactions in parallel?
> 
> We thought there could be some conflict failure and deadlock if we parallel
> apply normal transaction which need transaction dependency check[1]. But I
> will do some more research for this and share the result soon.

After thinking about this, I confirmed that it would be easy to cause deadlock
errors if we don't have additional dependency analysis and commit-order
preservation handling for parallel apply of normal transactions.

The basic idea for parallel apply of normal transactions in a first
version is: the main apply worker will receive data from the publisher and
pass it to an apply bgworker without applying it by itself. Only before the
apply bgworker applies the final COMMIT command does it need to wait for any
previous transaction to finish, to preserve the commit order. It means we
could pass the next transaction's data to another apply bgworker before the
previous transaction is committed in the first apply bgworker.

In this approach, we have to do the dependency analysis because it's easy to
cause deadlock errors when applying DMLs in parallel (see the attachment for
examples where the deadlock could happen). So, it's a bit different from
streaming transactions.

We could instead apply the next transaction only after the first transaction
is committed, in which case we don't need the dependency analysis, but that
would not bring a noticeable performance improvement even if we start several
apply workers, because the actual DMLs are not performed in parallel.

Based on the above, we plan to first introduce the patch to perform streaming
logical transactions by background workers, and then introduce parallel apply
of normal transactions, whose design is different and needs some additional
handling.

Best regards,
Hou zj

> [1]
> https://www.postgresql.org/message-id/CAA4eK1%2BwyN6zpaHUkCLorEW
> Nx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com



On Tue, Aug 2, 2022 at 5:16 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, July 27, 2022 4:22 PM houzj.fnst@fujitsu.com wrote:
> >
> > On Tuesday, July 26, 2022 5:34 PM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> >
> > > 3.
> > > Why are we restricting parallel apply workers only for the streamed
> > > transactions, because streaming depends upon the size of the logical
> > > decoding work mem so making steaming and parallel apply tightly
> > > coupled seems too restrictive to me.  Do we see some obvious problems
> > > in applying other transactions in parallel?
> >
> > We thought there could be some conflict failure and deadlock if we parallel
> > apply normal transaction which need transaction dependency check[1]. But I
> > will do some more research for this and share the result soon.
>
> After thinking about this, I confirmed that it would be easy to cause deadlock
> error if we don't have additional dependency analysis and COMMIT order preserve
> handling for parallel apply normal transaction.
>
> Because the basic idea to parallel apply normal transaction in the first
> version is that: the main apply worker will receive data from pub and pass them
> to apply bgworker without applying by itself. And only before the apply
> bgworker apply the final COMMIT command, it need to wait for any previous
> transaction to finish to preserve the commit order. It means we could pass the
> next transaction's data to another apply bgworker before the previous
> transaction is committed in the first apply bgworker.
>
> In this approach, we have to do the dependency analysis because it's easy to
> cause dead lock error when applying DMLs in parallel(See the attachment for the
> examples where the dead lock could happen). So, it's a bit different from
> streaming transaction.
>
> We could apply the next transaction only after the first transaction is
> committed in which approach we don't need the dependency analysis, but it would
> not bring noticeable performance improvement even if we start serval apply
> workers to do that because the actual DMLs are not performed in parallel.
>

I agree that for short transactions it may not bring a noticeable
performance improvement, but somewhat larger transactions could still
benefit from parallelism even with the approach where we don't start to
operate on new transactions without waiting for the previous
transaction's commit. Having said that, I think we can enable
parallelism for non-streaming transactions as a separate patch.

-- 
With Regards,
Amit Kapila.



On Thurs, Jul 28, 2022 at 21:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thanks for your comments and opinions.

> On Wed, Jul 27, 2022 at 1:27 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > BTW, I think the foreign key won't take effect in subscriber's apply worker by
> > default. Because we set session_replication_role to 'replica' in apply worker
> > which prevent the FK trigger function to be executed(only the trigger with
> > FIRES_ON_REPLICA flag will be executed in this mode). User can only alter the
> > trigger to enable it on replica mode to make the foreign key work. So, ISTM,
> we
> > won't hit this ERROR frequently.
> >
> > And based on this, another comment about the patch is that it seems
> unnecessary
> > to directly check the FK returned by RelationGetFKeyList. Checking the actual
> FK
> > trigger function seems enough.
> >
> 
> That is correct. I think it would have been better if we can detect
> that publisher doesn't have FK but the subscriber has FK as it can
> occur only in that scenario. If that requires us to send more
> information from the publisher, we can leave it for now (as this
> doesn't seem to be a frequent scenario) and keep a simpler check based
> on subscriber schema.
> 
> I think we should add a test as mentioned by you above so that if
> tomorrow one tries to remove the FK check, we have a way to know.
> Also, please add comments and tests for additional checks related to
> constraints in the patch.
> 
> [1] - https://www.postgresql.org/message-
> id/CAA4eK1JwahU_WuP3S%2B7POqta%3DPhm_3gxZeVmJuuoUq1NV%3DkrXA
> %40mail.gmail.com

I added some test cases that would cause indefinite waits without the
additional checks related to constraints (please see file
032_streaming_apply.pl in the 0003 patch). I also added some comments
explaining the FK check and why we need these checks.

In addition, I found another two scenarios that could cause infinite waits, so
I made the following changes:
  1. Check the default values of columns that exist only on the
     subscriber side. (Previous versions only checked columns that exist
     on both the publisher side and the subscriber side.)
  2. When using an apply background worker, the check needs to be performed not
     only in the apply background worker, but also in the main apply worker.

I also did some other improvements based on the suggestions posted in this
thread. Attached are the new patches.

Regards,
Wang wei


On Thurs, Jul 28, 2022 at 13:20 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Wang-san,
> 
> Hi, I'm also interested in the patch and I started to review this.
> Followings are comments about 0001.

Thanks for your kind review and comments.
To avoid making this thread too long, I will reply to all of your comments
(#1~#13) in this email.

> 1. terminology
> 
> In your patch a new worker "apply background worker" has been introduced,
> but I thought it might be confused because PostgreSQL has already the worker
> "background worker".
> Both of apply worker and apply bworker are categolized as bgworker.
> Do you have any reasons not to use "apply parallel worker" or "apply streaming
> worker"?
> (Note that I'm not native English speaker)

Since we will later consider applying non-streamed transactions in parallel, I
think "apply streaming worker" might not be very suitable. PostgreSQL also
already has the worker "parallel worker", so between "apply parallel worker"
and "apply background worker", I feel that "apply background worker" makes the
relationship between the workers clearer ("[main] apply worker" and "apply
background worker").

> 2. logicalrep_worker_stop()
> 
> ```
> -       /* No worker, nothing to do. */
> -       if (!worker)
> -       {
> -               LWLockRelease(LogicalRepWorkerLock);
> -               return;
> -       }
> +       if (worker)
> +               logicalrep_worker_stop_internal(worker);
> +
> +       LWLockRelease(LogicalRepWorkerLock);
> +}
> ```
> 
> I thought you could add a comment the meaning of if-statement, like "No main
> apply worker, nothing to do"

Since the processing in the if statement is reversed from before, I added the
following comment based on your suggestion:
```
Found the main worker, then try to stop it.
```

> 3. logicalrep_workers_find()
> 
> I thought you could add a description about difference between this and
> logicalrep_worker_find() at the top of the function.
> IIUC logicalrep_workers_find() counts subworker, but logicalrep_worker_find()
> does not focus such type of workers.

I think it is fine to keep the comment because the comment says "returns list
of *all workers* for the subscription".
Also, we have added the comment "We are only interested in the main apply
worker or table sync worker here" in the function logicalrep_worker_find.

> 5. applybgworker.c
> 
> ```
> +/* Apply background workers hash table (initialized on first use) */
> +static HTAB *ApplyWorkersHash = NULL;
> +static List *ApplyWorkersFreeList = NIL;
> +static List *ApplyWorkersList = NIL;
> ```
> 
> I thought they should be ApplyBgWorkersXXX, because they stores information
> only related with apply bgworkers.

I renamed them to ApplyBgworkersXXX just for consistency with other names.

> 6. ApplyBgworkerShared
> 
> ```
> +       TransactionId   stream_xid;
> +       uint32  n;      /* id of apply background worker */
> +} ApplyBgworkerShared;
> ```
> 
> I thought the field "n" is too general, how about "proc_id" or "worker_id"?

I think "worker_id" seems better, so I renamed "n" to "worker_id".

> 10. wait_event.h
> 
> ```
>         WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT,
> +       WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE,
>         WAIT_EVENT_LOGICAL_SYNC_DATA,
> ```
> 
> I thought the event should be
> WAIT_EVENT_LOGICAL_APPLY_BG_WORKER_STATE_CHANGE,
> because this is used when apply worker waits until the status of bgworker
> changes.

I renamed it to "WAIT_EVENT_LOGICAL_APPLY_BGWORKER_STATE_CHANGE" just for
consistency with other names.

> 13. 015_stream.pl
> 
> I could not find test about TRUNCATE. IIUC apply bgworker works well
> even if it gets LOGICAL_REP_MSG_TRUNCATE message from main worker.
> Can you add the case?

I modified the test cases in "032_streaming_apply.pl" this time; the use case
you mentioned is covered now.

The rest of the comments are improved as suggested.
The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275D64BE7726B0221B15F389E9F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

On Wed, Jul 27, 2022 at 16:03 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for the patch v19-0004:

Thanks for your kind review and comments.
To avoid making this thread too long, I will reply to all of your comments
(0001-patch ~ 0004-patch) in this email.
In addition, in order not to confuse the replies, I added the following serial
numbers above your comments on the 0004 patch:
```
4.2 && 4.3
4.4
4.5
```

> 1.6 src/backend/replication/logical/applybgworker.c - LogicalApplyBgwLoop
> 
> +/* Apply Background Worker main loop */
> +static void
> +LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared
> *shared)
> 
> 'shared' seems a very vague param name. Maybe can be 'bgw_shared' or
> 'parallel_shared' or something better?
> 
> ~~~
> 
> 1.7 src/backend/replication/logical/applybgworker.c - ApplyBgworkerMain
> 
> +/*
> + * Apply Background Worker entry point
> + */
> +void
> +ApplyBgworkerMain(Datum main_arg)
> +{
> + volatile ApplyBgworkerShared *shared;
> 
> 'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
> 'parallel_shared' or something better?
> 
> ~~~
> 
> 1.8 src/backend/replication/logical/applybgworker.c -
> apply_bgworker_setup_dsm
> 
> +static void
> +apply_bgworker_setup_dsm(ApplyBgworkerState *wstate)
> +{
> + shm_toc_estimator e;
> + Size segsize;
> + dsm_segment *seg;
> + shm_toc    *toc;
> + ApplyBgworkerShared *shared;
> + shm_mq    *mq;
> 
> 'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
> 'parallel_shared' or something better?
> 
> ~~~

Not sure about this.

> 3.3 .../replication/logical/applybgworker.c
> 
> @@ -800,3 +800,47 @@ apply_bgworker_subxact_info_add(TransactionId
> current_xid)
>   MemoryContextSwitchTo(oldctx);
>   }
>  }
> +
> +/*
> + * Check if changes on this relation can be applied by an apply background
> + * worker.
> + *
> + * Although the commit order is maintained only allowing one process to
> commit
> + * at a time, the access order to the relation has changed. This could cause
> + * unexpected problems if the unique column on the replicated table is
> + * inconsistent with the publisher-side or contains non-immutable functions
> + * when applying transactions in the apply background worker.
> + */
> +void
> +apply_bgworker_relation_check(LogicalRepRelMapEntry *rel)
> 
> "only allowing" -> "by only allowing" (I think you mean this, right?)

Since I'm not a native English speaker, I'm not quite sure which of the two
descriptions you suggested is better (see #3.4 in [1]). I have now replaced the
earlier wording with your latest suggestion.

> 3.4
> 
> + /*
> + * Return if changes on this relation can be applied by an apply background
> + * worker.
> + */
> + if (rel->parallel_apply == PARALLEL_APPLY_SAFE)
> + return;
> +
> + /* We are in error mode and should give user correct error. */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot replicate target relation \"%s.%s\" using "
> + "subscription parameter streaming=parallel",
> + rel->remoterel.nspname, rel->remoterel.relname),
> + errdetail("The unique column on subscriber is not the unique "
> +    "column on publisher or there is at least one "
> +    "non-immutable function."),
> + errhint("Please change to use subscription parameter "
> + "streaming=on.")));
> 
> 3.4a.
> Of course, the code should give the user the "correct error" if there
> is an error (!), but having a comment explicitly saying so does not
> serve any purpose.
> 
> 3.4b.
> The logic might be simplified if it was written differently like:
> 
> + if (rel->parallel_apply != PARALLEL_APPLY_SAFE)
> + ereport(ERROR, ...

This is just to keep the style consistent with the function
apply_bgworker_relation_check.

> 3.8
> 
> + /* Initialize the flag. */
> + entry->parallel_apply = PARALLEL_APPLY_SAFE;
> 
> I previously suggested [1] (#3.6b) to move this. Consider, that you
> cannot logically flag the entry as "safe" until you are certain that
> it is safe. And you cannot be sure of that until you've passed all the
> checks this function is doing. Therefore IMO the assignment to
> PARALLEL_APPLY_SAFE should be the last line of the function.

Not sure about this.

> 3.11 src/backend/utils/cache/typcache.c - GetDomainConstraints
> 
> @@ -2540,6 +2540,23 @@ compare_values_of_enum(TypeCacheEntry *tcache,
> Oid arg1, Oid arg2)
>   return 0;
>  }
> 
> +/*
> + * GetDomainConstraints --- get DomainConstraintState list of
> specified domain type
> + */
> +List *
> +GetDomainConstraints(Oid type_id)
> +{
> + TypeCacheEntry *typentry;
> + List    *constraints = NIL;
> +
> + typentry = lookup_type_cache(type_id,
> TYPECACHE_DOMAIN_CONSTR_INFO);
> +
> + if(typentry->domainData != NULL)
> + constraints = typentry->domainData->constraints;
> +
> + return constraints;
> +}
> 
> This function can be simplified (if you want). e.g.
> 
> List *
> GetDomainConstraints(Oid type_id)
> {
> TypeCacheEntry *typentry;
> 
> typentry = lookup_type_cache(type_id, TYPECACHE_DOMAIN_CONSTR_INFO);
> 
> return typentry->domainData ? typentry->domainData->constraints : NIL;
> }

I just think the former one looks clearer.

4.2 && 4.3
> 2. src/backend/replication/logical/worker.c - start_table_sync
> 
> @@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos,
> char **myslotname)
>   }
>   PG_CATCH();
>   {
> + /*
> + * Emit the error message, and recover from the error state to an idle
> + * state
> + */
> + HOLD_INTERRUPTS();
> +
> + EmitErrorReport();
> + AbortOutOfAnyTransaction();
> + FlushErrorState();
> +
> + RESUME_INTERRUPTS();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, false);
> +
> + /* Set the retry flag. */
> + set_subscription_retry(true);
> +
>   if (MySubscription->disableonerr)
>   DisableSubscriptionAndExit();
> - else
> - {
> - /*
> - * Report the worker failed during table synchronization. Abort
> - * the current transaction so that the stats message is sent in an
> - * idle state.
> - */
> - AbortOutOfAnyTransaction();
> - pgstat_report_subscription_error(MySubscription->oid, false);
> 
> - PG_RE_THROW();
> - }
> + proc_exit(0);
>   }
> 
> But is it correct to set the 'retry' flag even if the
> MySubscription->disableonerr is true? Won’t that mean even after the
> user corrects the problem and then re-enabled the subscription it
> still won't let the streaming=parallel work, because that retry flag
> is set?
> 
> Also, Something seems wrong to me here - IIUC the patch changed this
> code because of the potential risk of an error within the
> set_subscription_retry function, but now if such an error happens the
> current code will bypass even getting to DisableSubscriptionAndExit,
> so the subscription won't have a chance to get disabled as the user
> might have wanted.
> 3. src/backend/replication/logical/worker.c - start_apply
> 
> @@ -3940,20 +3971,27 @@ start_apply(XLogRecPtr origin_startpos)
>   }
>   PG_CATCH();
>   {
> + /*
> + * Emit the error message, and recover from the error state to an idle
> + * state
> + */
> + HOLD_INTERRUPTS();
> +
> + EmitErrorReport();
> + AbortOutOfAnyTransaction();
> + FlushErrorState();
> +
> + RESUME_INTERRUPTS();
> +
> + /* Report the worker failed while applying changes */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> +
> + /* Set the retry flag. */
> + set_subscription_retry(true);
> +
>   if (MySubscription->disableonerr)
>   DisableSubscriptionAndExit();
> - else
> - {
> - /*
> - * Report the worker failed while applying changes. Abort the
> - * current transaction so that the stats message is sent in an
> - * idle state.
> - */
> - AbortOutOfAnyTransaction();
> - pgstat_report_subscription_error(MySubscription-
> >oid, !am_tablesync_worker());
> -
> - PG_RE_THROW();
> - }
>   }
> 
> (Same as previous review comment #2)
> 
> But is it correct to set the 'retry' flag even if the
> MySubscription->disableonerr is true? Won’t that mean even after the
> user corrects the problem and then re-enabled the subscription it
> still won't let the streaming=parallel work, because that retry flag
> is set?
> 
> Also, Something seems wrong to me here - IIUC the patch changed this
> code because of the potential risk of an error within the
> set_subscription_retry function, but now if such an error happens the
> current code will bypass even getting to DisableSubscriptionAndExit,
> so the subscription won't have a chance to get disabled as the user
> might have wanted.

=>4.2.a
=>4.3.a
I think this is the expected behavior.

=>4.2.b
=>4.3.b
Fixed this point. (The function set_subscription_retry is now invoked after
handling the "disableonerr" parameter.)

4.4
> 4. src/backend/replication/logical/worker.c - DisableSubscriptionAndExit
> 
>  /*
> - * After error recovery, disable the subscription in a new transaction
> - * and exit cleanly.
> + * Disable the subscription in a new transaction.
>   */
>  static void
>  DisableSubscriptionAndExit(void)
>  {
> - /*
> - * Emit the error message, and recover from the error state to an idle
> - * state
> - */
> - HOLD_INTERRUPTS();
> -
> - EmitErrorReport();
> - AbortOutOfAnyTransaction();
> - FlushErrorState();
> -
> - RESUME_INTERRUPTS();
> -
> - /* Report the worker failed during either table synchronization or apply */
> - pgstat_report_subscription_error(MyLogicalRepWorker->subid,
> - !am_tablesync_worker());
> -
>   /* Disable the subscription */
>   StartTransactionCommand();
>   DisableSubscription(MySubscription->oid);
> @@ -4231,8 +4252,6 @@ DisableSubscriptionAndExit(void)
>   ereport(LOG,
>   errmsg("logical replication subscription \"%s\" has been disabled
> due to an error",
>      MySubscription->name));
> -
> - proc_exit(0);
>  }
> 
> 4a.
> Hmm,  I think it is a bad idea to remove the "exiting" code from the
> function but still leave the function name the same as before saying
> "AndExit".
> 
> 4b.
> Also, now the patch is unconditionally doing proc_exit(0) in the
> calling code where previously it would do PG_RE_THROW. So it's a
> subtle difference from the path the code used to take for worker
> errors..

=>4.a
Fixed as suggested.

=>4.b
I think PG_RE_THROW would just end up reporting the error and exiting (see
the function StartBackgroundWorker). Since the error has already been reported
at the beginning of the catch block, I think it is fine to invoke proc_exit at
the end to exit.

4.5
> 5. src/backend/replication/logical/worker.c - set_subscription_retry
> 
> @@ -4467,3 +4486,63 @@ reset_apply_error_context_info(void)
>   apply_error_callback_arg.remote_attnum = -1;
>   set_apply_error_context_xact(InvalidTransactionId, InvalidXLogRecPtr);
>  }
> +
> +/*
> + * Set subretry of pg_subscription catalog.
> + *
> + * If retry is true, subscriber is about to exit with an error. Otherwise, it
> + * means that the transaction was applied successfully.
> + */
> +static void
> +set_subscription_retry(bool retry)
> +{
> + Relation rel;
> + HeapTuple tup;
> + bool started_tx = false;
> + bool nulls[Natts_pg_subscription];
> + bool replaces[Natts_pg_subscription];
> + Datum values[Natts_pg_subscription];
> +
> + if (MySubscription->retry == retry ||
> + am_apply_bgworker())
> + return;
> 
> Currently, I think this new 'subretry' field is only used to decide
> whether a retry can use an apply background worker or not. I think all
> this logic is *only* used when streaming=parallel. But AFAICT the
> logic for setting/clearing the retry flag is executed *always*
> regardless of the streaming mode.
> 
> So for all the times when the user did not ask for streaming=parallel
> doesn't this just cause unnecessary overhead for every transaction?

I think it is fine, because for a given transaction the catalog pg_subscription
is only really modified the first time the transaction fails to apply and the
first time it is successfully retried.

The rest of the comments have been addressed as suggested.
The new patches are attached in [2].

[1] - https://www.postgresql.org/message-id/CAHut%2BPtRNAOwFtBp_TnDWdC7UpcTxPJzQnrm%3DNytN7cVBt5zRQ%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB6275D64BE7726B0221B15F389E9F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Jul 25, 2022 at 21:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments on 0001:
> ======================

Thanks for your comments.

> 1.
> -       <structfield>substream</structfield> <type>bool</type>
> +       <structfield>substream</structfield> <type>char</type>
>        </para>
>        <para>
> -       If true, the subscription will allow streaming of in-progress
> -       transactions
> +       Controls how to handle the streaming of in-progress transactions:
> +       <literal>f</literal> = disallow streaming of in-progress transactions,
> +       <literal>t</literal> = spill the changes of in-progress transactions to
> +       disk and apply at once after the transaction is committed on the
> +       publisher,
> +       <literal>p</literal> = apply changes directly using a background worker
> 
> Shouldn't the description of 'p' be something like: apply changes
> directly using a background worker, if available, otherwise, it
> behaves the same as 't'

Improved as suggested.

> 2.
> Note that if an error happens when
> +          applying changes in a background worker, the finish LSN of the
> +          remote transaction might not be reported in the server log.
> 
> Is there any case where finish LSN can be reported when applying via
> background worker, if not, then we should use 'won't' instead of
> 'might not'?

Yes, I think such a case exists (the finish LSN can be reported when applying
via a background worker), so I did not change this.
For example, in the function apply_handle_stream_commit, if an error occurs
after invoking the function set_apply_error_context_xact, I think the error
message will contain the finish LSN.

> 3.
> +#define PG_LOGICAL_APPLY_SHM_MAGIC 0x79fb2447 // TODO Consider
> change
> 
> It is better to change this as the same magic number is used by
> PG_TEST_SHM_MQ_MAGIC

Improved as suggested. I changed it to a random magic number (0x787ca067) that
does not clash with any existing magic number in HEAD.
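So the define now reads (just the value change):
```
#define PG_LOGICAL_APPLY_SHM_MAGIC 0x787ca067
```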

> 4.
> + /* Ignore statistics fields that have been updated. */
> + s.cursor += IGNORE_SIZE_IN_MESSAGE;
> 
> Can we change the comment to: "Ignore statistics fields that have been
> updated by the main apply worker."? Will it be better to name the
> define as "SIZE_STATS_MESSAGE"?

Improved the comments and the macro name as suggested.

> 5.
> +/* Apply Background Worker main loop */
> +static void
> +LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared
> *shared)
> {
> ...
> ...
> 
> + apply_dispatch(&s);
> +
> + if (ConfigReloadPending)
> + {
> + ConfigReloadPending = false;
> + ProcessConfigFile(PGC_SIGHUP);
> + }
> +
> + MemoryContextSwitchTo(oldctx);
> + MemoryContextReset(ApplyMessageContext);
> 
> We should not process the config file under ApplyMessageContext. You
> should switch context before processing the config file. See other
> similar usages in the code.

Fixed as suggested.
In addition, the apply bgworker missed switching CurrentMemoryContext back to
oldctx when it receives a "STOP" message, leaving it in ApplyMessageContext
instead of back in TopMemoryContext. Fixed this by invoking
`MemoryContextSwitchTo(oldctx);` when processing the "STOP" message.
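With both changes, the tail of the per-message handling in the loop looks
roughly like the following (a sketch, not the exact patch hunk):
```
apply_dispatch(&s);

/* Switch back before doing anything that is not per-message work. */
MemoryContextSwitchTo(oldctx);
MemoryContextReset(ApplyMessageContext);

if (ConfigReloadPending)
{
    ConfigReloadPending = false;
    ProcessConfigFile(PGC_SIGHUP);
}
```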

> 6.
> +/* Apply Background Worker main loop */
> +static void
> +LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared
> *shared)
> {
> ...
> ...
> + MemoryContextSwitchTo(oldctx);
> + MemoryContextReset(ApplyMessageContext);
> + }
> +
> + MemoryContextSwitchTo(TopMemoryContext);
> + MemoryContextReset(ApplyContext);
> ...
> }
> 
> I don't see the need to reset ApplyContext here as we don't do
> anything in that context here.

Improved as suggested.
Removed the call to MemoryContextReset(ApplyContext).

The new patches are attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275D64BE7726B0221B15F389E9F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Thu, Aug 4, 2022 2:36 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> I also did some other improvements based on the suggestions posted in this
> thread. Attach the new patches.
> 

Thanks for updating the patch. Here are some comments on the v20-0001 patch.

1.
+typedef struct ApplyBgworkerShared
+{
+    slock_t    mutex;
+
+    /* Status of apply background worker. */
+    ApplyBgworkerStatus    status;
+
+    /* proto version of publisher. */
+    uint32    proto_version;
+
+    TransactionId    stream_xid;
+
+    /* id of apply background worker */
+    uint32    worker_id;
+} ApplyBgworkerShared;

Would it be better to modify the comment of "proto_version" to "Logical protocol
version"?

2. comment of handle_streamed_transaction()

+ * Exception: When the main apply worker is applying streaming transactions in
+ * parallel mode (e.g. when addressing LOGICAL_REP_MSG_RELATION or
+ * LOGICAL_REP_MSG_TYPE changes), then return false.

This comment doesn't look very clear, could we change it to:

Exception: In SUBSTREAM_PARALLEL mode, if the message type is
LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE, return false even if this is
the main apply worker.

3.
+/*
+ * There are three fields in message: start_lsn, end_lsn and send_time. Because
+ * we have updated these statistics in apply worker, we could ignore these
+ * fields in apply background worker. (see function LogicalRepApplyLoop)
+ */
+#define SIZE_STATS_MESSAGE (3 * sizeof(uint64))

updated these statistics in apply worker
->
updated these statistics in main apply worker

4.
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *shared)
+{
+    shm_mq_result shmq_res;
+    PGPROC       *registrant;
+    ErrorContextCallback errcallback;

I think we can define "shmq_res" in the for loop.

5.
+        /*
+         * We use first byte of message for additional communication between
+         * main Logical replication worker and apply background workers, so if
+         * it differs from 'w', then process it first.
+         */

between main Logical replication worker and apply background workers
->
between main apply worker and apply background workers

Regards,
Shi yu


On Tue, Aug 2, 2022 at 5:16 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, July 27, 2022 4:22 PM houzj.fnst@fujitsu.com wrote:
> >
> > On Tuesday, July 26, 2022 5:34 PM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> >
> > > 3.
> > > Why are we restricting parallel apply workers only for the streamed
> > > transactions, because streaming depends upon the size of the logical
> > > decoding work mem so making steaming and parallel apply tightly
> > > coupled seems too restrictive to me.  Do we see some obvious problems
> > > in applying other transactions in parallel?
> >
> > We thought there could be some conflict failure and deadlock if we parallel
> > apply normal transaction which need transaction dependency check[1]. But I
> > will do some more research for this and share the result soon.
>
> After thinking about this, I confirmed that it would be easy to cause deadlock
> error if we don't have additional dependency analysis and COMMIT order preserve
> handling for parallel apply normal transaction.
>
> Because the basic idea to parallel apply normal transaction in the first
> version is that: the main apply worker will receive data from pub and pass them
> to apply bgworker without applying by itself. And only before the apply
> bgworker apply the final COMMIT command, it need to wait for any previous
> transaction to finish to preserve the commit order. It means we could pass the
> next transaction's data to another apply bgworker before the previous
> transaction is committed in the first apply bgworker.
>
> In this approach, we have to do the dependency analysis because it's easy to
> cause a deadlock error when applying DMLs in parallel (see the attachment for the
> examples where the dead lock could happen). So, it's a bit different from
> streaming transaction.
>
> We could apply the next transaction only after the first transaction is
> committed in which approach we don't need the dependency analysis, but it would
> not bring noticeable performance improvement even if we start several apply
> workers to do that because the actual DMLs are not performed in parallel.
>
> Based on above, we plan to first introduce the patch to perform streaming
> logical transactions by background workers, and then introduce parallel apply
> normal transaction which design is different and need some additional handling.

Yeah, I think that makes sense.  Since the streamed transactions are
sent to the standby interleaved, we can take advantage of parallelism,
and along with that we can also avoid the I/O, so that will also
speed things up.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Mon, Aug 8, 2022 at 10:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Based on above, we plan to first introduce the patch to perform streaming
> > logical transactions by background workers, and then introduce parallel apply
> > normal transaction which design is different and need some additional handling.
>
> Yeah I think that makes sense.  Since the streamed transactions are
> sent to standby interleaved so we can take advantage of parallelism
> and along with that we can also avoid the I/O so that will also
> speedup.

Some review comments on the latest version of the patch.

1.
+/* Queue size of DSM, 16 MB for now. */
+#define DSM_QUEUE_SIZE    160000000

Why don't we directly use 16 * 1024 * 1024? That would be exactly 16 MB,
so it would match the comment and also be more readable.

2.
+/*
+ * There are three fields in message: start_lsn, end_lsn and send_time. Because
+ * we have updated these statistics in apply worker, we could ignore these
+ * fields in apply background worker. (see function LogicalRepApplyLoop)
+ */
+#define SIZE_STATS_MESSAGE (3 * sizeof(uint64))

Instead of assuming we have 3 uint64s, why don't we directly add 2 *
sizeof(XLogRecPtr) + sizeof(TimestampTz), so that if these data types
ever change we don't need to remember to update this as well.

3.
+/*
+ * Entry for a hash table we use to map from xid to our apply background worker
+ * state.
+ */
+typedef struct ApplyBgworkerEntry
+{
+    TransactionId xid;
+    ApplyBgworkerState *wstate;
+} ApplyBgworkerEntry;

Mention in the comment of the structure or for the member that xid is
the key of the hash.  Refer to other such structures for the
reference.

I am doing a more detailed review but this is what I got so far.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Mon, Aug 8, 2022 at 11:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Aug 8, 2022 at 10:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > Based on above, we plan to first introduce the patch to perform streaming
> > > logical transactions by background workers, and then introduce parallel apply
> > > normal transaction which design is different and need some additional handling.
> >
> > Yeah I think that makes sense.  Since the streamed transactions are
> > sent to standby interleaved so we can take advantage of parallelism
> > and along with that we can also avoid the I/O so that will also
> > speedup.
>
> Some review comments on the latest version of the patch.
>
> 1.
> +/* Queue size of DSM, 16 MB for now. */
> +#define DSM_QUEUE_SIZE    160000000
>
> Why don't we directly use 16 *1024 * 1024, that would be exactly 16 MB
> so it will match with comments and also it would be more readable.
>
> 2.
> +/*
> + * There are three fields in message: start_lsn, end_lsn and send_time. Because
> + * we have updated these statistics in apply worker, we could ignore these
> + * fields in apply background worker. (see function LogicalRepApplyLoop)
> + */
> +#define SIZE_STATS_MESSAGE (3 * sizeof(uint64))
>
> Instead of assuming you have 3 uint64 why don't directly add 2 *
> sizeof(XLogRecPtr) + sizeof(TimestampTz) so that if this data type
> ever changes
> we don't need to track that we will have to change this as well.
>
> 3.
> +/*
> + * Entry for a hash table we use to map from xid to our apply background worker
> + * state.
> + */
> +typedef struct ApplyBgworkerEntry
> +{
> +    TransactionId xid;
> +    ApplyBgworkerState *wstate;
> +} ApplyBgworkerEntry;
>
> Mention in the comment of the structure or for the member that xid is
> the key of the hash.  Refer to other such structures for the
> reference.
>
> I am doing a more detailed review but this is what I got so far.

Some more comments

+    /*
+     * Exit if any relation is not in the READY state and if any worker is
+     * handling the streaming transaction at the same time. Because for
+     * streaming transactions that is being applied in apply background
+     * worker, we cannot decide whether to apply the change for a relation
+     * that is not in the READY state (see should_apply_changes_for_rel) as we
+     * won't know remote_final_lsn by that time.
+     */
+    if (list_length(ApplyBgworkersFreeList) !=
list_length(ApplyBgworkersList) &&
+        !AllTablesyncsReady())
+    {
+        ereport(LOG,
+                (errmsg("logical replication apply workers for
subscription \"%s\" will restart",
+                        MySubscription->name),
+                 errdetail("Cannot handle streamed replication
transaction by apply "
+                           "background workers until all tables are
synchronized")));
+
+        proc_exit(0);
+    }

How can this situation occur? I mean, while starting a background
worker itself we can check whether all tables are sync-ready or not,
right?

+    /* Check the status of apply background worker if any. */
+    apply_bgworker_check_status();
+

What is the need for checking each worker's status on every commit?  I
mean, if there are a lot of small transactions along with some
streaming transactions,
won't that affect the apply performance for those small transactions?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

Thanks for updating the patch sets! The following are comments about v20-0001.

1. config.sgml

```
       <para>
        Specifies maximum number of logical replication workers. This includes
        both apply workers and table synchronization workers.
       </para>
```

I think you can add a description in the above paragraph, like
" This includes apply main workers, apply background workers, and table synchronization workers."

2. logical-replication.sgml

2.a Configuration Settings

```
   <varname>max_logical_replication_workers</varname> must be set to at least
   the number of subscriptions, again plus some reserve for the table
   synchronization.
```

I think you can add a description in the above paragraph, like
"... the number of subscriptions, plus some reserve for the table synchronization
 and the streaming transaction."

2.b Monitoring

```
  <para>
   Normally, there is a single apply process running for an enabled
   subscription.  A disabled subscription or a crashed subscription will have
   zero rows in this view.  If the initial data synchronization of any
   table is in progress, there will be additional workers for the tables
   being synchronized.
  </para>
```

I think you can add a sentence in the above paragraph, like
"... synchronized. Moreover, if a streaming transaction is applied in parallel,
there will be additional workers."

3. launcher.c

```
+       /* Sanity check : we don't support table sync in subworker. */
```

I think "Sanity check :" should be "Sanity check:", per other files.

4. worker.c

4.a handle_streamed_transaction()

```
-       /* not in streaming mode */
-       if (!in_streamed_transaction)
+       /* Not in streaming mode */
+       if (!(in_streamed_transaction || am_apply_bgworker()))
```

I think the comment should also mention the apply background worker case.

4.b handle_streamed_transaction()

```
-       Assert(stream_fd != NULL);
```

I think this assertion seems reasonable in the case of stream='on'.
Could you revive it and move it to a later part of the function, like after subxact_info_add(current_xid)?

4.c apply_handle_prepare_internal()

```
         * BeginTransactionBlock is necessary to balance the EndTransactionBlock
         * called within the PrepareTransactionBlock below.
         */
-       BeginTransactionBlock();
+       if (!IsTransactionBlock())
+               BeginTransactionBlock();
+
```

I think the comment should be "We must be in transaction block to balance...".

4.d apply_handle_stream_prepare()

```
- *
- * Logic is in two parts:
- * 1. Replay all the spooled operations
- * 2. Mark the transaction as prepared
  */
 static void
 apply_handle_stream_prepare(StringInfo s)
```

I think these comments are useful when stream='on',
so they should be moved to a later part.

5. applybgworker.c

5.a apply_bgworker_setup()

```
+       elog(DEBUG1, "setting up apply worker #%u", list_length(ApplyBgworkersList) + 1); 
```

"apply worker" should be "apply background worker".

5.b LogicalApplyBgwLoop()

```
+                               elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk,"
+                                        "waiting on shm_mq_receive", shared->worker_id);
```

A space is needed after the comma. I checked the server log, and the message was output like this:

```
[Apply BGW #1] ended processing streaming chunk,waiting on shm_mq_receive
```
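I.e. the fix is probably just a trailing space on the first string literal,
something like:
```
elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk, "
     "waiting on shm_mq_receive", shared->worker_id);
```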

6.

When I started up the apply background worker and did `SELECT * from pg_stat_subscription`, I got the following lines:

```
postgres=# select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 16400 | sub     | 22383 |       |              | -infinity                     | -infinity                     |                | -infinity
 16400 | sub     | 22312 |       | 0/6734740    | 2022-08-09 07:40:19.367676+00 | 2022-08-09 07:40:19.375455+00 | 0/6734740      | 2022-08-09 07:40:19.367676+00
(2 rows)
```


6.a

It seems that the upper row represents the apply background worker, but I think last_msg_send_time and
last_msg_receipt_time should be null.
Is it an initialization mistake?

```
$ ps aux | grep 22383
... postgres: logical replication apply background worker for subscription 16400
```

6.b

Currently, the documentation doesn't explain how to determine the type of a logical replication worker.
Could you add a description of it?
I think adding a "subworker" column is an alternative approach.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Thu, Aug 4, 2022 at 12:10 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jul 25, 2022 at 21:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Few comments on 0001:
> > ======================
>
> Thanks for your comments.
>

Review comments on v20-0001-Perform-streaming-logical-transactions-by-backgr
===============================================================
1.
+         <para>
+          If set to <literal>on</literal>, the incoming changes are written to
+          temporary files and then applied only after the transaction is
+          committed on the publisher.

It is not very clear that the transaction is applied when the commit
is received by the subscriber. Can we slightly change it to: "If set
to <literal>on</literal>, the incoming changes are written to
temporary files and then applied only after the transaction is
committed on the publisher and received by the subscriber."

2.
/* First time through, initialize apply workers hashtable */
+ if (ApplyBgworkersHash == NULL)
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(TransactionId);
+ ctl.entrysize = sizeof(ApplyBgworkerEntry);
+ ctl.hcxt = ApplyContext;
+
+ ApplyBgworkersHash = hash_create("logical apply workers hash", 8, &ctl,
+    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

I think it would be better if we start with a 16-element hash
table; 8 seems to be on the lower side.

3.
+/*
+ * Try to look up worker assigned before (see function apply_bgworker_get_free)
+ * inside ApplyBgworkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_find(TransactionId xid)

The above comment is not very clear. There doesn't seem to be any
function named apply_bgworker_get_free in the patch. Can we write this
comment as: "Find the previously assigned worker for the given
transaction, if any."

4.
/*
+ * Push apply error context callback. Fields will be filled applying a
+ * change.
+ */

/Fields will be filled applying a change./Fields will be filled while
applying a change.

5.
+void
+ApplyBgworkerMain(Datum main_arg)
+{
...
...
+ StartTransactionCommand();
+ oldcontext = MemoryContextSwitchTo(ApplyContext);
+
+ MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+ if (!MySubscription)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription %u will not "
+ "start because the subscription was removed during startup",
+ MyLogicalRepWorker->subid)));
+ proc_exit(0);
+ }
+
+ MySubscriptionValid = true;
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Setup synchronous commit according to the user's wishes */
+ SetConfigOption("synchronous_commit", MySubscription->synccommit,
+ PGC_BACKEND, PGC_S_OVERRIDE);
+
+ /* Keep us informed about subscription changes. */
+ CacheRegisterSyscacheCallback(SUBSCRIPTIONOID,
+   subscription_change_cb,
+   (Datum) 0);
+
+ CommitTransactionCommand();
...

This part of the code appears to be the same as what we have in
ApplyWorkerMain() except that the patch doesn't check whether the
subscription is enabled. Is there a reason not to have that check here
as well? Also, in ApplyWorkerMain() we LOG the type of worker, which
is missing here. Unless there is a specific reason to have
different code here, we should move this part to a common function and
call it both from ApplyWorkerMain() and ApplyBgworkerMain().

6. I think the code in ApplyBgworkerMain() to set
session_replication_role, search_path, and connect to the database
also appears to be the same in ApplyWorkerMain(). If so, that can also
be moved to the common function mentioned in the previous point.

7. I think we need to register for subscription rel map invalidation
(invalidate_syncing_table_states) in ApplyBgworkerMain similar to
ApplyWorkerMain. The reason is that we check the table state after
processing a commit or similar change record via a call to
process_syncing_tables.

8. In apply_bgworker_setup_dsm(), we should have handling for
dsm_create failure due to max_segments being reached, as we have in
InitializeParallelDSM(). We can follow the regular path of streaming
transactions in case we are not able to create the DSM, instead of
parallelizing it.

9.
+ shm_toc_initialize_estimator(&e);
+ shm_toc_estimate_chunk(&e, sizeof(ApplyBgworkerShared));
+ shm_toc_estimate_chunk(&e, (Size) queue_size);
+
+ shm_toc_estimate_keys(&e, 1 + 1);

Here, you can directly write 2 instead of (1 + 1) stuff. It is quite
clear that we need two keys here.

10.
apply_bgworker_wait_for()
{
...
+ /* Wait to be signalled. */
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+   WAIT_EVENT_LOGICAL_APPLY_BGWORKER_STATE_CHANGE);
...
}

Cast the result with (void) if we don't care about the return value.

11.
+static void
+apply_bgworker_shutdown(int code, Datum arg)
+{
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->status = APPLY_BGWORKER_EXIT;
+ SpinLockRelease(&MyParallelShared->mutex);

Is there a reason to not use apply_bgworker_set_status() directly?

12.
+ * Special case is if the first change comes from subtransaction, then
+ * we check that current_xid differs from stream_xid.
+ */
+void
+apply_bgworker_subxact_info_add(TransactionId current_xid)
+{
+ if (current_xid != stream_xid &&
+ !list_member_int(subxactlist, (int) current_xid))
...
...

I don't understand the above comment. Does that mean we don't need to
define a savepoint if the first change is from a subtransaction? Also,
keep an empty line before the above comment.

13.
+void
+apply_bgworker_subxact_info_add(TransactionId current_xid)
+{
+ if (current_xid != stream_xid &&
+ !list_member_int(subxactlist, (int) current_xid))
+ {
+ MemoryContext oldctx;
+ char spname[MAXPGPATH];
+
+ snprintf(spname, MAXPGPATH, "savepoint_for_xid_%u", current_xid);

To uniquely generate the savepoint name, would it be better to append the
subscription id as well? Something like pg_sp_<subid>_<xid>.

14. The CommitTransactionCommand() call in
apply_bgworker_subxact_info_add looks a bit odd as that function
neither seems to be starting the transaction command nor has any
comments explaining it. Shall we do it in caller where it is more
apparent to do the same?

15.
else
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u", subid);
+
  snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication worker");

Spurious new line

16.
@@ -1153,7 +1162,14 @@ replorigin_session_setup(RepOriginId node)

  Assert(session_replication_state->roident != InvalidRepOriginId);

- session_replication_state->acquired_by = MyProcPid;
+ if (must_acquire)
+ session_replication_state->acquired_by = MyProcPid;
+ else if (session_replication_state->acquired_by == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("apply background worker could not find replication state
slot for replication origin with OID %u",
+ node),
+ errdetail("There is no replication state slot set by its main apply
worker.")));

It is not a good idea to give apply-worker-specific messages from
this API because I don't think we can assume it is used only by
apply workers. It seems to me that if 'must_acquire' is false, then we
should either give an elog(ERROR, ..) or there should be an Assert for
the same. I am not completely sure, but maybe we can request the caller
to supply the PID (which has already acquired this origin) in case
must_acquire is false and then use it in an Assert/elog to ensure the
correct usage of the API. What do you think?

17. The commit message can explain the abort-related new information
this patch sends to the subscribers.

18.
+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to apply
+ * background workers (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes
+ * will also be applied in main apply worker).

In this, the part of the comment "(LOGICAL_REP_MSG_RELATION or
LOGICAL_REP_MSG_TYPE changes will also be applied in main apply
worker)" is not very clear. Do you mean to say that these messages are
applied by both the main and background apply workers? If so, then
please state that explicitly.

19.
- /* not in streaming mode */
- if (!in_streamed_transaction)
+ /* Not in streaming mode */
+ if (!(in_streamed_transaction || am_apply_bgworker()))
...
...
- /* write the change to the current file */
+ /* Write the change to the current file */
  stream_write_change(action, s);

I don't see the need to change the above comments.

20.
 static bool
 handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
 {
...
...
+ if (am_apply_bgworker())
+ {
+ /* Define a savepoint for a subxact if needed. */
+ apply_bgworker_subxact_info_add(current_xid);
+
+ return false;
+ }
+
+ if (apply_bgworker_active())

Isn't it better to use else if in the above code and probably else for
the remaining part of code in this function?

-- 
With Regards,
Amit Kapila.



On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Some more comments
>
> +    /*
> +     * Exit if any relation is not in the READY state and if any worker is
> +     * handling the streaming transaction at the same time. Because for
> +     * streaming transactions that is being applied in apply background
> +     * worker, we cannot decide whether to apply the change for a relation
> +     * that is not in the READY state (see should_apply_changes_for_rel) as we
> +     * won't know remote_final_lsn by that time.
> +     */
> +    if (list_length(ApplyBgworkersFreeList) !=
> list_length(ApplyBgworkersList) &&
> +        !AllTablesyncsReady())
> +    {
> +        ereport(LOG,
> +                (errmsg("logical replication apply workers for
> subscription \"%s\" will restart",
> +                        MySubscription->name),
> +                 errdetail("Cannot handle streamed replication
> transaction by apply "
> +                           "background workers until all tables are
> synchronized")));
> +
> +        proc_exit(0);
> +    }
>
> How this situation can occur? I mean while starting a background
> worker itself we can check whether all tables are sync ready or not
> right?
>

We are already checking at the start in apply_bgworker_can_start(), but
I think it is required to check at a later point in time as well
because new rels can be added to pg_subscription_rel via Alter
Subscription ... Refresh. I feel if that reasoning is correct then we
can probably expand the comments to make it clear.

> +    /* Check the status of apply background worker if any. */
> +    apply_bgworker_check_status();
> +
>
> What is the need to checking each worker status on every commit?  I
> mean if there are a lot of small transactions along with some
> streaming transactions
> then it will affect the apply performance for those small transactions?
>

I don't think performance will be a concern because this won't do any
costly operation unless an invalidation happens, in which case it will
access the system catalogs. However, if my above understanding is correct
that new tables can be added during the apply process, then I am not sure
doing it at commit time is sufficient/correct because it can change
even during the transaction.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Hi Wang,

> 6.a
> 
> It seems that the upper line represents the apply background worker, but I think
> last_msg_send_time and last_msg_receipt_time should be null.
> Is it like initialization mistake?

I checked again about the issue.

Attributes worker->last_send_time, worker->last_recv_time, and worker->reply_time
are initialized in logicalrep_worker_launch():

```
...
    TIMESTAMP_NOBEGIN(worker->last_send_time);
    TIMESTAMP_NOBEGIN(worker->last_recv_time);
    worker->reply_lsn = InvalidXLogRecPtr;
    TIMESTAMP_NOBEGIN(worker->reply_time);
...
```

The macro is defined in timestamp.h, and it seems that the values are initialized to PG_INT64_MIN.

```
#define DT_NOBEGIN        PG_INT64_MIN
#define DT_NOEND        PG_INT64_MAX

#define TIMESTAMP_NOBEGIN(j)    \
    do {(j) = DT_NOBEGIN;} while (0)
```


However, in pg_stat_get_subscription(), these values are regarded as null if they are zero.

```
        if (worker.last_send_time == 0)
            nulls[4] = true;
        else
            values[4] = TimestampTzGetDatum(worker.last_send_time);
        if (worker.last_recv_time == 0)
            nulls[5] = true;
        else
            values[5] = TimestampTzGetDatum(worker.last_recv_time);
```

I think the above lines are wrong; these values should be compared with PG_INT64_MIN.
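A minimal sketch of the comparison I have in mind (using DT_NOBEGIN from
timestamp.h instead of zero):

```
        if (worker.last_send_time == DT_NOBEGIN)
            nulls[4] = true;
        else
            values[4] = TimestampTzGetDatum(worker.last_send_time);
        if (worker.last_recv_time == DT_NOBEGIN)
            nulls[5] = true;
        else
            values[5] = TimestampTzGetDatum(worker.last_recv_time);
```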

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Tue, Aug 9, 2022 at 5:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Some more comments
> >
> > +    /*
> > +     * Exit if any relation is not in the READY state and if any worker is
> > +     * handling the streaming transaction at the same time. Because for
> > +     * streaming transactions that is being applied in apply background
> > +     * worker, we cannot decide whether to apply the change for a relation
> > +     * that is not in the READY state (see should_apply_changes_for_rel) as we
> > +     * won't know remote_final_lsn by that time.
> > +     */
> > +    if (list_length(ApplyBgworkersFreeList) !=
> > list_length(ApplyBgworkersList) &&
> > +        !AllTablesyncsReady())
> > +    {
> > +        ereport(LOG,
> > +                (errmsg("logical replication apply workers for
> > subscription \"%s\" will restart",
> > +                        MySubscription->name),
> > +                 errdetail("Cannot handle streamed replication
> > transaction by apply "
> > +                           "background workers until all tables are
> > synchronized")));
> > +
> > +        proc_exit(0);
> > +    }
> >
> > How this situation can occur? I mean while starting a background
> > worker itself we can check whether all tables are sync ready or not
> > right?
> >
>
> We are already checking at the start in apply_bgworker_can_start() but
> I think it is required to check at the later point of time as well
> because the new rels can be added to pg_subscription_rel via Alter
> Subscription ... Refresh. I feel if that reasoning is correct then we
> can probably expand comments to make it clear.
>
> > +    /* Check the status of apply background worker if any. */
> > +    apply_bgworker_check_status();
> > +
> >
> > What is the need to checking each worker status on every commit?  I
> > mean if there are a lot of small transactions along with some
> > streaming transactions
> > then it will affect the apply performance for those small transactions?
> >
>
> I don't think performance will be a concern because this won't do any
> costly operation unless invalidation happens in which case it will
> access system catalogs. However, if my above understanding is correct
> that new tables can be added during the apply process then not sure
> doing it at commit time is sufficient/correct because it can change
> even during the transaction.
>

One idea that may handle it cleanly is to check for the
SUBREL_STATE_SYNCDONE state in should_apply_changes_for_rel() and
error out for the apply bgworker. For the SUBREL_STATE_READY state, it
should return true, and for any other state it can return false. The
one advantage of this approach could be that the parallel apply worker
will give an error only if the corresponding transaction has performed
some operation on a relation that has reached the SYNCDONE state.
OTOH, checking at each transaction end can also lead to workers erroring
out even if the parallel apply transaction doesn't perform any
operation on a relation that is not in the READY state.
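
To illustrate the idea, the bgworker branch could look roughly like the
sketch below (am_apply_bgworker() is the name used in the patch; the error
message wording is just for illustration):

if (am_apply_bgworker())
{
    if (rel->state == SUBREL_STATE_READY)
        return true;

    if (rel->state == SUBREL_STATE_SYNCDONE)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot apply changes for relation \"%s.%s\" in an apply background worker until it is in READY state",
                        rel->remoterel.nspname, rel->remoterel.relname)));

    return false;
}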

-- 
With Regards,
Amit Kapila.



Here are some review comments for the patch v20-0001:

======

1. doc/src/sgml/catalogs.sgml

+       <literal>p</literal> = apply changes directly using a background
+       worker, if available, otherwise, it behaves the same as 't'

The different char values 'f', 't', 'p' are separated by commas (,) in
the list, which is normal for the pgdocs AFAIK. However, because of
this I don't think it is a good idea to use those other commas within
the description for 'p'; I suggest you remove those to avoid
ambiguity with the separators.

======

2. doc/src/sgml/protocol.sgml

@@ -3096,7 +3096,7 @@ psql "dbname=postgres replication=database" -c
"IDENTIFY_SYSTEM;"
      <listitem>
       <para>
        Protocol version. Currently versions <literal>1</literal>,
<literal>2</literal>,
-       and <literal>3</literal> are supported.
+       <literal>3</literal> and <literal>4</literal> are supported.
       </para>

Put a comma after the penultimate value like it had before.

======

3. src/backend/replication/logical/applybgworker.c - <general>

There are multiple function comments and other code comments in this
file that are missing a terminating period (.)

======

4. src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+/*
+ * Try to get a free apply background worker.
+ *
+ * If there is at least one worker in the free list, then take one. Otherwise,
+ * try to start a new apply background worker. If successful, cache it in
+ * ApplyBgworkersHash keyed by the specified xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_start(TransactionId xid)

SUGGESTION (for function comment)
Return the apply background worker that will be used for the specified xid.

If an apply background worker is found in the free list then re-use
it, otherwise start a fresh one. Cache the worker in ApplyBgworkersHash
keyed by the specified xid.

~~~

5.

+ /* Try to get a free apply background worker */
+ if (list_length(ApplyBgworkersFreeList) > 0)

if (list_length(ApplyBgworkersFreeList) > 0)

AFAIK a non-empty list is guaranteed to be not NIL, and an empty list
is guaranteed to be NIL. So if you want, the above can simply be
written as:

if (ApplyBgworkersFreeList)

~~~

6. src/backend/replication/logical/applybgworker.c - apply_bgworker_find

+/*
+ * Try to look up worker assigned before (see function apply_bgworker_get_free)
+ * inside ApplyBgworkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_find(TransactionId xid)

SUGGESTION (for function comment)
Find the worker previously assigned/cached for this xid. (see function
apply_bgworker_start)

~~~

7.

+ Assert(status == APPLY_BGWORKER_BUSY);
+
+ return entry->wstate;
+ }
+ else
+ return NULL;

IMO here it is better to just remove that 'else' and unconditionally
return NULL at the end of this function.

~~~

8. src/backend/replication/logical/applybgworker.c -
apply_bgworker_subxact_info_add

+ * Inside apply background worker we can figure out that new subtransaction was
+ * started if new change arrived with different xid. In that case we can define
+ * named savepoint, so that we were able to commit/rollback it separately
+ * later.
+ * Special case is if the first change comes from subtransaction, then
+ * we check that current_xid differs from stream_xid.
+ */
+void
+apply_bgworker_subxact_info_add(TransactionId current_xid)

It is not quite English. Can you improve it a bit?

SUGGESTION (maybe like this?)
The apply background worker can figure out if a new subtransaction was
started by checking if the new change arrived with different xid. In
that case define a named savepoint, so that we are able to
commit/rollback it separately later. A special case is when the first
change comes from subtransaction – this is determined by checking if
the current_xid differs from stream_xid.

======

9. src/backend/replication/logical/launcher.c - WaitForReplicationWorkerAttach

+ *
+ * Return false if the attach fails. Otherwise return true.
  */
-static void
+static bool
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,

Why not just say "Return whether the attach was successful."

~~~

10. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

+ /* Found the main worker, then try to stop it. */
+ if (worker)
+ logicalrep_worker_stop_internal(worker);

IMO the comment is kind of pointless because it only says what the
code is clearly doing. If you really wanted to reinforce that this worker
is a main apply worker then you could do that with code like:

if (worker)
{
Assert(!worker->subworker);
logicalrep_worker_stop_internal(worker);
}

~~~

11. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

@@ -599,6 +632,29 @@ logicalrep_worker_attach(int slot)
 static void
 logicalrep_worker_detach(void)
 {
+ /*
+ * This is the main apply worker, stop all the apply background workers we
+ * started before.
+ */
+ if (!MyLogicalRepWorker->subworker)

SUGGESTION (for comment)
This is the main apply worker. Stop all apply background workers
previously started from here.

~~~

12 src/backend/replication/logical/launcher.c - logicalrep_apply_bgworker_count

+/*
+ * Count the number of registered (not necessarily running) apply background
+ * workers for a subscription.
+ */
+int
+logicalrep_apply_bgworker_count(Oid subid)

SUGGESTION
Count the number of registered (but not necessarily running) apply
background workers for a subscription.

~~~

13.

+ /* Search for attached worker for a given subscription id. */
+ for (i = 0; i < max_logical_replication_workers; i++)

SUGGESTION
Scan all attached apply background workers, only counting those which
have the given subscription id.

======

14. src/backend/replication/logical/worker.c - apply_error_callback

+ {
+ if (errarg->remote_attnum < 0)
+ {
+ if (XLogRecPtrIsInvalid(errarg->finish_lsn))
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" in transaction
%u",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->remote_xid);
+ else
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" in transaction
%u finished at %X/%X",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->remote_xid,
+    LSN_FORMAT_ARGS(errarg->finish_lsn));
+ }
+ else
+ {
+ if (XLogRecPtrIsInvalid(errarg->finish_lsn))
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
in transaction %u",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->rel->remoterel.attnames[errarg->remote_attnum],
+    errarg->remote_xid);
+ else
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
in transaction %u finished at %X/%X",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->rel->remoterel.attnames[errarg->remote_attnum],
+    errarg->remote_xid,
+    LSN_FORMAT_ARGS(errarg->finish_lsn));
+ }
+ }

There is quite a lot of common code here:

"processing remote data for replication origin \"%s\" during \"%s\"
for replication target relation \"%s.%s\"

   errarg->origin_name,
   logicalrep_message_type(errarg->command),
   errarg->rel->remoterel.nspname,
   errarg->rel->remoterel.relname,

Is it worth trying to extract that common part to keep this code
shorter? E.g. it could be done easily just with some #defines.
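For instance, a rough (untested) sketch, with the macro name invented just
for illustration (the translatability of the message may need more thought):

#define APPLY_ERRCONTEXT_MSG_PREFIX \
    "processing remote data for replication origin \"%s\" during \"%s\" " \
    "for replication target relation \"%s.%s\" "

errcontext(APPLY_ERRCONTEXT_MSG_PREFIX "in transaction %u",
           errarg->origin_name,
           logicalrep_message_type(errarg->command),
           errarg->rel->remoterel.nspname,
           errarg->rel->remoterel.relname,
           errarg->remote_xid);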

======

15. src/include/replication/worker_internal.h

+ /* proto version of publisher. */
+ uint32 proto_version;

SUGGESTION
Protocol version of publisher

~~~

16.

+ /* id of apply background worker */
+ uint32 worker_id;

Uppercase comment

~~~

17.

+/*
+ * Struct for maintaining an apply background worker.
+ */
+typedef struct ApplyBgworkerState

I'm not sure what this comment means. Perhaps there are some words missing?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Aug 4, 2022 at 12:07 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Thurs, Jul 28, 2022 at 13:20 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Wang-san,
> >
> > Hi, I'm also interested in the patch and I started to review this.
> > Followings are comments about 0001.
>
> Thanks for your kindly review and comments.
> To avoid making this thread too long, I will reply to all of your comments
> (#1~#13) in this email.
>
> > 1. terminology
> >
> > In your patch a new worker "apply background worker" has been introduced,
> > but I thought it might be confused because PostgreSQL has already the worker
> > "background worker".
> > Both the apply worker and the apply bgworker are categorized as bgworkers.
> > Do you have any reasons not to use "apply parallel worker" or "apply streaming
> > worker"?
> > (Note that I'm not native English speaker)
>
> Since we will later consider applying non-streamed transactions in parallel, I
> think "apply streaming worker" might not be very suitable. I think PostgreSQL
> also has the worker "parallel worker", so for "apply parallel worker" and
> "apply background worker", I feel that "apply background worker" will make the
> relationship between workers more clear. ("[main] apply worker" and "apply
> background worker")
>

But, on similar lines, we do have vacuumparallel.c for parallelizing
index vacuum. I agree with Kuroda-San on this point that the currently
proposed terminology doesn't sound very clear. The other options
that come to my mind are "apply streaming transaction worker", "apply
parallel worker", and the file name could be applystreamworker.c,
applyparallel.c, applyparallelworker.c, etc. I see why you
are hesitant to call it "apply parallel worker", but it is quite
possible that even for non-streamed xacts we will share quite a
bit of this code.

Thoughts?

--
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, August 9, 2022 7:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Aug 4, 2022 at 12:10 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jul 25, 2022 at 21:50 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > Few comments on 0001:
> > > ======================
> >
> > Thanks for your comments.
> >
> 
> Review comments on
> v20-0001-Perform-streaming-logical-transactions-by-backgr
> ===================================================
> ============
> 1.
> +         <para>
> +          If set to <literal>on</literal>, the incoming changes are written to
> +          temporary files and then applied only after the transaction is
> +          committed on the publisher.
> 
> It is not very clear that the transaction is applied when the commit is received
> by the subscriber. Can we slightly change it to: "If set to <literal>on</literal>,
> the incoming changes are written to temporary files and then applied only after
> the transaction is committed on the publisher and received by the subscriber."

Changed.

> 2.
> /* First time through, initialize apply workers hashtable */
> + if (ApplyBgworkersHash == NULL)
> + {
> + HASHCTL ctl;
> +
> + MemSet(&ctl, 0, sizeof(ctl));
> + ctl.keysize = sizeof(TransactionId);
> + ctl.entrysize = sizeof(ApplyBgworkerEntry);
> + ctl.hcxt = ApplyContext;
> +
> + ApplyBgworkersHash = hash_create("logical apply workers hash", 8, &ctl,
> +    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
> 
> I think it would be better if we start with probably 16 element hash table, 8
> seems to be on the lower side.

Changed.

> 3.
> +/*
> + * Try to look up worker assigned before (see function
> +apply_bgworker_get_free)
> + * inside ApplyBgworkersHash for requested xid.
> + */
> +ApplyBgworkerState *
> +apply_bgworker_find(TransactionId xid)
> 
> The above comment is not very clear. There doesn't seem to be any function
> named apply_bgworker_get_free in the patch. Can we write this comment as:
> "Find the previously assigned worker for the given transaction, if any."

Changed the comments.

> 4.
> /*
> + * Push apply error context callback. Fields will be filled applying a
> + * change.
> + */
> 
> /Fields will be filled applying a change./Fields will be filled while applying a
> change.

Changed.

> 5.
> +void
> +ApplyBgworkerMain(Datum main_arg)
> +{
> ...
> ...
> + StartTransactionCommand();
> + oldcontext = MemoryContextSwitchTo(ApplyContext);
> +
> + MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
> + if (!MySubscription)
> + {
> + ereport(LOG,
> + (errmsg("logical replication apply worker for subscription %u will not "
> + "start because the subscription was removed during startup",
> + MyLogicalRepWorker->subid)));
> + proc_exit(0);
> + }
> +
> + MySubscriptionValid = true;
> + MemoryContextSwitchTo(oldcontext);
> +
> + /* Setup synchronous commit according to the user's wishes */
> + SetConfigOption("synchronous_commit", MySubscription->synccommit,
> + PGC_BACKEND, PGC_S_OVERRIDE);
> +
> + /* Keep us informed about subscription changes. */
> + CacheRegisterSyscacheCallback(SUBSCRIPTIONOID,
> +   subscription_change_cb,
> +   (Datum) 0);
> +
> + CommitTransactionCommand();
> ...
> 
> This part of the code appears to be the same as what we have in
> ApplyWorkerMain() except that the patch doesn't check whether the
> subscription is enabled. Is there a reason to not have that check here as well?
> Also, in ApplyWorkerMain() we LOG the type of worker, which is missing
> here. Unless there is a specific reason to have different code here, we should
> move this part to a common function and call it both from ApplyWorkerMain()
> and ApplyBgworkerMain().
>
> 6. I think the code in ApplyBgworkerMain() to set session_replication_role,
> search_path, and connect to the database also appears to be the same in
> ApplyWorkerMain(). If so, that can also be moved to the common function
> mentioned in the previous point.
> 
> 7. I think we need to register for subscription rel map invalidation
> (invalidate_syncing_table_states) in ApplyBgworkerMain similar to
> ApplyWorkerMain. The reason is that we check the table state after processing
> a commit or similar change record via a call to process_syncing_tables.

Agreed and changed.

> 8. In apply_bgworker_setup_dsm(), we should have handling related to
> dsm_create failure due to max_segments reached as we have in
> InitializeParallelDSM(). We can follow the regular path of streaming
> transactions in case we are not able to create DSM instead of parallelizing it.

Changed.
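
FYI, the fallback is implemented roughly like below (only a sketch; the actual
code in the patch may differ slightly), following the approach of
InitializeParallelDSM():

    segsize = shm_toc_estimate(&e);
    seg = dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS);
    if (seg == NULL)
        return false;    /* caller falls back to spilling changes to a file */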

> 9.
> + shm_toc_initialize_estimator(&e);
> + shm_toc_estimate_chunk(&e, sizeof(ApplyBgworkerShared));
> + shm_toc_estimate_chunk(&e, (Size) queue_size);
> +
> + shm_toc_estimate_keys(&e, 1 + 1);
> 
> Here, you can directly write 2 instead of (1 + 1) stuff. It is quite clear that we
> need two keys here.

Changed.

> 10.
> apply_bgworker_wait_for()
> {
> ...
> + /* Wait to be signalled. */
> + WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
> +   WAIT_EVENT_LOGICAL_APPLY_BGWORKER_STATE_CHANGE);
> ...
> }
> 
> Typecast with the void, if we don't care for the return value.

Changed.

> 11.
> +static void
> +apply_bgworker_shutdown(int code, Datum arg)
> +{
> + SpinLockAcquire(&MyParallelShared->mutex);
> + MyParallelShared->status = APPLY_BGWORKER_EXIT;
> + SpinLockRelease(&MyParallelShared->mutex);
> 
> Is there a reason to not use apply_bgworker_set_status() directly?

No, changed to use that function.

> 12.
> + * Special case is if the first change comes from subtransaction, then
> + * we check that current_xid differs from stream_xid.
> + */
> +void
> +apply_bgworker_subxact_info_add(TransactionId current_xid)
> +{
> + if (current_xid != stream_xid &&
> +     !list_member_int(subxactlist, (int) current_xid))
> ...
> ...
> 
> I don't understand the above comment. Does that mean we don't need to
> define a savepoint if the first change is from a subtransaction? Also, keep an
> empty line before the above comment.

After checking, I agree this comment is not very clear, so I have removed it
and improved the other comments.

> 13.
> +void
> +apply_bgworker_subxact_info_add(TransactionId current_xid)
> +{
> + if (current_xid != stream_xid &&
> +     !list_member_int(subxactlist, (int) current_xid))
> + {
> + MemoryContext oldctx;
> + char spname[MAXPGPATH];
> +
> + snprintf(spname, MAXPGPATH, "savepoint_for_xid_%u", current_xid);
> 
> To uniquely generate the savepoint name, it is better to append the
> subscription id as well? Something like pg_sp_<subid>_<xid>.

Changed.
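
The name is now generated along these lines (a sketch only; the exact buffer
size and variable names in the patch may differ):

    snprintf(spname, sizeof(spname), "pg_sp_%u_%u",
             MySubscription->oid, current_xid);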

> 14. The CommitTransactionCommand() call in
> apply_bgworker_subxact_info_add looks a bit odd as that function neither
> seems to be starting the transaction command nor has any comments
> explaining it. Shall we do it in caller where it is more apparent to do the same?

I think the CommitTransactionCommand here works together with the
DefineSavepoint call, because we need to invoke CommitTransactionCommand to
actually start the new subtransaction. I tried to add some comments to explain this.
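
The intention is roughly the following (a simplified sketch; the comments below
are my own wording, not necessarily the exact ones added in the patch):

    /*
     * Define a savepoint for the new subtransaction.  DefineSavepoint()
     * only records the request; the subtransaction is actually started by
     * the subsequent CommitTransactionCommand().
     */
    DefineSavepoint(spname);
    CommitTransactionCommand();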

> 15.
> else
>   snprintf(bgw.bgw_name, BGW_MAXLEN,
>   "logical replication worker for subscription %u", subid);
> +
>   snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication worker");
> 
> Spurious new line

Removed.

> 16.
> @@ -1153,7 +1162,14 @@ replorigin_session_setup(RepOriginId node)
> 
>   Assert(session_replication_state->roident != InvalidRepOriginId);
> 
> - session_replication_state->acquired_by = MyProcPid;
> + if (must_acquire)
> + session_replication_state->acquired_by = MyProcPid;
> + else if (session_replication_state->acquired_by == 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
> + errmsg("apply background worker could not find replication state slot for replication origin with OID %u",
> + node),
> + errdetail("There is no replication state slot set by its main apply worker.")));
> 
> It is not a good idea to give apply workers specific messages from this API
> because I don't think we can assume this is used by only apply workers. It seems
> to me that if 'must_acquire' is false, then we should either give elog(ERROR, ..)
> or there should be an Assert for the same. I am not completely sure but maybe
> we can request the caller to supply the PID (which already has acquired this
> origin) in case must_acquire is false and then use it in Assert/elog to ensure the
> correct usage of API. What do you think?

Agreed. I think we can replace 'must_acquire' with the PID of the worker that
acquired this origin (called 'acquired_by'). We can use this PID to check and
report the error if needed.
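
Roughly, the check could look like this (just a sketch; the final API and
message may differ):

    if (acquired_by == 0)
        session_replication_state->acquired_by = MyProcPid;
    else if (session_replication_state->acquired_by != acquired_by)
        elog(ERROR,
             "replication origin with OID %u was not acquired by the expected PID %d",
             node, acquired_by);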

> 17. The commit message can explain the abort-related new information this
> patch sends to the subscribers.

Added.

> 18.
> + * In streaming case (receiving a block of streamed transaction), for
> + * SUBSTREAM_ON mode, simply redirect it to a file for the proper
> + toplevel
> + * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to
> + apply
> + * background workers (LOGICAL_REP_MSG_RELATION or
> LOGICAL_REP_MSG_TYPE
> + changes
> + * will also be applied in main apply worker).
> 
> In this, part of the comment "(LOGICAL_REP_MSG_RELATION or
> LOGICAL_REP_MSG_TYPE changes will also be applied in main apply worker)" is
> not very clear. Do you mean to say that these messages are applied by both
> main and background apply workers, if so, then please state the same
> explicitly?

Changed.

> 19.
> - /* not in streaming mode */
> - if (!in_streamed_transaction)
> + /* Not in streaming mode */
> + if (!(in_streamed_transaction || am_apply_bgworker()))
> ...
> ...
> - /* write the change to the current file */
> + /* Write the change to the current file */
>   stream_write_change(action, s);
> 
> I don't see the need to change the above comments.

Remove the changes.

> 20.
>  static bool
>  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)  { ...
> ...
> + if (am_apply_bgworker())
> + {
> + /* Define a savepoint for a subxact if needed. */
> + apply_bgworker_subxact_info_add(current_xid);
> +
> + return false;
> + }
> +
> + if (apply_bgworker_active())
> 
> Isn't it better to use else if in the above code and probably else for the
> remaining part of code in this function?

Changed.

Attached the new version (v21) patch set, which addresses all the comments received so far.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, August 10, 2022 11:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Aug 9, 2022 at 5:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > >
> > > Some more comments
> > >
> > > +    /*
> > > +     * Exit if any relation is not in the READY state and if any worker is
> > > +     * handling the streaming transaction at the same time. Because for
> > > +     * streaming transactions that is being applied in apply background
> > > +     * worker, we cannot decide whether to apply the change for a
> relation
> > > +     * that is not in the READY state (see should_apply_changes_for_rel) as
> we
> > > +     * won't know remote_final_lsn by that time.
> > > +     */
> > > +    if (list_length(ApplyBgworkersFreeList) !=
> > > list_length(ApplyBgworkersList) &&
> > > +        !AllTablesyncsReady())
> > > +    {
> > > +        ereport(LOG,
> > > +                (errmsg("logical replication apply workers for
> > > subscription \"%s\" will restart",
> > > +                        MySubscription->name),
> > > +                 errdetail("Cannot handle streamed replication
> > > transaction by apply "
> > > +                           "background workers until all tables are
> > > synchronized")));
> > > +
> > > +        proc_exit(0);
> > > +    }
> > >
> > > How this situation can occur? I mean while starting a background
> > > worker itself we can check whether all tables are sync ready or not
> > > right?
> > >
> >
> > We are already checking at the start in apply_bgworker_can_start() but
> > I think it is required to check at the later point of time as well
> > because the new rels can be added to pg_subscription_rel via Alter
> > Subscription ... Refresh. I feel if that reasoning is correct then we
> > can probably expand comments to make it clear.
> >
> > > +    /* Check the status of apply background worker if any. */
> > > +    apply_bgworker_check_status();
> > > +
> > >
> > > What is the need to check each worker status on every commit?  I
> > > mean if there are a lot of small transactions along with some
> > > streaming transactions then it will affect the apply performance for
> > > those small transactions?
> > >
> >
> > I don't think performance will be a concern because this won't do any
> > costly operation unless invalidation happens in which case it will
> > access system catalogs. However, if my above understanding is correct
> > that new tables can be added during the apply process then not sure
> > doing it at commit time is sufficient/correct because it can change
> > even during the transaction.
> >
> 
> One idea that may handle it cleanly is to check for SUBREL_STATE_SYNCDONE
> state in should_apply_changes_for_rel() and error out for apply_bg_worker().
> For the SUBREL_STATE_READY state, it should return true and for any other
> state, it can return false. The one advantage of this approach could be that the
> parallel apply worker will give an error only if the corresponding transaction
> has performed any operation on the relation that has reached the SYNCDONE
> state.
> OTOH, checking at each transaction end can also lead to erroring out of
> workers even if the parallel apply transaction doesn't perform any operation on
> the relation which is not in the READY state.

I agree that it would be better to check at should_apply_changes_for_rel().

In addition, I think we should report an error if the table is not in the READY
state, and otherwise return true. Currently (on HEAD), if the table state is NOT
READY, we will skip all the changes related to the relation in a transaction,
because we invoke process_syncing_tables only at transaction end, which means the
state of the table won't change while applying a transaction.

But while the apply bgworker is applying the streaming transaction, the
main apply worker could have applied several normal transactions which could
change the state of the table several times (from INIT to READY). So, to prevent
the risky case where we skip part of the changes before the state reaches READY
and then start to apply the changes after READY within one transaction, we'd
better error out if the table is not in the READY state and restart without the
apply background worker.
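
So the check in should_apply_changes_for_rel() could look roughly like this
(a sketch only; the error code and message below are placeholders):

    if (am_apply_bgworker())
    {
        /*
         * The table state must not change in the middle of a streamed
         * transaction, so only the READY state is acceptable here.
         */
        if (rel->state != SUBREL_STATE_READY)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("cannot handle a streamed transaction using an apply background worker because the table is not in READY state")));

        return true;
    }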

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, August 9, 2022 4:49 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
> 
> Thanks for updating patch sets! Followings are comments about v20-0001.
> 
> 1. config.sgml
> 
> ```
>        <para>
>         Specifies maximum number of logical replication workers. This includes
>         both apply workers and table synchronization workers.
>        </para>
> ```
> 
> I think you can add a description in the above paragraph, like
> " This includes apply main workers, apply background workers, and table
> synchronization workers."

Changed.

> 2. logical-replication.sgml
> 
> 2.a Configuration Settings
> 
> ```
>    <varname>max_logical_replication_workers</varname> must be set to at
> least
>    the number of subscriptions, again plus some reserve for the table
>    synchronization.
> ```
> 
> I think you can add a description in the above paragraph, like
> "... the number of subscriptions, plus some reserve for the table
> synchronization
>  and the streaming transaction."

I don't think it's a must to add the number of streaming transactions here, as
apply still works even if no worker is available for the apply bgworker, as
explained in the documentation of the streaming option.

> 2.b Monitoring
> 
> ```
>   <para>
>    Normally, there is a single apply process running for an enabled
>    subscription.  A disabled subscription or a crashed subscription will have
>    zero rows in this view.  If the initial data synchronization of any
>    table is in progress, there will be additional workers for the tables
>    being synchronized.
>   </para>
> ```
> 
> I think you can add a sentence in the above paragraph, like
> "... synchronized. Moreover, if the streaming transaction is applied parallelly,
> there will be additional workers"

Added.

> 3. launcher.c
> 
> ```
> +       /* Sanity check : we don't support table sync in subworker. */
> ```
> 
> I think "Sanity check :" should be "Sanity check:", per other files.


Changed.

> 4. worker.c
> 
> 4.a handle_streamed_transaction()
> 
> ```
> -       /* not in streaming mode */
> -       if (!in_streamed_transaction)
> +       /* Not in streaming mode */
> +       if (!(in_streamed_transaction || am_apply_bgworker()))
> ```
> 
> I think the comment should also mention about apply background worker case.

Added.

> 4.b handle_streamed_transaction()
> 
> ```
> -       Assert(stream_fd != NULL);
> ```
> 
> I think this assersion seems reasonable in case of stream='on'.
> Could you revive it and move to later part in the function, like after
> subxact_info_add(current_xid)?

Added.

> 4.c apply_handle_prepare_internal()
> 
> ```
>          * BeginTransactionBlock is necessary to balance the
> EndTransactionBlock
>          * called within the PrepareTransactionBlock below.
>          */
> -       BeginTransactionBlock();
> +       if (!IsTransactionBlock())
> +               BeginTransactionBlock();
> +
> ```
> 
> I think the comment should be "We must be in transaction block to balance...".

Changed.

> 4.d apply_handle_stream_prepare()
> 
> ```
> - *
> - * Logic is in two parts:
> - * 1. Replay all the spooled operations
> - * 2. Mark the transaction as prepared
>   */
>  static void
>  apply_handle_stream_prepare(StringInfo s)
> ```
> 
> I think these comments are useful when stream='on',
> so it should be moved to later part.

I think we already have similar comments in the later part.

> 5. applybgworker.c
> 
> 5.a apply_bgworker_setup()
> 
> ```
> +       elog(DEBUG1, "setting up apply worker #%u",
> list_length(ApplyBgworkersList) + 1);
> ```
> 
> "apply worker" should be "apply background worker".
> 
> 5.b LogicalApplyBgwLoop()
> 
> ```
> +                               elog(DEBUG1, "[Apply BGW #%u] ended
> processing streaming chunk,"
> +                                        "waiting on shm_mq_receive",
> shared->worker_id);
> ```
> 
> A blank is needed after comma. I checked serverlog, and the message
> outputed like:
> 
> ```
> [Apply BGW #1] ended processing streaming chunk,waiting on
> shm_mq_receive
> ```

Changed.

> 6.
> 
> When I started up the apply background worker and did `SELECT * from
> pg_stat_subscription`, I got following lines:
> 
> ```
> postgres=# select * from pg_stat_subscription;
>  subid | subname |  pid  | relid | received_lsn |      last_msg_send_time
> |     last_msg_receipt_time     | latest_end_lsn |        latest_end
> _time
> -------+---------+-------+-------+--------------+----------------------------
> ---+-------------------------------+----------------+------------------
> -------------
>  16400 | sub     | 22383 |       |              | -infinity                     |
> -infinity                     |                | -infinity
>  16400 | sub     | 22312 |       | 0/6734740    | 2022-08-09
> 07:40:19.367676+00 | 2022-08-09 07:40:19.375455+00 | 0/6734740      |
> 2022-08-09 07:40:
> 19.367676+00
> (2 rows)
> ```
> 
> 
> 6.a
> 
> It seems that the upper line represents the apply background worker, but I
> think last_msg_send_time and last_msg_receipt_time should be null.
> Is it like initialization mistake?

Changed.

> ```
> $ ps aux | grep 22383
> ... postgres: logical replication apply background worker for subscription
> 16400
> ```
> 
> 6.b
> 
> Currently, the documentation doesn't clarify the method to determine the type
> of logical replication workers.
> Could you add descriptions about it?
> I think adding a column "subworker" is an alternative approach.

I am not quite sure whether it's necessary,
but I tried to add a new column (main_apply_pid) in a separate patch (0005).

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, August 11, 2022 3:48 PM houzj.fnst@fujitsu.com wrote: 
> 
> On Tuesday, August 9, 2022 7:00 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> >
> > Review comments on
> > v20-0001-Perform-streaming-logical-transactions-by-backgr
> 
> Attach new version(v21) patch set which addressed all the comments received
> so far.
> 

Sorry, I didn't include the documentation changes. Here is the complete patch set.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, August 10, 2022 5:40 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are some review comments for the patch v20-0001:
> ======
> 
> 1. doc/src/sgml/catalogs.sgml
> 
> +       <literal>p</literal> = apply changes directly using a background
> +       worker, if available, otherwise, it behaves the same as 't'
> 
> The different char values 'f','t','p' are separated by comma (,) in
> the list, which is normal for the pgdocs AFAIK. However, because of
> this I don't think it is a good idea to use those other commas within
> the description for  'p', I suggest you remove those ones to avoid
> ambiguity with the separators.

Changed.

> ======
> 
> 2. doc/src/sgml/protocol.sgml
> 
> @@ -3096,7 +3096,7 @@ psql "dbname=postgres replication=database" -c
> "IDENTIFY_SYSTEM;"
>       <listitem>
>        <para>
>         Protocol version. Currently versions <literal>1</literal>,
> <literal>2</literal>,
> -       and <literal>3</literal> are supported.
> +       <literal>3</literal> and <literal>4</literal> are supported.
>        </para>
> 
> Put a comma after the penultimate value like it had before.
> 


Changed.

> ======
> 
> 3. src/backend/replication/logical/applybgworker.c - <general>
> 
> There are multiple function comments and other code comments in this
> file that are missing a terminating period (.)
> 
> ======
> 

Changed.

> 4. src/backend/replication/logical/applybgworker.c - apply_bgworker_start
> 
> +/*
> + * Try to get a free apply background worker.
> + *
> + * If there is at least one worker in the free list, then take one. Otherwise,
> + * try to start a new apply background worker. If successful, cache it in
> + * ApplyBgworkersHash keyed by the specified xid.
> + */
> +ApplyBgworkerState *
> +apply_bgworker_start(TransactionId xid)
> 
> SUGGESTION (for function comment)
> Return the apply background worker that will be used for the specified xid.
> 
> If an apply background worker is found in the free list then re-use
> it, otherwise start a fresh one. Cache the worker ApplyBgworkersHash
> keyed by the specified xid.
> 
> ~~~
> 

Changed.

> 5.
> 
> + /* Try to get a free apply background worker */
> + if (list_length(ApplyBgworkersFreeList) > 0)
> 
> if (list_length(ApplyBgworkersFreeList) > 0)
> 
> AFAIK a non-empty list is guaranteed to be not NIL, and an empty list
> is guaranteed to be NIL. So if you want to the above can simply be
> written as:
> 
> if (ApplyBgworkersFreeList)
> 

Both ways are fine to me, so I kept the current style.

> ~~~
> 
> 6. src/backend/replication/logical/applybgworker.c - apply_bgworker_find
> 
> +/*
> + * Try to look up worker assigned before (see function
> apply_bgworker_get_free)
> + * inside ApplyBgworkersHash for requested xid.
> + */
> +ApplyBgworkerState *
> +apply_bgworker_find(TransactionId xid)
> 
> SUGGESTION (for function comment)
> Find the worker previously assigned/cached for this xid. (see function
> apply_bgworker_start)
> 

Changed.

> ~~~
> 
> 7.
> 
> + Assert(status == APPLY_BGWORKER_BUSY);
> +
> + return entry->wstate;
> + }
> + else
> + return NULL;
> 
> IMO here it is better to just remove that 'else' and unconditionally
> return NULL at the end of this function.
> 

Changed.

> ~~~
> 
> 8. src/backend/replication/logical/applybgworker.c -
> apply_bgworker_subxact_info_add
> 
> + * Inside apply background worker we can figure out that new subtransaction
> was
> + * started if new change arrived with different xid. In that case we can define
> + * named savepoint, so that we were able to commit/rollback it separately
> + * later.
> + * Special case is if the first change comes from subtransaction, then
> + * we check that current_xid differs from stream_xid.
> + */
> +void
> +apply_bgworker_subxact_info_add(TransactionId current_xid)
> 
> It is not quite English. Can you improve it a bit?
> 
> SUGGESTION (maybe like this?)
> The apply background worker can figure out if a new subtransaction was
> started by checking if the new change arrived with different xid. In
> that case define a named savepoint, so that we are able to
> commit/rollback it separately later. A special case is when the first
> change comes from subtransaction – this is determined by checking if
> the current_xid differs from stream_xid.
> 

Changed.

> ======
> 
> 9. src/backend/replication/logical/launcher.c -
> WaitForReplicationWorkerAttach
> 
> + *
> + * Return false if the attach fails. Otherwise return true.
>   */
> -static void
> +static bool
>  WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> 
> Why not just say "Return whether the attach was successful."
> 

Changed.

> ~~~
> 
> 10. src/backend/replication/logical/launcher.c - logicalrep_worker_stop
> 
> + /* Found the main worker, then try to stop it. */
> + if (worker)
> + logicalrep_worker_stop_internal(worker);
> 
> IMO the comment is kind of pointless because it only says what the
> code is clearly doing. If you really wanted to reinforce this worker
> is a main apply worker then you can do that with code like:
> 
> if (worker)
> {
> Assert(!worker->subworker);
> logicalrep_worker_stop_internal(worker);
> }
> 

Changed.

> ~~~
> 
> 11. src/backend/replication/logical/launcher.c - logicalrep_worker_detach
> 
> @@ -599,6 +632,29 @@ logicalrep_worker_attach(int slot)
>  static void
>  logicalrep_worker_detach(void)
>  {
> + /*
> + * This is the main apply worker, stop all the apply background workers we
> + * started before.
> + */
> + if (!MyLogicalRepWorker->subworker)
> 
> SUGGESTION (for comment)
> This is the main apply worker. Stop all apply background workers
> previously started from here.
> 

Changed.

> ~~~
> 
> 12 src/backend/replication/logical/launcher.c -
> logicalrep_apply_bgworker_count
> 
> +/*
> + * Count the number of registered (not necessarily running) apply background
> + * workers for a subscription.
> + */
> +int
> +logicalrep_apply_bgworker_count(Oid subid)
> 
> SUGGESTION
> Count the number of registered (but not necessarily running) apply
> background workers for a subscription.
> 

Changed.

> ~~~
> 
> 13.
> 
> + /* Search for attached worker for a given subscription id. */
> + for (i = 0; i < max_logical_replication_workers; i++)
> 
> SUGGESTION
> Scan all attached apply background workers, only counting those which
> have the given subscription id.
> 

Changed.

> ======
> 
> 14. src/backend/replication/logical/worker.c - apply_error_callback
> 
> + {
> + if (errarg->remote_attnum < 0)
> + {
> + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" in transaction
> %u",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->remote_xid);
> + else
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" in transaction
> %u finished at %X/%X",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->remote_xid,
> +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> + }
> + else
> + {
> + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> in transaction %u",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> +    errarg->remote_xid);
> + else
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> in transaction %u finished at %X/%X",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> +    errarg->remote_xid,
> +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> + }
> + }
> 
> There is quite a lot of common code here:
> 
> "processing remote data for replication origin \"%s\" during \"%s\"
> for replication target relation \"%s.%s\"
> 
>    errarg->origin_name,
>    logicalrep_message_type(errarg->command),
>    errarg->rel->remoterel.nspname,
>    errarg->rel->remoterel.relname,
> 
> Is it worth trying to extract that common part to keep this code
> shorter? E.g. It could be easily done just with some #defines
> 

I am not sure whether we have a clean way to change this. Any suggestions?

> ======
> 
> 15. src/include/replication/worker_internal.h
> 
> + /* proto version of publisher. */
> + uint32 proto_version;
> 
> SUGGESTION
> Protocol version of publisher
> 
> ~~~
> 

Changed.

> 16.
> 
> + /* id of apply background worker */
> + uint32 worker_id;
> 
> Uppercase comment
> 

Changed.

> 
> 17.
> 
> +/*
> + * Struct for maintaining an apply background worker.
> + */
> +typedef struct ApplyBgworkerState
> 
> I'm not sure what this comment means. Perhaps there are some words missing?
> 

I renamed the struct to ApplyBgworkerInfo, which sounds better to me, and changed the comments.

Best regards,
Hou zj

Here are some review comments for v20-0003:

(Sorry - the reviews are time consuming, so I am lagging slightly
behind the latest posted version)

======

1. <General>

1a.
There are a few comment modifications in this patch (e.g. changing
FROM "in an apply background worker" TO "using an apply background
worker"). e.g. I noticed lots of these in worker.c but they might be
in other files too.

Although these are good changes, they are just tweaks to new comments
introduced by patch 0001, so IMO such changes belong in that patch,
not in this one.

1b.
Actually, there are still some comments saying "by an apply background
worker...", some saying "using an apply background worker...", and
some saying "in the apply background worker...". Maybe they are all
OK, but it would be better if all such comments can be searched and made
to have consistent wording.

======

2. Commit message

2a.

Without these restrictions, the following scenario may occur:
The apply background worker lock a row when processing a streaming transaction,
after that the main apply worker tries to lock the same row when processing
another transaction. At this time, the main apply worker waits for the
streaming transaction to complete and the lock to be released, it won't send
subsequent data of the streaming transaction to the apply background worker;
the apply background worker waits to receive the rest of streaming transaction
and can't finish this transaction. Then the main apply worker will wait
indefinitely.

"background worker lock a row" -> "background worker locks a row"

"Then the main apply worker will wait indefinitely." -> really, you
already said the main apply worker is waiting, so I think this
sentence only needs to say: "Now a deadlock has occurred, so both
workers will wait indefinitely."

2b.

Text fragments are all common between:

i.   This commit message
ii.  Text in pgdocs CREATE SUBSCRIPTION
iii. Function comment for 'logicalrep_rel_mark_parallel_apply' in relation.c

After addressing other review comments please make sure all those 3
parts are worded same.

======

3. doc/src/sgml/ref/create_subscription.sgml

+          There are two requirements for using <literal>parallel</literal>
+          mode: 1) the unique column in the table on the subscriber-side should
+          also be the unique column on the publisher-side; 2) there cannot be
+          any non-immutable functions used by the subscriber-side replicated
+          table.

3a.
I am not sure – is "requirements" the correct word here, or maybe it
should be "prerequisites".

3b.
Is it correct to say "should also be", or should that say "must also be"?

======

4. src/backend/replication/logical/applybgworker.c -
apply_bgworker_relation_check

+ /*
+ * Skip check if not using apply background workers.
+ *
+ * If any worker is handling the streaming transaction, this check needs to
+ * be performed not only in the apply background worker, but also in the
+ * main apply worker. This is because without these restrictions, main
+ * apply worker may block apply background worker, which will cause
+ * infinite waits.
+ */
+ if (!am_apply_bgworker() &&
+ (list_length(ApplyBgworkersFreeList) == list_length(ApplyBgworkersList)))
+ return;

I struggled a bit to reconcile the comment with the condition. Is the
!am_apply_bgworker() part of this even needed – isn't the
list_length() check enough?

~~~

5.

+ /* We are in error mode and should give user correct error. */

I still [1, #3.4a] don't see the value in saying "should give correct
error" (e.g. what's the alternative?).

Maybe instead of that comment it can just say:
rel->parallel_apply = PARALLEL_APPLY_UNSAFE;

======

6. src/backend/replication/logical/proto.c - RelationGetUniqueKeyBitmap

+ /* Add referenced attributes to idindexattrs */
+ for (i = 0; i < indexRel->rd_index->indnatts; i++)
+ {
+ int attrnum = indexRel->rd_index->indkey.values[i];
+
+ /*
+ * We don't include non-key columns into idindexattrs
+ * bitmaps. See RelationGetIndexAttrBitmap.
+ */
+ if (attrnum != 0)
+ {
+ if (i < indexRel->rd_index->indnkeyatts &&
+ !bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, attunique))
+ attunique = bms_add_member(attunique,
+    attrnum - FirstLowInvalidHeapAttributeNumber);
+ }
+ }

There are 2x comments in that code that are referring to
'idindexattrs' but I think it is a cut/paste problem because that
variable name does not even exist in this copied function.

======

7. src/backend/replication/logical/relation.c -
logicalrep_rel_mark_parallel_apply

+ /* Initialize the flag. */
+ entry->parallel_apply = PARALLEL_APPLY_SAFE;

I have unsuccessfully repeated the same review comment several times
[1 #3.8] suggesting that this flag should not be initialized to SAFE.
IMO the state should remain as UNKNOWN until you are either sure it is
SAFE, or sure it is UNSAFE. Anyway, I'll give up on this point now;
let's see what other people think.

======

8. src/include/replication/logicalrelation.h

+/*
+ * States to determine if changes on one relation can be applied using an
+ * apply background worker.
+ */
+typedef enum ParallelApplySafety
+{
+ PARALLEL_APPLY_UNKNOWN = 0, /* unknown  */
+ PARALLEL_APPLY_SAFE, /* Can apply changes using an apply background
+    worker */
+ PARALLEL_APPLY_UNSAFE /* Can not apply changes using an apply
+    background worker */
+} ParallelApplySafety;
+

I think the values are self-explanatory so the comments for every
value add nothing here, particularly since the enum itself has a
comment saying the same thing. I'm not sure if you accidentally missed
my previous comment [1, #3.12b] about this, or just did not agree with
it.

======

9. .../subscription/t/015_stream.pl

+# "streaming = parallel" does not support non-immutable functions, so change
+# the function in the defult expression of column "c".
+$node_subscriber->safe_psql(
+ 'postgres', qq{
+ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT to_timestamp(1284352323);
+ALTER SUBSCRIPTION tap_sub SET(streaming = parallel, binary = off);
+});

9a.
typo "defult"

9b.
The problem with to_timestamp(1284352323) is that it looks like it
must be some special value, but in fact AFAIK you don't care at all
what timestamp value this is. I think it would be better here to just
use to_timestamp(0) or to_timestamp(999) or similar so the number is
obviously not something of importance.

======

10. .../subscription/t/016_stream.pl

+# "streaming = parallel" does not support non-immutable functions, so change
+# the function in the defult expression of column "c".
+$node_subscriber->safe_psql(
+ 'postgres', qq{
+ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT to_timestamp(1284352323);
+ALTER SUBSCRIPTION tap_sub SET(streaming = parallel);
+});

10a. Ditto 9a.
10b. Ditto 9b.

======

11. .../subscription/t/022_twophase_cascade.pl

+# "streaming = parallel" does not support non-immutable functions, so change
+# the function in the defult expression of column "c".
+$node_B->safe_psql(
+ 'postgres', "ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
to_timestamp(1284352323);");
+$node_C->safe_psql(
+ 'postgres', "ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
to_timestamp(1284352323);");
+

11a. Ditto 9a.
11b. Ditto 9b.

======

12. .../subscription/t/023_twophase_stream.pl

+# "streaming = parallel" does not support non-immutable functions, so change
+# the function in the defult expression of column "c".
+$node_subscriber->safe_psql(
+ 'postgres', qq{
+ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT to_timestamp(1284352323);
+ALTER SUBSCRIPTION tap_sub SET(streaming = parallel);
+});

12a. Ditto 9a.
12b. Ditto 9b.

======

13. .../subscription/t/032_streaming_apply.pl

+# Drop default value on the subscriber, now it works.
+$node_subscriber->safe_psql('postgres',
+ "ALTER TABLE test_tab1 ALTER COLUMN b DROP DEFAULT");

Maybe for tests like this it would be better to test that it works
OK using an immutable DEFAULT function, instead of just completely
removing the bad function to make it work.

I think maybe the same was done for TRIGGER tests. There was a test
for a trigger with a bad function, and then the trigger was removed.
What about including a test for the trigger with a good function?

------
[1] https://www.postgresql.org/message-id/CAHut%2BPv9cKurDQHtk-ygYp45-8LYdE%3D4sMZY-8UmbeDTGgECVg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Here are some review comments for v20-0004:

(This completes my reviews of the v20* patch set. Sorry, the reviews
are time consuming, so I am lagging slightly behind the latest posted
version)

======

1. doc/src/sgml/ref/create_subscription.sgml

@@ -245,6 +245,11 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
           also be the unique column on the publisher-side; 2) there cannot be
           any non-immutable functions used by the subscriber-side replicated
           table.
+          When applying a streaming transaction, if either requirement is not
+          met, the background worker will exit with an error.
+          The <literal>parallel</literal> mode is disregarded when retrying;
+          instead the transaction will be applied using <literal>on</literal>
+          mode.
          </para>

The "on mode" still sounds strange to me. Maybe it's just my personal
opinion, but I don’t really consider 'on' and 'off' to be "modes".
Anyway I already posted the same comment several times before [1,
#4.3]. Let's see what others think.

SUGGESTION
"using on mode" -> "using streaming = on"

======

2. src/backend/replication/logical/worker.c - start_table_sync

@@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos,
char **myslotname)
  }
  PG_CATCH();
  {
+ /*
+ * Emit the error message, and recover from the error state to an idle
+ * state
+ */
+ HOLD_INTERRUPTS();
+
+ EmitErrorReport();
+ AbortOutOfAnyTransaction();
+ FlushErrorState();
+
+ RESUME_INTERRUPTS();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
  if (MySubscription->disableonerr)
- DisableSubscriptionAndExit();
- else
- {
- /*
- * Report the worker failed during table synchronization. Abort
- * the current transaction so that the stats message is sent in an
- * idle state.
- */
- AbortOutOfAnyTransaction();
- pgstat_report_subscription_error(MySubscription->oid, false);
+ DisableSubscriptionOnError();

- PG_RE_THROW();
- }
+ /* Set the retry flag. */
+ set_subscription_retry(true);
+
+ proc_exit(0);
  }
  PG_END_TRY();

Perhaps the current code is OK, but I am not 100% sure if we should set
the retry flag when the disable_on_error is set, because the
subscription is not going to be retried (because it is disabled). And
later, if/when the user does enable the subscription, presumably that
will be after they have already addressed the problem that caused the
error/disablement in the first place.

~~~

3. src/backend/replication/logical/worker.c - start_apply

  PG_CATCH();
  {
+ /*
+ * Emit the error message, and recover from the error state to an idle
+ * state
+ */
+ HOLD_INTERRUPTS();
+
+ EmitErrorReport();
+ AbortOutOfAnyTransaction();
+ FlushErrorState();
+
+ RESUME_INTERRUPTS();
+
+ /* Report the worker failed while applying changes */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());
+
  if (MySubscription->disableonerr)
- DisableSubscriptionAndExit();
- else
- {
- /*
- * Report the worker failed while applying changes. Abort the
- * current transaction so that the stats message is sent in an
- * idle state.
- */
- AbortOutOfAnyTransaction();
- pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+ DisableSubscriptionOnError();

- PG_RE_THROW();
- }
+ /* Set the retry flag. */
+ set_subscription_retry(true);
  }
  PG_END_TRY();
 }

3a.
Same comment as #2

3b.
This PG_CATCH used to exit via either proc_exit(0) or PG_RE_THROW, but
what does it do now? My first impression is that there is a bug here due to
some missing code, because AFAICT the exception is caught and gobbled
up and then what...?

~~~

4. src/backend/replication/logical/worker.c - set_subscription_retry

+ if (MySubscription->retry == retry ||
+ am_apply_bgworker())
+ return;

4a.
I think this quick exit can be split and given some appropriate comments.

SUGGESTION (for example)
/* Fast path - if no state change then nothing to do */
if (MySubscription->retry == retry)
return;

/* Fast path - skip for apply background workers */
if (am_apply_bgworker())
return;

======

5. .../subscription/t/032_streaming_apply.pl

@@ -78,9 +78,13 @@ my $timer =
IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
 my $h = $node_publisher->background_psql('postgres', \$in, \$out, $timer,
  on_error_stop => 0);

+# ============================================================================

All those comment highlighting lines like "# ==============" really
belong in the earlier patch (0003 ?) when this TAP test file was
introduced.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPvrw%2BtgCEYGxv%2BnKrqg-zbJdYEXee6o4irPAsYoXcuUcw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, August 12, 2022 12:46 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for v20-0003:
> 
> (Sorry - the reviews are time consuming, so I am lagging slightly
> behind the latest posted version)

Thanks for your comments.

> 1. <General>
> 
> 1a.
> There are a few comment modifications in this patch (e.g. changing
> FROM "in an apply background worker" TO "using an apply background
> worker"). e.g. I noticed lots of these in worker.c but they might be
> in other files too.
> 
> Although these are good changes, these are just tweaks to new comments
> introduced by patch 0001, so IMO such changes belong in that patch,
> not in this one.
> 
> 1b.
> Actually, there are still some comments says "by an apply background
> worker///" and some saying "using an apply background worker..." and
> some saying "in the apply background worker...". Maybe they are all
> OK, but it will be better if all such can be searched and made to have
> consistent wording

Improved.

> 2. Commit message
> 
> 2a.
> 
> Without these restrictions, the following scenario may occur:
> The apply background worker lock a row when processing a streaming
> transaction,
> after that the main apply worker tries to lock the same row when processing
> another transaction. At this time, the main apply worker waits for the
> streaming transaction to complete and the lock to be released, it won't send
> subsequent data of the streaming transaction to the apply background worker;
> the apply background worker waits to receive the rest of streaming transaction
> and can't finish this transaction. Then the main apply worker will wait
> indefinitely.
> 
> "background worker lock a row" -> "background worker locks a row"
> 
> "Then the main apply worker will wait indefinitely." -> really, you
> already said the main apply worker is waiting, so I think this
> sentence only needs to say: "Now a deadlock has occurred, so both
> workers will wait indefinitely."
> 
> 2b.
> 
> Text fragments are all common between:
> 
> i.   This commit message
> ii.  Text in pgdocs CREATE SUBSCRIPTION
> iii. Function comment for 'logicalrep_rel_mark_parallel_apply' in relation.c
> 
> After addressing other review comments please make sure all those 3
> parts are worded same.

Improved.

> 3. doc/src/sgml/ref/create_subscription.sgml
> 
> +          There are two requirements for using <literal>parallel</literal>
> +          mode: 1) the unique column in the table on the subscriber-side should
> +          also be the unique column on the publisher-side; 2) there cannot be
> +          any non-immutable functions used by the subscriber-side replicated
> +          table.
> 
> 3a.
> I am not sure – is "requirements" the correct word here, or maybe it
> should be "prerequisites".
> 
> 3b.
> Is it correct to say "should also be", or should that say "must also be"?

Improved.

> 4. src/backend/replication/logical/applybgworker.c -
> apply_bgworker_relation_check
> 
> + /*
> + * Skip check if not using apply background workers.
> + *
> + * If any worker is handling the streaming transaction, this check needs to
> + * be performed not only in the apply background worker, but also in the
> + * main apply worker. This is because without these restrictions, main
> + * apply worker may block apply background worker, which will cause
> + * infinite waits.
> + */
> + if (!am_apply_bgworker() &&
> + (list_length(ApplyBgworkersFreeList) == list_length(ApplyBgworkersList)))
> + return;
> 
> I struggled a bit to reconcile the comment with the condition. Is the
> !am_apply_bgworker() part of this even needed – isn't the
> list_length() check enough?

We still need to perform this check in the apply bgworker. (Both lists are NIL
in the apply bgworker, so the list_length() condition alone would be true there
and the check would be skipped.)

> 5.
> 
> + /* We are in error mode and should give user correct error. */
> 
> I still [1, #3.4a] don't see the value in saying "should give correct
> error" (e.g. what's the alternative?).
> 
> Maybe instead of that comment it can just say:
> rel->parallel_apply = PARALLEL_APPLY_UNSAFE;

I changed the if-statement to report the error:
if 'parallel_apply' isn't 'PARALLEL_APPLY_SAFE', then report the error.

> 6. src/backend/replication/logical/proto.c - RelationGetUniqueKeyBitmap
> 
> + /* Add referenced attributes to idindexattrs */
> + for (i = 0; i < indexRel->rd_index->indnatts; i++)
> + {
> + int attrnum = indexRel->rd_index->indkey.values[i];
> +
> + /*
> + * We don't include non-key columns into idindexattrs
> + * bitmaps. See RelationGetIndexAttrBitmap.
> + */
> + if (attrnum != 0)
> + {
> + if (i < indexRel->rd_index->indnkeyatts &&
> + !bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, attunique))
> + attunique = bms_add_member(attunique,
> +    attrnum - FirstLowInvalidHeapAttributeNumber);
> + }
> + }
> 
> There are 2x comments in that code that are referring to
> 'idindexattrs' but I think it is a cut/paste problem because that
> variable name does not even exist in this copied function.

Fixed the comments.

> 7. src/backend/replication/logical/relation.c -
> logicalrep_rel_mark_parallel_apply
> 
> + /* Initialize the flag. */
> + entry->parallel_apply = PARALLEL_APPLY_SAFE;
> 
> I have unsuccessfully repeated the same review comment several times
> [1 #3.8] suggesting that this flag should not be initialized to SAFE.
> IMO the state should remain as UNKNOWN until you are either sure it is
> SAFE, or sure it is UNSAFE. Anyway, I'll give up on this point now;
> let's see what other people think.

Okay, I will follow the relevant comments later.

> 8. src/include/replication/logicalrelation.h
> 
> +/*
> + * States to determine if changes on one relation can be applied using an
> + * apply background worker.
> + */
> +typedef enum ParallelApplySafety
> +{
> + PARALLEL_APPLY_UNKNOWN = 0, /* unknown  */
> + PARALLEL_APPLY_SAFE, /* Can apply changes using an apply background
> +    worker */
> + PARALLEL_APPLY_UNSAFE /* Can not apply changes using an apply
> +    background worker */
> +} ParallelApplySafety;
> +
> 
> I think the values are self-explanatory so the comments for every
> value add nothing here, particularly since the enum itself has a
> comment saying the same thing. I'm not sure if you accidentally missed
> my previous comment [1, #3.12b] about this, or just did not agree with
> it.

Changed.

> 9. .../subscription/t/015_stream.pl
> 
> +# "streaming = parallel" does not support non-immutable functions, so change
> +# the function in the defult expression of column "c".
> +$node_subscriber->safe_psql(
> + 'postgres', qq{
> +ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
> to_timestamp(1284352323);
> +ALTER SUBSCRIPTION tap_sub SET(streaming = parallel, binary = off);
> +});
> 
> 9a.
> typo "defult"
> 
> 9b.
> The problem with to_timestamp(1284352323) is that it looks like it
> must be some special value, but in fact AFAIK you don't care at all
> what value timestamp this is. I think it would be better here to just
> use to_timestamp(0) or to_timestamp(999) or similar so the number is
> obviously not something of importance.
> 
> ======
> 
> 10. .../subscription/t/016_stream.pl
> 
> +# "streaming = parallel" does not support non-immutable functions, so change
> +# the function in the defult expression of column "c".
> +$node_subscriber->safe_psql(
> + 'postgres', qq{
> +ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
> to_timestamp(1284352323);
> +ALTER SUBSCRIPTION tap_sub SET(streaming = parallel);
> +});
> 
> 10a. Ditto 9a.
> 10b. Ditto 9b.
> 
> ======
> 
> 11. .../subscription/t/022_twophase_cascade.pl
> 
> +# "streaming = parallel" does not support non-immutable functions, so change
> +# the function in the defult expression of column "c".
> +$node_B->safe_psql(
> + 'postgres', "ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
> to_timestamp(1284352323);");
> +$node_C->safe_psql(
> + 'postgres', "ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
> to_timestamp(1284352323);");
> +
> 
> 11a. Ditto 9a.
> 11b. Ditto 9b.
> 
> ======
> 
> 12. .../subscription/t/023_twophase_stream.pl
> 
> +# "streaming = parallel" does not support non-immutable functions, so change
> +# the function in the defult expression of column "c".
> +$node_subscriber->safe_psql(
> + 'postgres', qq{
> +ALTER TABLE test_tab ALTER COLUMN c SET DEFAULT
> to_timestamp(1284352323);
> +ALTER SUBSCRIPTION tap_sub SET(streaming = parallel);
> +});
> 
> 12a. Ditto 9a.
> 12b. Ditto 9b.

Improved.

> 13. .../subscription/t/032_streaming_apply.pl
> 
> +# Drop default value on the subscriber, now it works.
> +$node_subscriber->safe_psql('postgres',
> + "ALTER TABLE test_tab1 ALTER COLUMN b DROP DEFAULT");
> 
> Maybe for these tests like this it would be better to test if it works
> OK using an immutable DEFAULT function instead of just completely
> removing the bad function to make it work.
> 
> I think maybe the same was done for TRIGGER tests. There was a test
> for a trigger with a bad function, and then the trigger was removed.
> What about including a test for the trigger with a good function?

Improved.

Attach the new patches.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, August 12, 2022 17:22 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for v20-0004:
> 
> (This completes my reviews of the v20* patch set. Sorry, the reviews
> are time consuming, so I am lagging slightly behind the latest posted
> version)

Thanks for your comments.

> 1. doc/src/sgml/ref/create_subscription.sgml
> 
> @@ -245,6 +245,11 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
>            also be the unique column on the publisher-side; 2) there cannot be
>            any non-immutable functions used by the subscriber-side replicated
>            table.
> +          When applying a streaming transaction, if either requirement is not
> +          met, the background worker will exit with an error.
> +          The <literal>parallel</literal> mode is disregarded when retrying;
> +          instead the transaction will be applied using <literal>on</literal>
> +          mode.
>           </para>
> 
> The "on mode" still sounds strange to me. Maybe it's just my personal
> opinion, but I don’t really consider 'on' and 'off' to be "modes".
> Anyway I already posted the same comment several times before [1,
> #4.3]. Let's see what others think.
> 
> SUGGESTION
> "using on mode" -> "using streaming = on"

Okay, I will follow the relevant comments later.

> 2. src/backend/replication/logical/worker.c - start_table_sync
> 
> @@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos,
> char **myslotname)
>   }
>   PG_CATCH();
>   {
> + /*
> + * Emit the error message, and recover from the error state to an idle
> + * state
> + */
> + HOLD_INTERRUPTS();
> +
> + EmitErrorReport();
> + AbortOutOfAnyTransaction();
> + FlushErrorState();
> +
> + RESUME_INTERRUPTS();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, false);
> +
>   if (MySubscription->disableonerr)
> - DisableSubscriptionAndExit();
> - else
> - {
> - /*
> - * Report the worker failed during table synchronization. Abort
> - * the current transaction so that the stats message is sent in an
> - * idle state.
> - */
> - AbortOutOfAnyTransaction();
> - pgstat_report_subscription_error(MySubscription->oid, false);
> + DisableSubscriptionOnError();
> 
> - PG_RE_THROW();
> - }
> + /* Set the retry flag. */
> + set_subscription_retry(true);
> +
> + proc_exit(0);
>   }
>   PG_END_TRY();
> 
> Perhaps current code is OK, but I am not 100% sure if we should set
> the retry flag when the disable_on_error is set, because the
> subscription is not going to be retried (because it is disabled). And
> later, if/when the user does enable the subscription, presumably that
> will be after they have already addressed the problem that caused the
> error/disablement in the first place.

I think it is okay, because even after addressing the problem, the subscription
is still *retrying* the failed transaction. And, in the worst case, it just
applies the first failed streaming transaction using "on" mode instead of
"parallel" mode.

> 3. src/backend/replication/logical/worker.c - start_apply
> 
>   PG_CATCH();
>   {
> + /*
> + * Emit the error message, and recover from the error state to an idle
> + * state
> + */
> + HOLD_INTERRUPTS();
> +
> + EmitErrorReport();
> + AbortOutOfAnyTransaction();
> + FlushErrorState();
> +
> + RESUME_INTERRUPTS();
> +
> + /* Report the worker failed while applying changes */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> +
>   if (MySubscription->disableonerr)
> - DisableSubscriptionAndExit();
> - else
> - {
> - /*
> - * Report the worker failed while applying changes. Abort the
> - * current transaction so that the stats message is sent in an
> - * idle state.
> - */
> - AbortOutOfAnyTransaction();
> - pgstat_report_subscription_error(MySubscription-
> >oid, !am_tablesync_worker());
> + DisableSubscriptionOnError();
> 
> - PG_RE_THROW();
> - }
> + /* Set the retry flag. */
> + set_subscription_retry(true);
>   }
>   PG_END_TRY();
>  }
> 
> 3a.
> Same comment as #2
> 
> 3b.
> This PG_CATCH used to leave by either proc_exit(0) or PG_RE_THROW but
> what does it do now? My first impression is there is a bug here due to
> some missing code, because AFAICT the exception is caught and gobbled
> up and then what...?

=>3a.
See the reply to #2.
=>3b.
proc_exit(0) is invoked after start_apply() returns. See function
ApplyWorkerMain.
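
Roughly, the tail of ApplyWorkerMain looks like this (simplified, just to
show the flow):

    start_apply(origin_startpos);   /* its PG_CATCH now reports the error and returns */
    proc_exit(0);                   /* so the worker still exits cleanly here */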

> 4. src/backend/replication/logical/worker.c - set_subscription_retry
> 
> + if (MySubscription->retry == retry ||
> + am_apply_bgworker())
> + return;
> 
> 4a.
> I think this quick exit can be split and given some appropriate comments
> 
> SUGGESTION (for example)
> /* Fast path - if no state change then nothing to do */
> if (MySubscription->retry == retry)
> return;
> 
> /* Fast path - skip for apply background workers */
> if (am_apply_bgworker())
> return;

Changed.

> 5. .../subscription/t/032_streaming_apply.pl
> 
> @@ -78,9 +78,13 @@ my $timer =
> IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
>  my $h = $node_publisher->background_psql('postgres', \$in, \$out, $timer,
>   on_error_stop => 0);
> 
> +#
> =============================================================
> ===============
> 
> All those comment highlighting lines like "# ==============" really
> belong in the earlier patch (0003 ?) when this TAP test file was
> introduced.

Changed.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275739E73E8BEC5D13FB6739E6B9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, August 16, 2022 15:33 PM I wrote:
> Attach the new patches.

I found that cfbot has a failure.
After investigation, I think it is because the worker's exit state is not set
correctly. So I made some slight modifications.

Attach the new patches.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Wed, Aug 17, 2022 2:28 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> On Tues, August 16, 2022 15:33 PM I wrote:
> > Attach the new patches.
> 
> I found that cfbot has a failure.
> After investigation, I think it is because the worker's exit state is not set
> correctly. So I made some slight modifications.
> 
> Attach the new patches.
> 

Thanks for updating the patch. Here are some comments.

0003 patch
==============
1. src/backend/replication/logical/applybgworker.c
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot replicate target relation \"%s.%s\" using "
+                        "subscription parameter streaming=parallel",
+                        rel->remoterel.nspname, rel->remoterel.relname),
+                 errdetail("The unique column on subscriber is not the unique "
+                           "column on publisher or there is at least one "
+                           "non-immutable function."),
+                 errhint("Please change to use subscription parameter "
+                         "streaming=on.")));

Should we use "%s" instead of "streaming=parallel" and "streaming=on"?
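
For example, something along these lines (just to illustrate the idea; the
exact wording is up to you):

    errmsg("cannot replicate target relation \"%s.%s\" using "
           "subscription parameter %s",
           rel->remoterel.nspname, rel->remoterel.relname,
           "streaming=parallel"),
    ...
    errhint("Please change to use subscription parameter %s.",
            "streaming=on")));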

2. src/backend/replication/logical/applybgworker.c
+     * If any worker is handling the streaming transaction, this check needs to
+     * be performed not only using the apply background worker, but also in the
+     * main apply worker. This is because without these restrictions, main

this check needs to be performed not only using the apply background worker, but
also in the main apply worker.
->
this check not only needs to be performed by apply background worker, but also
by the main apply worker

3. src/backend/replication/logical/relation.c
+    if (ukey)
+    {
+        i = -1;
+        while ((i = bms_next_member(ukey, i)) >= 0)
+        {
+            attnum = AttrNumberGetAttrOffset(i + FirstLowInvalidHeapAttributeNumber);
+
+            if (entry->attrmap->attnums[attnum] < 0 ||
+                !bms_is_member(entry->attrmap->attnums[attnum], entry->remoterel.attunique))
+            {
+                entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
+                return;
+            }
+        }
+
+        bms_free(ukey);

It looks like we need to call bms_free() before returning, right?
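
For example (a sketch based only on the quoted hunk, showing the early-return
path):

    if (entry->attrmap->attnums[attnum] < 0 ||
        !bms_is_member(entry->attrmap->attnums[attnum], entry->remoterel.attunique))
    {
        entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
        bms_free(ukey);         /* avoid leaking the Bitmapset on early return */
        return;
    }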

4. src/backend/replication/logical/relation.c
+        /* We don't need info for dropped or generated attributes */
+        if (att->attisdropped || att->attgenerated)
+            continue;

Would it be better to change the comment to:
We don't check dropped or generated attributes

5. src/test/subscription/t/032_streaming_apply.pl
+$node_publisher->wait_for_catchup($appname);
+
+# Then we check the foreign key on partition table.
+$node_publisher->wait_for_catchup($appname);

Here, wait_for_catchup() is called twice, we can remove the second one.

6. src/backend/replication/logical/applybgworker.c
+        /* If any workers (or the postmaster) have died, we have failed. */
+        if (status == APPLY_BGWORKER_EXIT)
+            ereport(ERROR,
+                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                     errmsg("background worker %u failed to apply transaction %u",
+                            entry->wstate->shared->worker_id,
+                            entry->wstate->shared->stream_xid)));

Should we change the error message to "apply background worker %u failed to
apply transaction %u" ? To be consistent with the error message in
apply_bgworker_wait_for().

0004 patch
==============
1.
I saw that the commit message says:
If the subscriber exits with an error, this flag will be set true, and
whenever the transaction is applied successfully, this flag is reset false.

"subretry" is set to false if a transaction is applied successfully, it looks
similar to what clear_subscription_skip_lsn() does, so maybe we should remove
the following change in apply_handle_stream_abort()? Or only call
set_subscription_retry() when rolling back the toplevel transaction.

@@ -1671,6 +1688,9 @@ apply_handle_stream_abort(StringInfo s)
              */
             serialize_stream_abort(xid, subxid);
         }
+
+        /* Reset the retry flag. */
+        set_subscription_retry(false);
     }
 
     reset_apply_error_context_info();

2. src/backend/replication/logical/worker.c
+    /* Reset subretry */
+    values[Anum_pg_subscription_subretry - 1] = BoolGetDatum(retry);
+    replaces[Anum_pg_subscription_subretry - 1] = true;

/* Reset subretry */
->
/* Set subretry */

3.
+# Insert dependent data on the publisher, now it works.
+$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab2 VALUES(1)");

In the case that the DELETE change from publisher has not been applied yet when
executing the INSERT, the INSERT will fail.

0005 patch
==============
1.
+      <para>
+       Process ID of the main apply worker, if this process is a apply
+       background worker. NULL if this process is a main apply worker or a
+       synchronization worker.
+      </para></entry>

a apply background worker
->
an apply background worker

Regards,
Shi yu

Here are my review comments for patch v21-0001:

Note - There are some "general" comments which will result in lots of
smaller changes. The subsequent "detailed" review comments have some
overlap with these general comments but I expect some will be missed
so please search/replace to fix all code related to those general
comments.

======

1. GENERAL - main_worker_pid and replorigin_session_setup

Quite a few of my subsequent review comments below are related to the
somewhat tricky (IMO) change to the code for this area. Here is a
summary of some things that can be done to clean/simplify this logic.

1a.
Make the existing replorigin_session_setup function just be a wrapper
that delegates to the other function passing the acquired_by as 0.
This is because in every case but one (in the apply bg worker main) we
are always passing 0, and IMO there is no need to spread the messy
extra param to places that do not use it.

1b.
'main_worker_pid' is a confusing member name given the way it gets
used - e.g. not even set when you actually *are* the main apply
worker? You can still keep all the same logic, but just change the
name to something more like 'apply_leader_pid' - then the code can
make sense because the main apply workers have no "apply leader" but
the apply background workers do.

1c.
IMO it will be much better to use pid_t and InvalidPid for the type
and the unset values of this member.

1d.
The checks/Asserts for main_worker_pid are confusing to read (e.g.
Assert(worker->main_worker_pid != 0) means the worker is an apply
background worker). IMO there should be convenient macros for these -
then the code can be readable again.
e.g.
#define isApplyMainWorker(worker) (worker->apply_leader_pid == InvalidPid)
#define isApplyBgWorker(worker) (worker->apply_leader_pid != InvalidPid)

======

2. GENERAL - ApplyBgworkerInfo

I like that the struct ApplyBgworkerState was renamed to the more
appropriate name ApplyBgworkerInfo. But now all the old variable names
(e.g. 'wstate') and parameters must be updated as well. Please
search/replace them all in code and comments.

e.g.
ApplyBgworkerInfo *wstate

should now be something like:
ApplyBgworkerInfo *winfo;

======

3. GENERAL - ApplyBgWorkerStatus --> ApplyBgworkerState

IMO the enum should be changed to ApplyBgWorkerState because the
values all represent the discrete state that the bgworker is at. See
the top StackOverflow answer here [1] which is the same as the point I
am trying to make with this comment.

This is a simple mechanical renaming exercise, but it will impact lots
of variables, parameters, function names, and comments. Please
search/replace to get them all.

======

4. Commit message

In addition, the patch extends the logical replication STREAM_ABORT message so
that abort_time and abort_lsn can also be sent which can be used to update the
replication origin in apply background worker when the streaming transaction is
aborted.

4a.
Should this para also mention something about the introduction of
protocol version 4?

4b.
Should this para also mention that these extensions are not strictly
mandatory for the parallel streaming to still work?

======

5. doc/src/sgml/catalogs.sgml

       <para>
-       If true, the subscription will allow streaming of in-progress
-       transactions
+       Controls how to handle the streaming of in-progress transactions:
+       <literal>f</literal> = disallow streaming of in-progress transactions,
+       <literal>t</literal> = spill the changes of in-progress transactions to
+       disk and apply at once after the transaction is committed on the
+       publisher,
+       <literal>p</literal> = apply changes directly using a background
+       worker if available(same as 't' if no worker is available)
       </para></entry>

Missing whitespace before '('

======

6. doc/src/sgml/logical-replication.sgml

@@ -1334,7 +1344,8 @@ CONTEXT:  processing remote data for replication
origin "pg_16395" during "INSER
    subscription.  A disabled subscription or a crashed subscription will have
    zero rows in this view.  If the initial data synchronization of any
    table is in progress, there will be additional workers for the tables
-   being synchronized.
+   being synchronized. Moreover, if the streaming transaction is applied
+   parallelly, there will be additional workers.
   </para>

"applied parallelly" sounds a bit strange.

SUGGESTION-1
Moreover, if the streaming transaction is applied in parallel, there
will be additional workers.

SUGGESTION-2
Moreover, if the streaming transaction is applied using 'parallel'
mode, there will be additional workers.

======

7. doc/src/sgml/protocol.sgml

@@ -3106,6 +3106,11 @@ psql "dbname=postgres replication=database" -c
"IDENTIFY_SYSTEM;"
        Version <literal>3</literal> is supported only for server version 15
        and above, and it allows streaming of two-phase commits.
       </para>
+      <para>
+       Version <literal>4</literal> is supported only for server version 16
+       and above, and it allows applying stream of large in-progress
+       transactions in parallel.
+      </para>

7a.
"applying stream of" -> "applying streams of"

7b.
Actually, I'm not sure that this description is strictly correct even
to say "it allows ..." because IIUC the streaming=parallel can still
work anyway without protocol 4 – it is just some of the extended
STREAM_ABORT message members will be missing, right?

======

8. doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>parallel</literal>, incoming changes are directly
+          applied via one of the apply background workers, if available. If no
+          background worker is free to handle streaming transaction then the
+          changes are written to temporary files and applied after the
+          transaction is committed. Note that if an error happens when
+          applying changes in a background worker, the finish LSN of the
+          remote transaction might not be reported in the server log.
          </para>

"is free to handle streaming transaction"
-> "is free to handle streaming transactions"
or -> "is free to handle the streaming transaction"

======

9. src/backend/replication/logical/applybgworker.c - general

Some of the messages refer to the "worker #%u" and some refer to the
"worker %u" (without the '#'). All the messages should have a
consistent format.

~~~

10. src/backend/replication/logical/applybgworker.c - general

Search/replace all 'wstate' and change to 'winfo' or similar. See comment #2

~~~

11. src/backend/replication/logical/applybgworker.c - define

+/* Queue size of DSM, 16 MB for now. */
+#define DSM_QUEUE_SIZE (16*1024*1024)

Missing whitespace between operators

~~~

12. src/backend/replication/logical/applybgworker.c - define

+/*
+ * There are three fields in message: start_lsn, end_lsn and send_time. Because
+ * we have updated these statistics in apply worker, we could ignore these
+ * fields in apply background worker. (see function LogicalRepApplyLoop).
+ */
+#define SIZE_STATS_MESSAGE (2*sizeof(XLogRecPtr)+sizeof(TimestampTz))

12a.
"worker." -> "worker" (since the sentence already has a period at the end)

12b.
Missing whitespace between operators

~~~

13. src/backend/replication/logical/applybgworker.c - ApplyBgworkerEntry

+/*
+ * Entry for a hash table we use to map from xid to our apply background worker
+ * state.
+ */
+typedef struct ApplyBgworkerEntry

"our" -> "the"

~~~

14. src/backend/replication/logical/applybgworker.c - apply_bgworker_can_start

+ /*
+ * For streaming transactions that are being applied in apply background
+ * worker, we cannot decide whether to apply the change for a relation
+ * that is not in the READY state (see should_apply_changes_for_rel) as we
+ * won't know remote_final_lsn by that time. So, we don't start new apply
+ * background worker in this case.
+ */

14a.
"applied in apply background worker" -> "applied using an apply
background worker"

14b.
"we don't start new apply" -> "we don't start the new apply"

~~~

15. src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+/*
+ * Return the apply background worker that will be used for the specified xid.
+ *
+ * If an apply background worker is found in the free list then re-use it,
+ * otherwise start a fresh one. Cache the worker ApplyBgworkersHash keyed by
+ * the specified xid.
+ */
+ApplyBgworkerInfo *
+apply_bgworker_start(TransactionId xid)

"Cache the worker ApplyBgworkersHash" -> "Cache the worker in
ApplyBgworkersHash"

~~~

16.

+ /* Try to get a free apply background worker. */
+ if (list_length(ApplyBgworkersFreeList) > 0)

Please refer to the recent push [2] of my other patch. This code should say

if (ApplyBgworkersFreeList != NIL)

~~~

17. src/backend/replication/logical/applybgworker.c - LogicalApplyBgworkerMain

+ MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
+ MyLogicalRepWorker->reply_time = 0;
+
+ InitializeApplyWorker();

Lots of things happen within InitializeApplyWorker(). I think this
call deserves at least some comment to say it does lots of common
initialization. And the same for the other caller of this in the main
apply worker.

~~~

18. src/backend/replication/logical/applybgworker.c - apply_bgworker_setup_dsm

+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a ApplyBgworkerShared,
+ * plus one region per message queue. There are as many message queues as
+ * the number of workers.
+ */
+static bool
+apply_bgworker_setup_dsm(ApplyBgworkerInfo *wstate)

This function is now returning a bool, so it would be better for the
function comment to describe the meaning of the return value.

~~~

19.

+ /* Create the shared memory segment and establish a table of contents. */
+ seg = dsm_create(shm_toc_estimate(&e), 0);
+
+ if (seg == NULL)
+ return false;
+
+ toc = shm_toc_create(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg),
+ segsize);

This code is similar but inconsistent with other code in the function
LogicalApplyBgworkerMain

19a.
I think the whitespace should be the same as in the other function

19b.
Shouldn't the 'toc' result be checked like it was in the other function?

~~~

20. src/backend/replication/logical/applybgworker.c - apply_bgworker_setup

I think this function could be refactored to be cleaner and share more
common logic.

SUGGESTION

/* Setup shared memory, and attempt launch. */
if (apply_bgworker_setup_dsm(wstate))
{
    bool        launched;

    launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid,
                                        MySubscription->oid,
                                        MySubscription->name,
                                        MyLogicalRepWorker->userid,
                                        InvalidOid,
                                        dsm_segment_handle(wstate->dsm_seg));
    if (launched)
    {
        ApplyBgworkersList = lappend(ApplyBgworkersList, wstate);
        MemoryContextSwitchTo(oldcontext);
        return wstate;
    }
    else
    {
        dsm_detach(wstate->dsm_seg);
        wstate->dsm_seg = NULL;
    }
}

pfree(wstate);
MemoryContextSwitchTo(oldcontext);
return NULL;

~~~

21. src/backend/replication/logical/applybgworker.c -
apply_bgworker_check_status

+apply_bgworker_check_status(void)
+{
+ ListCell   *lc;
+
+ if (am_apply_bgworker() || MySubscription->stream != SUBSTREAM_PARALLEL)
+ return;

IMO it makes more sense logically for the condition to be reordered:

if (MySubscription->stream != SUBSTREAM_PARALLEL || am_apply_bgworker())

~~~

22.

This function should be renamed to 'apply_bgworker_check_state'. See
review comment #3

~~~

23. src/backend/replication/logical/applybgworker.c - apply_bgworker_set_status

This function should be renamed to 'apply_bgworker_set_state'. See
review comment #3

~~~

24. src/backend/replication/logical/applybgworker.c -
apply_bgworker_subxact_info_add

+ /*
+ * CommitTransactionCommand is needed to start a subtransaction after
+ * issuing a SAVEPOINT inside a transaction block(see
+ * StartSubTransaction()).
+ */

Missing whitespace before '('

~~~

25. src/backend/replication/logical/applybgworker.c -
apply_bgworker_savepoint_name

+/*
+ * Form the savepoint name for streaming transaction.
+ *
+ * Return the name in the supplied buffer.
+ */
+void
+apply_bgworker_savepoint_name(Oid suboid, TransactionId xid,

"name for streaming" -> "name for the streaming"

======

26. src/backend/replication/logical/launcher.c - logicalrep_worker_find

@@ -223,6 +227,13 @@ logicalrep_worker_find(Oid subid, Oid relid, bool
only_running)
  {
  LogicalRepWorker *w = &LogicalRepCtx->workers[i];

+ /*
+ * We are only interested in the main apply worker or table sync worker
+ * here.
+ */
+ if (w->main_worker_pid != 0)
+ continue;
+

IMO the comment is not very well aligned with what the code is doing.

26a.
That comment saying "We are only interested in the main apply worker
or table sync worker here." is a general statement that I think
belongs outside this loop.

26b.
And the comment just for this condition should be like the below:

SUGGESTION
Skip apply background workers.

26c.
Also, code readability would be better if it used the earlier
suggested macros. See comment #1d.

SUGGESTION
/* Skip apply background workers. */
if (isApplyBgWorker(w))
continue;
~~~

27. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

@@ -259,11 +270,11 @@ logicalrep_workers_find(Oid subid, bool only_running)
 }

 /*
- * Start new apply background worker, if possible.
+ * Start new background worker, if possible.
  */
-void
+bool
 logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname, Oid userid,
- Oid relid)
+ Oid relid, dsm_handle subworker_dsm)

This function now returns bool so the function comment probably should
describe the meaning of that return value.

~~~

28.

+ worker->main_worker_pid = is_subworker ? MyProcPid : 0;

Here is an example where I think code would benefit from the
suggestions of comments #1b, #1c.

SUGGESTION
worker->apply_leader_pid = is_subworker ? MyProcPid : InvalidPid;

~~~

29. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

+ Assert(worker->main_worker_pid == 0);

Here is an example where I think code readability would benefit from
comment #1d.

Assert(isApplyMainWorker(worker));

~~~

30. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

+ /*
+ * This is the main apply worker, stop all the apply background workers
+ * previously started from here.
+ */

"worker, stop" -> "worker; stop"

~~~

31.

+ if (w->main_worker_pid != 0)
+ logicalrep_worker_stop_internal(w);

See comment #1d.

SUGGESTION:
if (isApplyBgWorker(w)) ...

~~~

32. src/backend/replication/logical/launcher.c - logicalrep_worker_cleanup

@@ -621,6 +678,7 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
  worker->userid = InvalidOid;
  worker->subid = InvalidOid;
  worker->relid = InvalidOid;
+ worker->main_worker_pid = 0;
 }

See Comment #1c.

SUGGESTION:
worker->apply_leader_pid = InvalidPid;

~~~

33. src/backend/replication/logical/launcher.c - logicalrep_apply_bgworker_count

+ if (w->subid == subid && w->main_worker_pid != 0)
+ res++;

See comment #1d.

SUGGESTION
if (w->subid == subid && isApplyBgWorker(w))

======

34. src/backend/replication/logical/origin.c - replorigin_session_setup

@@ -1075,12 +1075,21 @@ ReplicationOriginExitCleanup(int code, Datum arg)
  * array doesn't have to be searched when calling
  * replorigin_session_advance().
  *
- * Obviously only one such cached origin can exist per process and the current
+ * Normally only one such cached origin can exist per process and the current
  * cached value can only be set again after the previous value is torn down
  * with replorigin_session_reset().
+ *
+ * However, if the function parameter 'acquired_by' is not 0, we allow the
+ * process to use the same slot already acquired by another process. It's safe
+ * because 1) The only caller (apply background workers) will maintain the
+ * commit order by allowing only one process to commit at a time, so no two
+ * workers will be operating on the same origin at the same time (see comments
+ * in logical/worker.c). 2) Even though we try to advance the session's origin
+ * concurrently, it's safe to do so as we change/advance the session_origin
+ * LSNs under replicate_state LWLock.
  */
 void
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, int acquired_by)

34a.
The comment does not actually say that acquired_by is the PID of the
owning process. It should say that.

34b.
IMO better to change the int acquired_by to type pid_t.

~~~

35.

See comment #1a.

I suggest existing replorigin_session_setup should just now be a
wrapper function that delegates to this new function and it can pass
the 'acquired_by' as 0.

e.g.

void
replorigin_session_setup(RepOriginId node)
{
replorigin_session_setup_acquired(node, 0)
}

~~

- session_replication_state->acquired_by = MyProcPid;
+ if (acquired_by == 0)
+ session_replication_state->acquired_by = MyProcPid;
+ else if (session_replication_state->acquired_by == 0)
+ elog(ERROR, "could not find replication state slot for replication"
+   "origin with OID %u which was acquired by %d", node, acquired_by);

Is that right to compare == 0?

Shouldn't this really be checking the owner is the passed 'acquired_by' slot?

e.g.

else if (session_replication_state->acquired_by != acquired_by)

======

36. src/backend/replication/logical/tablesync.c - process_syncing_tables

@@ -589,6 +590,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 void
 process_syncing_tables(XLogRecPtr current_lsn)
 {
+ if (am_apply_bgworker())
+ return;
+

Perhaps should be a comment to describe why process_syncing_tables
should be skipped for the apply background worker?

======

37. src/backend/replication/logical/worker.c - file comment

+ * 2) Write to temporary files and apply when the final commit arrives
+ *
+ * If no worker is available to handle streamed transaction, the data is
+ * written to temporary files and then applied at once when the final commit
+ * arrives.

"streamed transaction" -> "the streamed transaction"

~~~

38. src/backend/replication/logical/worker.c - should_apply_changes_for_rel

+ *
+ * Note that for streaming transactions that is being applied in apply
+ * background worker, we disallow applying changes on a table that is not in
+ * the READY state, because we cannot decide whether to apply the change as we
+ * won't know remote_final_lsn by that time.
+ *
+ * We already checked this in apply_bgworker_can_start() before assigning the
+ * streaming transaction to the background worker, but it also needs to be
+ * checked here because if the user executes ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION in parallel, the new table can be added to pg_subscription_rel
+ * in parallel to this transaction.
  */
 static bool
 should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)

38a.
"transactions that is being applied" -> "transactions that are being applied"

38b.
It is a bit confusing to keep using the word "parallel" here in the
comments (which is nothing to do with streaming=parallel mode – you
just mean *simultaneously* or *concurrently*). Perhaps the code
comment can be slightly reworded? Also, "in parallel to" doesn't sound
right.

~~~

39. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /* Not in streaming mode and not in apply background worker. */
+ if (!(in_streamed_transaction || am_apply_bgworker()))
  return false;

IMO if you wanted to write the comment in that way then the code
should have matched it more closely like:

if (!in_streamed_transaction && !am_apply_bgworker())

OTOH, if you want to keep the code as-is then the comment should be
worded slightly differently.

~~~

40.

The coding styles do not seem particularly consistent. For example,
this function (handle_streamed_transaction) uses if/else and assigns
var 'res' to be a common return. But the previous function
(should_apply_changes_for_rel) uses if/else but returns directly from
every block. If possible, I think it's better to stick to the same
pattern instead of flip/flopping coding styles for no apparent reason.

~~~

41. src/backend/replication/logical/worker.c - apply_handle_prepare_internal

  /*
- * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * We must be in transaction block to balance the EndTransactionBlock
  * called within the PrepareTransactionBlock below.
  */

I'm not sure that this comment change says anything different from the
original HEAD comment.

And even if it must be kept the grammar is wrong.

~~~

42. src/backend/replication/logical/worker.c - apply_handle_stream_commit

@@ -1468,8 +1793,8 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
- TransactionId xid;
  LogicalRepCommitData commit_data;
+ TransactionId xid;

This change is just switching the order of declarations? If not
needed, remove it.

~~~

43.

+ else
+ {
+ /* This is the main apply worker. */
+ ApplyBgworkerInfo *wstate = apply_bgworker_find(xid);

- /* unlink the files with serialized changes and subxact info */
- stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+ elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+ /*
+ * Check if we are processing this transaction in an apply background
+ * worker and if so, send the changes to that worker.
+ */
+ if (wstate)
+ {
+ /* Send STREAM COMMIT message to the apply background worker. */
+ apply_bgworker_send_data(wstate, s->len, s->data);
+
+ /*
+ * After sending the data to the apply background worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
+
+ pgstat_report_stat(false);
+ store_flush_position(commit_data.end_lsn);
+ stop_skipping_changes();
+
+ apply_bgworker_free(wstate);
+
+ /*
+ * The transaction is either non-empty or skipped, so we clear the
+ * subskiplsn.
+ */
+ clear_subscription_skip_lsn(commit_data.commit_lsn);
+ }
+ else
+ {
+ /*
+ * The transaction has been serialized to file, so replay all the
+ * spooled operations.
+ */
+ apply_spooled_messages(xid, commit_data.commit_lsn);
+
+ apply_handle_commit_internal(&commit_data);
+
+ /* Unlink the files with serialized changes and subxact info. */
+ stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+ }
+ }
+
+ /* Check the status of apply background worker if any. */
+ apply_bgworker_check_status();

  /* Process any tables that are being synchronized in parallel. */
  process_syncing_tables(commit_data.end_lsn);

43a.
AFAIK apply_bgworker_check_status() does nothing when called from an
apply background worker, so can this call be moved into the code block
where you already know it is the main apply worker?

43b.
Similarly, AFAIK process_syncing_tables() does nothing when called from
an apply background worker, so can this call be moved into the code
block where you already know it is the main apply worker?
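
Something like this, perhaps (sketch only, reusing the code from the
hunk above):

    else
    {
        /* This is the main apply worker. */
        ...

        /* Both calls are no-ops in an apply background worker. */
        apply_bgworker_check_status();
        process_syncing_tables(commit_data.end_lsn);
    }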

~~~

44. src/backend/replication/logical/worker.c - InitializeApplyWorker


+/*
+ * Initialize the databse connection, in-memory subscription and necessary
+ * config options.
+ */
 void
-ApplyWorkerMain(Datum main_arg)
+InitializeApplyWorker(void)
 {

44a.
typo "databse"

44b.
Should there be some more explanation in this comment to say that this
is common code for both the main apply workers and the apply background
workers?

44c.
Following on from #44b, consider renaming this to something like
CommonApplyWorkerInit() to emphasize it is called from multiple
places?

~~~

45. src/backend/replication/logical/worker.c - ApplyWorkerMain

- replorigin_session_setup(originid);
+ replorigin_session_setup(originid, 0);


See #1a. Then this change won't be necessary.

~~~

46. src/backend/replication/logical/worker.c - apply_error_callback

+ if (errarg->remote_attnum < 0)
+ {
+ if (XLogRecPtrIsInvalid(errarg->finish_lsn))
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" in transaction
%u",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->remote_xid);
+ else
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" in transaction
%u finished at %X/%X",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->remote_xid,
+    LSN_FORMAT_ARGS(errarg->finish_lsn));
+ }
+ else
+ {
+ if (XLogRecPtrIsInvalid(errarg->finish_lsn))
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
in transaction %u",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->rel->remoterel.attnames[errarg->remote_attnum],
+    errarg->remote_xid);
+ else
+ errcontext("processing remote data for replication origin \"%s\"
during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
in transaction %u finished at %X/%X",
+    errarg->origin_name,
+    logicalrep_message_type(errarg->command),
+    errarg->rel->remoterel.nspname,
+    errarg->rel->remoterel.relname,
+    errarg->rel->remoterel.attnames[errarg->remote_attnum],
+    errarg->remote_xid,
+    LSN_FORMAT_ARGS(errarg->finish_lsn));
+ }
+ }

Hou-san had asked [3](comment #14) me how the above code can be
shortened. Below is one idea, but maybe you won't like it ;-)

#define MSG_O_T_S_R "processing remote data for replication origin
\"%s\" during \"%s\" for replication target relation \"%s.%s\" "
#define O_T_S_R\
errarg->origin_name,\
logicalrep_message_type(errarg->command),\
errarg->rel->remoterel.nspname,\
errarg->rel->remoterel.relname

if (errarg->remote_attnum < 0)
{
if (XLogRecPtrIsInvalid(errarg->finish_lsn))
errcontext(MSG_O_T_S_R "in transaction %u",
   O_T_S_R,
   errarg->remote_xid);
else
errcontext(MSG_O_T_S_R "in transaction %u finished at %X/%X",
   O_T_S_R,
   errarg->remote_xid,
   LSN_FORMAT_ARGS(errarg->finish_lsn));
}
else
{
if (XLogRecPtrIsInvalid(errarg->finish_lsn))
errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u",
   O_T_S_R,
   errarg->rel->remoterel.attnames[errarg->remote_attnum],
   errarg->remote_xid);
else
errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u finished at %X/%X",
   O_T_S_R,
   errarg->rel->remoterel.attnames[errarg->remote_attnum],
   errarg->remote_xid,
   LSN_FORMAT_ARGS(errarg->finish_lsn));
}
#undef O_T_S_R
#undef MSG_O_T_S_R

======

47. src/include/replication/logicalproto.h

@@ -32,12 +32,17 @@
  *
  * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
  * support for two-phase commit decoding (at prepare time). Introduced in PG15.
+ *
+ * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
+ * with support for streaming large transactions using apply background
+ * workers. Introduced in PG16.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
 #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+#define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
+#define LOGICALREP_PROTO_MAX_VERSION_NUM
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM

47a.
I don't think that comment is strictly true. IIUC the new protocol
version 4 is currently only affecting the *extra* STREAM_ABORT members
– but in fact streaming=parallel is still functional without using
those extra members, isn't it? So maybe this description needed to be
modified a bit to be more accurate?

47b.
And perhaps the entire constant should be renamed to something like
LOGICALREP_PROTO_PARALLEL_STREAM_ABORT_VERSION_NUM?

======

48. src/include/replication/origin.h

-extern void replorigin_session_setup(RepOriginId node);
+extern void replorigin_session_setup(RepOriginId node, int acquired_by);

See comment #1a, #35.

IMO original should be left as-is and a new "wrapped" function added
with pid_t param.

======

49. src/include/replication/worker_internal.h

@@ -60,6 +63,12 @@ typedef struct LogicalRepWorker
  */
  FileSet    *stream_fileset;

+ /*
+ * PID of main apply worker if this slot is used for an apply background
+ * worker.
+ */
+ int main_worker_pid;
+
  /* Stats. */
  XLogRecPtr last_lsn;
  TimestampTz last_send_time;
@@ -68,8 +77,70 @@ typedef struct LogicalRepWorker
  TimestampTz reply_time;
 } LogicalRepWorker;

49a.
See my general comments #1b and #1c about this.

49b.
Also, the comment should describe both cases.

SUGGESTION
/*
 * For apply background worker - 'apply_leader_pid' is the PID of the main
 * apply worker that launched the apply background worker.
 *
 * For main apply worker - 'apply_leader_pid' is InvalidPid.
 */
pid_t apply_leader_pid;

49c.
Here is where some helpful worker macros (mentioned in comment #1d)
can be defined.

SUGGESTION
#define isApplyMainWorker(worker) (worker->apply_leader_pid == InvalidPid)
#define isApplyBgWorker(worker) (worker->apply_leader_pid != InvalidPid)

~~~

50.

+/*
+ * Status of apply background worker.
+ */
+typedef enum ApplyBgworkerStatus
+{
+ APPLY_BGWORKER_BUSY = 0, /* assigned to a transaction */
+ APPLY_BGWORKER_FINISHED, /* transaction is completed */
+ APPLY_BGWORKER_EXIT /* exit */
+} ApplyBgworkerStatus;


50a.
See general comment #3 why this enum should be renamed to ApplyBgworkerState

50b.
The comment "/* exit */" is pretty meaningless. Maybe "worker has
shutdown/exited" or similar?

50c.
In fact, I think the enum value should be APPLY_BGWORKER_EXITED

50d.
There seems no reason to explicitly assign APPLY_BGWORKER_BUSY enum value to 0.

SUGGESTION
/*
 * Apply background worker states.
 */
typedef enum ApplyBgworkerState
{
APPLY_BGWORKER_BUSY, /* assigned to a transaction */
APPLY_BGWORKER_FINISHED, /* transaction is completed */
APPLY_BGWORKER_EXITED /* worker has shutdown/exited */
} ApplyBgworkerState;

~~~

51.

+typedef struct ApplyBgworkerShared
+{
+ slock_t mutex;
+
+ /* Status of apply background worker. */
+ ApplyBgworkerStatus status;
+
+ /* Logical protocol version. */
+ uint32 proto_version;
+
+ TransactionId stream_xid;
+
+ /* Id of apply background worker */
+ uint32 worker_id;
+} ApplyBgworkerShared;

51a.
+ /* Status of apply background worker. */
+ ApplyBgworkerStatus status;

See review comment #3.

SUGGESTION:
/* Current state of the apply background worker. */
ApplyBgworkerState worker_state;

51b.
+ /* Id of apply background worker */

"Id" -> "ID" might be more usual.

~~~

52.

+/* Apply background worker setup and interactions */
+extern ApplyBgworkerInfo *apply_bgworker_start(TransactionId xid);
+extern ApplyBgworkerInfo *apply_bgworker_find(TransactionId xid);
+extern void apply_bgworker_wait_for(ApplyBgworkerInfo *wstate,
+ ApplyBgworkerStatus wait_for_status);
+extern void apply_bgworker_send_data(ApplyBgworkerInfo *wstate, Size nbytes,
+ const void *data);
+extern void apply_bgworker_free(ApplyBgworkerInfo *wstate);
+extern void apply_bgworker_check_status(void);
+extern void apply_bgworker_set_status(ApplyBgworkerStatus status);
+extern void apply_bgworker_subxact_info_add(TransactionId current_xid);
+extern void apply_bgworker_savepoint_name(Oid suboid, Oid relid,
+   char *spname, int szsp);

This big block of similarly named externs might as well be in
alphabetical order instead of apparently random.

~~~

53.

+static inline bool
+am_apply_bgworker(void)
+{
+ return MyLogicalRepWorker->main_worker_pid != 0;
+}

This can be simplified/improved using the new macros as previously
suggested in #1d.

SUGGESTION
static inline bool
am_apply_bgworker(void)
{
return isApplyBgWorker(MyLogicalRepWorker);
}

====

54. src/tools/pgindent/typedefs.list

 AppendState
+ApplyBgworkerEntry
+ApplyBgworkerShared
+ApplyBgworkerInfo
+ApplyBgworkerStatus
 ApplyErrorCallbackArg

Please rearrange these into alphabetical order.

------
[1]
https://softwareengineering.stackexchange.com/questions/219351/state-or-status-when-should-a-variable-name-contain-the-word-state-and-w#:~:text=status%20is%20used%20to%20describe,(e.g.%20pending%2Fdispatched)
[2] https://github.com/postgres/postgres/commit/efd0c16becbf45e3b0215e124fde75fee8fcbce4
[3]
https://www.postgresql.org/message-id/OS0PR01MB57169AEA399C6DC370EAF23B94649%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Hi Wang-san,

FYI, I also checked the latest patch v23-0001 but found that the
v21-0001/v23-0001 differences are minimal, so all my v21* review
comments are still applicable for the patch v23-0001.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Aug 18, 2022 at 11:59 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for patch v21-0001:
>
> 4. Commit message
>
> In addition, the patch extends the logical replication STREAM_ABORT message so
> that abort_time and abort_lsn can also be sent which can be used to update the
> replication origin in apply background worker when the streaming transaction is
> aborted.
>
> 4a.
> Should this para also mention something about the introduction of
> protocol version 4?
>
> 4b.
> Should this para also mention that these extensions are not strictly
> mandatory for the parallel streaming to still work?
>

Without parallel streaming/apply, we don't need to send this extra
message. So, I don't think it will be correct to say that.

>
> 46. src/backend/replication/logical/worker.c - apply_error_callback
>
> + if (errarg->remote_attnum < 0)
> + {
> + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" in transaction
> %u",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->remote_xid);
> + else
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" in transaction
> %u finished at %X/%X",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->remote_xid,
> +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> + }
> + else
> + {
> + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> in transaction %u",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> +    errarg->remote_xid);
> + else
> + errcontext("processing remote data for replication origin \"%s\"
> during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> in transaction %u finished at %X/%X",
> +    errarg->origin_name,
> +    logicalrep_message_type(errarg->command),
> +    errarg->rel->remoterel.nspname,
> +    errarg->rel->remoterel.relname,
> +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> +    errarg->remote_xid,
> +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> + }
> + }
>
> Hou-san had asked [3](comment #14) me how the above code can be
> shortened. Below is one idea, but maybe you won't like it ;-)
>
> #define MSG_O_T_S_R "processing remote data for replication origin
> \"%s\" during \"%s\" for replication target relation \"%s.%s\" "
> #define O_T_S_R\
> errarg->origin_name,\
> logicalrep_message_type(errarg->command),\
> errarg->rel->remoterel.nspname,\
> errarg->rel->remoterel.relname
>
> if (errarg->remote_attnum < 0)
> {
> if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> errcontext(MSG_O_T_S_R "in transaction %u",
>    O_T_S_R,
>    errarg->remote_xid);
> else
> errcontext(MSG_O_T_S_R "in transaction %u finished at %X/%X",
>    O_T_S_R,
>    errarg->remote_xid,
>    LSN_FORMAT_ARGS(errarg->finish_lsn));
> }
> else
> {
> if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u",
>    O_T_S_R,
>    errarg->rel->remoterel.attnames[errarg->remote_attnum],
>    errarg->remote_xid);
> else
> errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u finished at %X/%X",
>    O_T_S_R,
>    errarg->rel->remoterel.attnames[errarg->remote_attnum],
>    errarg->remote_xid,
>    LSN_FORMAT_ARGS(errarg->finish_lsn));
> }
> #undef O_T_S_R
> #undef MSG_O_T_S_R
>
> ======
>

I don't like this much. I think this reduces readability.

> 47. src/include/replication/logicalproto.h
>
> @@ -32,12 +32,17 @@
>   *
>   * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
>   * support for two-phase commit decoding (at prepare time). Introduced in PG15.
> + *
> + * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
> + * with support for streaming large transactions using apply background
> + * workers. Introduced in PG16.
>   */
>  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
>  #define LOGICALREP_PROTO_VERSION_NUM 1
>  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
> -#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
> +#define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
>
> 47a.
> I don't think that comment is strictly true. IIUC the new protocol
> version 4 is currently only affecting the *extra* STREAM_ABORT members
> – but in fact streaming=parallel is still functional without using
> those extra members, isn't it? So maybe this description needed to be
> modified a bit to be more accurate?
>

The reason for sending this extra abort members is to ensure that
after aborting the transaction, if the subscriber/apply worker
restarts, it doesn't need to request the transaction again. Do you
have suggestions for improving this comment?

>
> 52.
>
> +/* Apply background worker setup and interactions */
> +extern ApplyBgworkerInfo *apply_bgworker_start(TransactionId xid);
> +extern ApplyBgworkerInfo *apply_bgworker_find(TransactionId xid);
> +extern void apply_bgworker_wait_for(ApplyBgworkerInfo *wstate,
> + ApplyBgworkerStatus wait_for_status);
> +extern void apply_bgworker_send_data(ApplyBgworkerInfo *wstate, Size nbytes,
> + const void *data);
> +extern void apply_bgworker_free(ApplyBgworkerInfo *wstate);
> +extern void apply_bgworker_check_status(void);
> +extern void apply_bgworker_set_status(ApplyBgworkerStatus status);
> +extern void apply_bgworker_subxact_info_add(TransactionId current_xid);
> +extern void apply_bgworker_savepoint_name(Oid suboid, Oid relid,
> +   char *spname, int szsp);
>
> This big block of similarly named externs might as well be in
> alphabetical order instead of apparently random.
>

I think it is better to order them based on related functionality (if
they are not already) instead of using alphabetical order.

--
With Regards,
Amit Kapila.



On Thu, Aug 18, 2022 at 6:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 18, 2022 at 11:59 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for patch v21-0001:
> >
> > 4. Commit message
> >
> > In addition, the patch extends the logical replication STREAM_ABORT message so
> > that abort_time and abort_lsn can also be sent which can be used to update the
> > replication origin in apply background worker when the streaming transaction is
> > aborted.
> >
> > 4a.
> > Should this para also mention something about the introduction of
> > protocol version 4?
> >
> > 4b.
> > Should this para also mention that these extensions are not strictly
> > mandatory for the parallel streaming to still work?
> >
>
> Without parallel streaming/apply, we don't need to send this extra
> message. So, I don't think it will be correct to say that.

See my reply to 47a below.

>
> >
> > 46. src/backend/replication/logical/worker.c - apply_error_callback
> >
> > + if (errarg->remote_attnum < 0)
> > + {
> > + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > + errcontext("processing remote data for replication origin \"%s\"
> > during \"%s\" for replication target relation \"%s.%s\" in transaction
> > %u",
> > +    errarg->origin_name,
> > +    logicalrep_message_type(errarg->command),
> > +    errarg->rel->remoterel.nspname,
> > +    errarg->rel->remoterel.relname,
> > +    errarg->remote_xid);
> > + else
> > + errcontext("processing remote data for replication origin \"%s\"
> > during \"%s\" for replication target relation \"%s.%s\" in transaction
> > %u finished at %X/%X",
> > +    errarg->origin_name,
> > +    logicalrep_message_type(errarg->command),
> > +    errarg->rel->remoterel.nspname,
> > +    errarg->rel->remoterel.relname,
> > +    errarg->remote_xid,
> > +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> > + }
> > + else
> > + {
> > + if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > + errcontext("processing remote data for replication origin \"%s\"
> > during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> > in transaction %u",
> > +    errarg->origin_name,
> > +    logicalrep_message_type(errarg->command),
> > +    errarg->rel->remoterel.nspname,
> > +    errarg->rel->remoterel.relname,
> > +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> > +    errarg->remote_xid);
> > + else
> > + errcontext("processing remote data for replication origin \"%s\"
> > during \"%s\" for replication target relation \"%s.%s\" column \"%s\"
> > in transaction %u finished at %X/%X",
> > +    errarg->origin_name,
> > +    logicalrep_message_type(errarg->command),
> > +    errarg->rel->remoterel.nspname,
> > +    errarg->rel->remoterel.relname,
> > +    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> > +    errarg->remote_xid,
> > +    LSN_FORMAT_ARGS(errarg->finish_lsn));
> > + }
> > + }
> >
> > Hou-san had asked [3](comment #14) me how the above code can be
> > shortened. Below is one idea, but maybe you won't like it ;-)
> >
> > #define MSG_O_T_S_R "processing remote data for replication origin
> > \"%s\" during \"%s\" for replication target relation \"%s.%s\" "
> > #define O_T_S_R\
> > errarg->origin_name,\
> > logicalrep_message_type(errarg->command),\
> > errarg->rel->remoterel.nspname,\
> > errarg->rel->remoterel.relname
> >
> > if (errarg->remote_attnum < 0)
> > {
> > if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > errcontext(MSG_O_T_S_R "in transaction %u",
> >    O_T_S_R,
> >    errarg->remote_xid);
> > else
> > errcontext(MSG_O_T_S_R "in transaction %u finished at %X/%X",
> >    O_T_S_R,
> >    errarg->remote_xid,
> >    LSN_FORMAT_ARGS(errarg->finish_lsn));
> > }
> > else
> > {
> > if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u",
> >    O_T_S_R,
> >    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> >    errarg->remote_xid);
> > else
> > errcontext(MSG_O_T_S_R "column \"%s\" in transaction %u finished at %X/%X",
> >    O_T_S_R,
> >    errarg->rel->remoterel.attnames[errarg->remote_attnum],
> >    errarg->remote_xid,
> >    LSN_FORMAT_ARGS(errarg->finish_lsn));
> > }
> > #undef O_T_S_R
> > #undef MSG_O_T_S_R
> >
> > ======
> >
>
> I don't like this much. I think this reduces readability.

I agree. That wasn't a very serious suggestion :-)

>
> > 47. src/include/replication/logicalproto.h
> >
> > @@ -32,12 +32,17 @@
> >   *
> >   * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
> >   * support for two-phase commit decoding (at prepare time). Introduced in PG15.
> > + *
> > + * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
> > + * with support for streaming large transactions using apply background
> > + * workers. Introduced in PG16.
> >   */
> >  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> >  #define LOGICALREP_PROTO_VERSION_NUM 1
> >  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> >  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
> > -#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
> > +#define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> > +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
> >
> > 47a.
> > I don't think that comment is strictly true. IIUC the new protocol
> > version 4 is currently only affecting the *extra* STREAM_ABORT members
> > – but in fact streaming=parallel is still functional without using
> > those extra members, isn't it? So maybe this description needed to be
> > modified a bit to be more accurate?
> >
>
> The reason for sending this extra abort members is to ensure that
> after aborting the transaction, if the subscriber/apply worker
> restarts, it doesn't need to request the transaction again. Do you
> have suggestions for improving this comment?
>

I gave three review comments for v21-0001 that were all related to
this same point:
i- #4b (commit message)
ii- #7 (protocol pgdocs)
iii- #47a (code comment)

The point was:
AFAIK protocol 4 is only to let the parallel streaming logic behave
*better* in how it can handle restarts after aborts. But that does not
mean that protocol 4 is a *pre-requisite* for "allowing"
streaming=parallel to work in the first place. I thought that a PG15
publisher and PG16 subscriber can still work using streaming=parallel
even with protocol 3, but it just won't be quite as clever for
handling restarts after abort as protocol 4 (PG16 -> PG16) would be.

If the above is correct, then the code comment can be changed to
something like this:

BEFORE
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol
version with support for streaming large transactions using apply
background workers. Introduced in PG16.

AFTER
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM improves how subscription
parameter streaming=parallel (introduced in PG16) will handle restarts
after aborts. Introduced in PG16.

~

The protocol pgdocs might be changed similarly...

BEFORE
Version <literal>4</literal> is supported only for server version 16
and above, and it allows applying stream of large in-progress
transactions in parallel.

AFTER
Version <literal>4</literal> is supported only for server version 16
and above, and it improves how subscription parameter
streaming=parallel (introduced in PG16) will handle restarts after
aborts.

~~

And similar text again for the commit message...

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



On Thu, Aug 18, 2022 at 3:40 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Aug 18, 2022 at 6:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 47. src/include/replication/logicalproto.h
> > >
> > > @@ -32,12 +32,17 @@
> > >   *
> > >   * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
> > >   * support for two-phase commit decoding (at prepare time). Introduced in PG15.
> > > + *
> > > + * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
> > > + * with support for streaming large transactions using apply background
> > > + * workers. Introduced in PG16.
> > >   */
> > >  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > >  #define LOGICALREP_PROTO_VERSION_NUM 1
> > >  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > >  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
> > > -#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
> > > +#define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> > > +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
> > >
> > > 47a.
> > > I don't think that comment is strictly true. IIUC the new protocol
> > > version 4 is currently only affecting the *extra* STREAM_ABORT members
> > > – but in fact streaming=parallel is still functional without using
> > > those extra members, isn't it? So maybe this description needed to be
> > > modified a bit to be more accurate?
> > >
> >
> > The reason for sending this extra abort members is to ensure that
> > after aborting the transaction, if the subscriber/apply worker
> > restarts, it doesn't need to request the transaction again. Do you
> > have suggestions for improving this comment?
> >
>
> I gave three review comments for v21-0001 that were all related to
> this same point:
> i- #4b (commit message)
> ii- #7 (protocol pgdocs)
> iii- #47a (code comment)
>
> The point was:
> AFAIK protocol 4 is only to let the parallel streaming logic behave
> *better* in how it can handle restarts after aborts. But that does not
> mean that protocol 4 is a *pre-requisite* for "allowing"
> streaming=parallel to work in the first place. I thought that a PG15
> publisher and PG16 subscriber can still work using streaming=parallel
> even with protocol 3, but it just won't be quite as clever for
> handling restarts after abort as protocol 4 (PG16 -> PG16) would be.
>

It is not only that it makes things better; one could say it breaks the
replication protocol if, after the client (subscriber) has applied some
transaction, it requests the same transaction again. So, I think it is
better to make the parallelism work only when the server version is
also 16.

--
With Regards,
Amit Kapila.



On Wed, Aug 17, 2022 at 11:58 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patches.
>

Few comments on v23-0001
=======================
1.
+ /*
+ * Attach to the dynamic shared memory segment for the parallel query, and
+ * find its table of contents.
+ *
+ * Note: at this point, we have not created any ResourceOwner in this
+ * process.  This will result in our DSM mapping surviving until process
+ * exit, which is fine.  If there were a ResourceOwner, it would acquire
+ * ownership of the mapping, but we have no need for that.
+ */

In the first sentence, instead of "parallel query", you need to use
"parallel apply". I think we don't need to repeat the entire note that
we already have in ParallelWorkerMain. You can say something like:
"Like parallel query, we don't need a resource owner at this point. See
ParallelWorkerMain."

2.
+/*
+ * There are three fields in message: start_lsn, end_lsn and send_time. Because
+ * we have updated these statistics in apply worker, we could ignore these
+ * fields in apply background worker. (see function LogicalRepApplyLoop).
+ */
+#define SIZE_STATS_MESSAGE (2*sizeof(XLogRecPtr)+sizeof(TimestampTz))

The first sentence in the above comment isn't clear about which
message it is talking about. I think it is about any message received
by this apply bgworker; if so, can we change it to: "There are three
fields in each message received by the apply worker: start_lsn,
end_lsn, and send_time."?

3.
+/*
+ * Return the apply background worker that will be used for the specified xid.
+ *
+ * If an apply background worker is found in the free list then re-use it,
+ * otherwise start a fresh one. Cache the worker ApplyBgworkersHash keyed by
+ * the specified xid.
+ */
+ApplyBgworkerInfo *
+apply_bgworker_start(TransactionId xid)

The first sentence should say apply background worker info. Can we
change the cache-related sentence in the above comment to "Cache the
worker info in ApplyBgworkersHash keyed by the specified xid."?

4.
/*
+ * We use first byte of message for additional communication between
+ * main Logical replication worker and apply background workers, so if
+ * it differs from 'w', then process it first.
+ */
+ c = pq_getmsgbyte(&s);
+ switch (c)
+ {
+ /* End message of streaming chunk */
+ case LOGICAL_REP_MSG_STREAM_STOP:
+ elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk, "
+ "waiting on shm_mq_receive", shared->worker_id);
+

Why do we need special handling of the LOGICAL_REP_MSG_STREAM_STOP message
here? Instead, why not let it get handled via the apply_dispatch path? You
would still require special handling for the apply bgworker, but I see
other messages have similar handling.

5.
+ /*
+ * Now, we have initialized DSM. Attach to slot.
+ */
+ logicalrep_worker_attach(worker_slot);

Can we change this comment to something like: "Primary initialization
is complete. Now, we can attach to our slot.". IIRC, we have done it
after initialization to avoid some race conditions among leader apply
worker and this parallel apply worker. If so, can we explain the same
in the comments?

6.
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a ApplyBgworkerShared,
+ * plus one region per message queue. There are as many message queues as
+ * the number of workers.
+ */
+static bool
+apply_bgworker_setup_dsm(ApplyBgworkerInfo *wstate)

I think the part of the comment: "There are as many message queues as
the number of workers." doesn't seem to fit atop this function as this
has nothing to do with the number of workers. It would be a good idea
to write something about what all is communicated via DSM in the
description you have written about apply bg workers in worker.c.

7.
+ /* Check if there are free worker slot(s). */
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+ napplyworkers = logicalrep_apply_bgworker_count(MyLogicalRepWorker->subid);
+ LWLockRelease(LogicalRepWorkerLock);
+
+ if (napplyworkers >= max_apply_bgworkers_per_subscription)
+ return NULL;

Won't it be better to check this restriction in
logicalrep_worker_launch() as we do for tablesync workers? That way
all similar restrictions will be in one place.

8.
+ if (rel->state != SUBREL_STATE_READY)
+ ereport(ERROR,
+ (errmsg("logical replication apply workers for subscription \"%s\"
will restart",
+ MySubscription->name),
+ errdetail("Cannot handle streamed replication transaction by apply "
+    "background workers until all tables are synchronized")));

errdetail messages always end with a full stop.
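
e.g. just add the terminating period:

errdetail("Cannot handle streamed replication transaction by apply "
          "background workers until all tables are synchronized.")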


-- 
With Regards,
Amit Kapila.



Hi Wang-san,

Here is some more information about my v21-0001 review [2] posted yesterday.

~~

If the streaming=parallel will be disallowed for publishers not using
protocol 4 (see Amit's post [1]), then please ignore all my previous
review comments about the protocol descriptions (see [2] comments #4b,
#7b, #47a, #47b).

~~

Also, I was having second thoughts about the name replacement for the
'main_worker_pid' member (see [2] comments #1b, #49). Previously I
suggested 'apply_leader_pid', but now I think something like
'apply_bgworker_leader_pid' would be better. (It's a bit verbose, but
now it gives the proper understanding that only an apply bgworker can
have a valid value for this member).

------
[1] https://www.postgresql.org/message-id/CAA4eK1JR2GR9jjaz9T1ZxzgLVS0h089EE8ZB%3DF2EsVHbM_5sfA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAHut%2BPuxEQ88PDhFcBftnNY1BAjdj_9G8FYhTvPHKjP8yfacaQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Aug 19, 2022 at 4:36 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Wang-san,
>
> Here is some more information about my v21-0001 review [2] posted yesterday.
>
> ~~
>
> If the streaming=parallel will be disallowed for publishers not using
> protocol 4 (see Amit's post [1]), then please ignore all my previous
> review comments about the protocol descriptions (see [2] comments #4b,
> #7b, #47a, #47b).
>
> ~~
>
> Also, I was having second thoughts about the name replacement for the
> 'main_worker_pid' member (see [2] comments #1b, #49). Previously I
> suggested 'apply_leader_pid', but now I think something like
> 'apply_bgworker_leader_pid' would be better. (It's a bit verbose, but
> now it gives the proper understanding that only an apply bgworker can
> have a valid value for this member).
>

I find your previous suggestion to name it 'apply_leader_pid' better.
To me, it conveys the intended meaning.


-- 
With Regards,
Amit Kapila.



Here are my review comments for the patch v23-0003:

======

3.1. src/backend/replication/logical/applybgworker.c -
apply_bgworker_relation_check

+ * Although the commit order is maintained by only allowing one process to
+ * commit at a time, the access order to the relation has changed. This could
+ * cause unexpected problems if the unique column on the replicated table is
+ * inconsistent with the publisher-side or contains non-immutable functions
+ * when applying transactions using an apply background worker.
+ */
+void
+apply_bgworker_relation_check(LogicalRepRelMapEntry *rel)

I’m not sure, but should that second sentence be rearranged as follows?

SUGGESTION
This could cause unexpected problems when applying transactions using
an apply background worker if the unique column on the replicated
table is inconsistent with the publisher-side, or if the relation
contains non-immutable functions.

~~~

3.2.

+ if (!am_apply_bgworker() &&
+ (list_length(ApplyBgworkersFreeList) == list_length(ApplyBgworkersList)))
+ return;

Previously I posted I was struggling to understand the above
condition, and then it was explained (see [1] comment #4) that:
> We need to check this for apply bgworker. (Both lists are "NIL" in apply bgworker.)

I think that information should be included in the code comment.
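
Maybe something like this (only a sketch; the wording is up to you):

/*
 * Skip the check if we are the leader apply worker and no apply background
 * worker is currently in use (i.e. all of them are in the free list).  Note
 * that in an apply background worker both lists are NIL, which also makes
 * the lengths equal, so test am_apply_bgworker() explicitly to keep the
 * check from being skipped there.
 */
if (!am_apply_bgworker() &&
	(list_length(ApplyBgworkersFreeList) == list_length(ApplyBgworkersList)))
	return;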

======

3.3. src/include/replication/logicalrelation.h

+/*
+ * States to determine if changes on one relation can be applied using an
+ * apply background worker.
+ */
+typedef enum ParallelApplySafety
+{
+ PARALLEL_APPLY_UNKNOWN = 0,
+ PARALLEL_APPLY_SAFE,
+ PARALLEL_APPLY_UNSAFE
+} ParallelApplySafety;
+

3.3a.
The enum value PARALLEL_APPLY_UNKNOWN doesn't really mean anything.
Maybe naming it PARALLEL_APPLY_SAFETY_UNKNOWN gives it the intended
meaning.

3.3b.
+ PARALLEL_APPLY_UNKNOWN = 0,
I didn't see any reason to explicitly assign this to 0.
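
i.e. something like:

typedef enum ParallelApplySafety
{
	PARALLEL_APPLY_SAFETY_UNKNOWN,
	PARALLEL_APPLY_SAFE,
	PARALLEL_APPLY_UNSAFE
} ParallelApplySafety;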

~~~

3.4. src/include/replication/logicalrelation.h

@@ -31,6 +42,8 @@ typedef struct LogicalRepRelMapEntry
  Relation localrel; /* relcache entry (NULL when closed) */
  AttrMap    *attrmap; /* map of local attributes to remote ones */
  bool updatable; /* Can apply updates/deletes? */
+ ParallelApplySafety parallel_apply; /* Can apply changes in an apply
+

(Similar to above comment #3.3a)

The member name 'parallel_apply' doesn't really mean anything. Perhaps
renaming this to 'parallel_apply_safe' or 'parallel_safe' etc will
give it the intended meaning.

------
[1]
https://www.postgresql.org/message-id/OS3PR01MB6275739E73E8BEC5D13FB6739E6B9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Here are some review comments for the patch v23-0004:

======

4.1 src/test/subscription/t/032_streaming_apply.pl

This test file was introduced in patch 0003, but I think there are a
few changes in this 0004 patch which really have nothing to do
with 0004 and should have been included in the original 0003.

e.g. There are multiple comments like below - these belong back in the
0003 patch
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.
# Wait for this streaming transaction to be applied in the apply worker.

~~~

4.2

@@ -166,17 +175,6 @@ CREATE TRIGGER tri_tab1_unsafe
 BEFORE INSERT ON public.test_tab1
 FOR EACH ROW EXECUTE PROCEDURE trigger_func_tab1_unsafe();
 ALTER TABLE test_tab1 ENABLE REPLICA TRIGGER tri_tab1_unsafe;
-
-CREATE FUNCTION trigger_func_tab1_safe() RETURNS TRIGGER AS \$\$
-  BEGIN
-    RAISE NOTICE 'test for safe trigger function';
- RETURN NEW;
-  END
-\$\$ language plpgsql;
-ALTER FUNCTION trigger_func_tab1_safe IMMUTABLE;
-CREATE TRIGGER tri_tab1_safe
-BEFORE INSERT ON public.test_tab1
-FOR EACH ROW EXECUTE PROCEDURE trigger_func_tab1_safe();
 });

I didn't understand why all this trigger_func_tab1_safe which was
added in patch 0003 is now getting removed in patch 0004. Maybe there
is some good reason, but it doesn't seem right to be adding code in
one patch and then removing it again in the next patch.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Aug 18, 2022 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 17, 2022 at 11:58 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > Attach the new patches.
> >
>
> Few comments on v23-0001
> =======================
>

Some more comments on v23-0001
============================
1.
static bool
 handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
{
...
- /* not in streaming mode */
- if (!in_streamed_transaction)
+ /* Not in streaming mode and not in apply background worker. */
+ if (!(in_streamed_transaction || am_apply_bgworker()))
  return false;

This check appears a bit strange because ideally in bgworker
in_streamed_transaction should be false. I think we should set
in_streamed_transaction to true in apply_handle_stream_start() only
when we are going to write to file. Is there a reason for not doing
the same?

2.
+ {
+ /* This is the main apply worker. */
+ ApplyBgworkerInfo *wstate = apply_bgworker_find(xid);
+
+ /*
+ * Check if we are processing this transaction using an apply
+ * background worker and if so, send the changes to that worker.
+ */
+ if (wstate)
+ {
+ /* Send STREAM ABORT message to the apply background worker. */
+ apply_bgworker_send_data(wstate, s->len, s->data);

Why do some places in the patch separately fetch the
ApplyBgworkerInfo whereas other places directly use
stream_apply_worker to pass the data to the bgworker?

3. Why apply_handle_stream_abort() or apply_handle_stream_prepare()
doesn't use apply_bgworker_active() to identify whether it needs to
send the information to bgworker?

4. In apply_handle_stream_prepare(), apply_handle_stream_abort(), and
some other similar functions, the patch handles three cases (a) apply
background worker, (b) sending data to bgworker, (c) handling for
streamed transaction in apply worker. I think the code will look
better if you move the respective code for all three cases into
separate functions. Surely, if the code to deal with each of the cases
is less then we don't need to move it to a separate function.

5.
@@ -1088,24 +1177,78 @@ apply_handle_stream_prepare(StringInfo s)
{
...
+ in_remote_transaction = false;
+
+ /* Unlink the files with serialized changes and subxact info. */
+ stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
+ }
+ }

  in_remote_transaction = false;
...

We don't need to set in_remote_transaction to false in multiple places.

6.
@@ -1177,36 +1311,93 @@ apply_handle_stream_start(StringInfo s)
{
...
...
+ if (am_apply_bgworker())
  {
- MemoryContext oldctx;
-
- oldctx = MemoryContextSwitchTo(ApplyContext);
+ /*
+ * Make sure the handle apply_dispatch methods are aware we're in a
+ * remote transaction.
+ */
+ in_remote_transaction = true;

- MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
- FileSetInit(MyLogicalRepWorker->stream_fileset);
+ /* Begin the transaction. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();

- MemoryContextSwitchTo(oldctx);
+ StartTransactionCommand();
+ BeginTransactionBlock();
+ CommitTransactionCommand();
  }
...

Why do we need to start a transaction here? Why can't it be done via
begin_replication_step() during the first operation apply? Is it
because we may need to define a save point in bgworker and we don't have
that information beforehand? If so, then also, can't it be handled by
begin_replication_step() either by explicitly passing the information
or checking it there and then starting a transaction block? In any
case, please add a few comments to explain why this separate handling
is required for bgworker?

7. When we are already setting bgworker status as APPLY_BGWORKER_BUSY
in apply_bgworker_setup_dsm() then why do we need to set it again in
apply_bgworker_start()?

8. It is not clear to me how the APPLY_BGWORKER_EXIT status is used. Is it
required for the cases where a bgworker exits due to some error and
then the apply worker uses it to detect that and exits? How would other
bgworkers notice this, is it done via
apply_bgworker_check_status()?

-- 
With Regards,
Amit Kapila.



Here are my review comments for the v23-0005 patch:

======

Commit Message says:
main_worker_pid is Process ID of the main apply worker, if this process is a
apply background worker. NULL if this process is a main apply worker or a
synchronization worker.
The new column can make it easier to distinguish main apply worker and apply
background worker.

--

Having a column called ‘main_worker_pid’ which is defined to be NULL
if the process *is* the main apply worker does not make any sense to
me.

IMO it feels hacky trying to squeeze meaning out of the
'main_worker_pid' member of the LogicalRepWorker like this.

If the intention really is to make it easier to distinguish the
different kinds of subscription workers then surely there are much
better ways to achieve that. For example, why not introduce a new
'type' enum member of the LogicalRepWorker (e.g.
WORKER_TYPE_TABLESYNC='t', WORKER_TYPE_APPLY='a',
WORKER_TYPE_PARALLEL_APPLY='p'), then use some char column to expose
it? As a bonus, I think the other code (i.e. patch 0001) will also be
improved if a 'type' member is added.
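
For example, a rough sketch of what I mean (the names here are only
placeholders):

typedef enum LogicalRepWorkerType
{
	WORKER_TYPE_TABLESYNC = 't',
	WORKER_TYPE_APPLY = 'a',
	WORKER_TYPE_PARALLEL_APPLY = 'p'
} LogicalRepWorkerType;

Then the view could simply expose that member as a single-char column.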

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for the v23-0005 patch:
>
> ======
>
> Commit Message says:
> main_worker_pid is Process ID of the main apply worker, if this process is a
> apply background worker. NULL if this process is a main apply worker or a
> synchronization worker.
> The new column can make it easier to distinguish main apply worker and apply
> background worker.
>
> --
>
> Having a column called ‘main_worker_pid’ which is defined to be NULL
> if the process *is* the main apply worker does not make any sense to
> me.
>

I haven't read this part of a patch but it seems to me we have
something similar for parallel query workers. Refer 'leader_pid'
column in pg_stat_activity.

--
With Regards,
Amit Kapila.



On Fri, Aug 19, 2022 at 7:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for the v23-0005 patch:
> >
> > ======
> >
> > Commit Message says:
> > main_worker_pid is Process ID of the main apply worker, if this process is a
> > apply background worker. NULL if this process is a main apply worker or a
> > synchronization worker.
> > The new column can make it easier to distinguish main apply worker and apply
> > background worker.
> >
> > --
> >
> > Having a column called ‘main_worker_pid’ which is defined to be NULL
> > if the process *is* the main apply worker does not make any sense to
> > me.
> >
>
> I haven't read this part of a patch but it seems to me we have
> something similar for parallel query workers. Refer 'leader_pid'
> column in pg_stat_activity.
>

IIUC (from the patch 0005 commit message) the intention is to be able
to easily distinguish the worker types.

I thought using a leader PID (by whatever name) seemed a poor way to
achieve that in this case because the PID is either NULL or not NULL,
but there are 3 kinds of subscription workers, not 2.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Aug 19, 2022 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 7:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Here are my review comments for the v23-0005 patch:
> > >
> > > ======
> > >
> > > Commit Message says:
> > > main_worker_pid is Process ID of the main apply worker, if this process is a
> > > apply background worker. NULL if this process is a main apply worker or a
> > > synchronization worker.
> > > The new column can make it easier to distinguish main apply worker and apply
> > > background worker.
> > >
> > > --
> > >
> > > Having a column called ‘main_worker_pid’ which is defined to be NULL
> > > if the process *is* the main apply worker does not make any sense to
> > > me.
> > >
> >
> > I haven't read this part of a patch but it seems to me we have
> > something similar for parallel query workers. Refer 'leader_pid'
> > column in pg_stat_activity.
> >
>
> IIUC (from the patch 0005 commit message) the intention is to be able
> to easily distinguish the worker types.
>

I think it is only to distinguish between leader apply worker and
background apply workers. The tablesync worker can be distinguished
based on relid field.

--
With Regards,
Amit Kapila.



On Fri, Aug 19, 2022 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 7:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > Here are my review comments for the v23-0005 patch:
> > > >
> > > > ======
> > > >
> > > > Commit Message says:
> > > > main_worker_pid is Process ID of the main apply worker, if this process is a
> > > > apply background worker. NULL if this process is a main apply worker or a
> > > > synchronization worker.
> > > > The new column can make it easier to distinguish main apply worker and apply
> > > > background worker.
> > > >
> > > > --
> > > >
> > > > Having a column called ‘main_worker_pid’ which is defined to be NULL
> > > > if the process *is* the main apply worker does not make any sense to
> > > > me.
> > > >
> > >
> > > I haven't read this part of a patch but it seems to me we have
> > > something similar for parallel query workers. Refer 'leader_pid'
> > > column in pg_stat_activity.
> > >
> >
> > IIUC (from the patch 0005 commit message) the intention is to be able
> > to easily distinguish the worker types.
> >
>
> I think it is only to distinguish between leader apply worker and
> background apply workers. The tablesync worker can be distinguished
> based on relid field.
>

Right. But that's the reason for my question in the first place - why
implement the patch so that the user still has to jump through hoops
just to know the worker type information?

e.g.

Option 1 (patch) - if there is a non-NULL relid field then the worker
type must be a tablesync worker; otherwise, if there is a non-NULL
leader_pid field then the worker type must be an apply background
worker; otherwise the worker type must be an apply main worker.

versus

Option 2 - new worker_type field (values 't','p','a')

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, Aug 22, 2022 at 4:42 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Fri, Aug 19, 2022 at 7:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > >
> > > > > Here are my review comments for the v23-0005 patch:
> > > > >
> > > > > ======
> > > > >
> > > > > Commit Message says:
> > > > > main_worker_pid is Process ID of the main apply worker, if this process is a
> > > > > apply background worker. NULL if this process is a main apply worker or a
> > > > > synchronization worker.
> > > > > The new column can make it easier to distinguish main apply worker and apply
> > > > > background worker.
> > > > >
> > > > > --
> > > > >
> > > > > Having a column called ‘main_worker_pid’ which is defined to be NULL
> > > > > if the process *is* the main apply worker does not make any sense to
> > > > > me.
> > > > >
> > > >
> > > > I haven't read this part of a patch but it seems to me we have
> > > > something similar for parallel query workers. Refer 'leader_pid'
> > > > column in pg_stat_activity.
> > > >
> > >
> > > IIUC (from the patch 0005 commit message) the intention is to be able
> > > to easily distinguish the worker types.
> > >
> >
> > I think it is only to distinguish between leader apply worker and
> > background apply workers. The tablesync worker can be distinguished
> > based on relid field.
> >
>
> Right. But that's the reason for my question in the first place - why
> implement the patch so that the user still has to jump through hoops
> just to know the worker type information?
>

I think it is not only to judge worker type but also to know the pid
of each of the workers during parallel apply. Isn't it better to have
both main apply worker pid and parallel apply worker pid as we have
for the parallel query system?

--
With Regards,
Amit Kapila.



From: "kuroda.hayato@fujitsu.com"

Dear Wang,

Thank you for updating the patch! Following are comments about v23-0001 and v23-0005.

v23-0001

01. logical-replication.sgml

+  <para>
+   When the streaming mode is <literal>parallel</literal>, the finish LSN of
+   failed transactions may not be logged. In that case, it may be necessary to
+   change the streaming mode to <literal>on</literal> and cause the same
+   conflicts again so the finish LSN of the failed transaction will be written
+   to the server log. For the usage of finish LSN, please refer to <link
+   linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
+   SKIP</command></link>.
+  </para>

I was not sure about the streaming='off' mode. Is there any reason that only the 'on' mode is mentioned?

02. protocol.sgml

+      <varlistentry>
+       <term>Int64 (XLogRecPtr)</term>
+       <listitem>
+        <para>
+         The LSN of the abort. This field is available since protocol version
+         4.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Int64 (TimestampTz)</term>
+       <listitem>
+        <para>
+         Abort timestamp of the transaction. The value is in number
+         of microseconds since PostgreSQL epoch (2000-01-01). This field is
+         available since protocol version 4.
+        </para>
+       </listitem>
+      </varlistentry>
+

It seems that these changes are in the variablelist for stream commit.
I think they belong to the stream abort message, so they should be moved there.

03. decode.c

-                       ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+                       ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr,
+                                                               commit_time);
                }
-               ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+               ReorderBufferForget(ctx->reorder, xid, buf->origptr, commit_time);

'commit_time' is passed as the argument 'abort_time', which I think may be confusing.
How about adding a comment above, like:
"In case of streamed transactions, they are regarded as being aborted at commit_time"?

04. launcher.c

04.a

+       worker->main_worker_pid = is_subworker ? MyProcPid : 0;

You can use InvalidPid instead of 0.
(I thought pid should be represented by the datatype pid_t, but in some codes it is defined as int...) 

04.b

+       worker->main_worker_pid = 0;

You can use InvalidPid instead of 0, same as above.
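
i.e. for 04.a:

       worker->main_worker_pid = is_subworker ? MyProcPid : InvalidPid;

and for 04.b:

       worker->main_worker_pid = InvalidPid;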

05. origin.c

 void
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, int acquired_by)

IIUC the same slot can be used only when the main apply worker has already acquired the slot
and a subworker for the same subscription tries to acquire it, but this cannot be understood from the comments.
How about adding comments, or an assertion that acquired_by is the same as session_replication_state->acquired_by?
Moreover, acquired_by should be compared with InvalidPid, based on the above comments.

06. proto.c

 void
 logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
-                                                         TransactionId subxid)
+                                                         ReorderBufferTXN *txn, XLogRecPtr abort_lsn,
+                                                         bool write_abort_lsn

I think write_abort_lsn may not be needed,
because abort_lsn can be used to control whether the abort_XXX fields should be filled or not.
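
Just a sketch of what I mean (abort_time here is hypothetical, standing in
for however the timestamp is obtained; InvalidXLogRecPtr would serve as the
"don't send the abort details" marker):

	if (!XLogRecPtrIsInvalid(abort_lsn))
	{
		pq_sendint64(out, abort_lsn);
		pq_sendint64(out, abort_time);
	}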

07. worker.c

+/*
+ * The number of changes during one streaming block (only for apply background
+ * workers)
+ */
+static uint32 nchanges = 0;

This variable is used only by the main apply worker, so the comment seems
incorrect.

v23-0005

08. monitoring.sgml

I cannot decide which of the options proposed in [1] is better, but the following
descriptions are needed in both cases.

08.a

You can add a description that the field 'relid' will be NULL even for apply background worker.

08.b

You can add a description that fields 'received_lsn', 'last_msg_send_time', 'last_msg_receipt_time',
'latest_end_lsn', 'latest_end_time' will be NULL for apply background worker.


[1]:
https://www.postgresql.org/message-id/CAHut%2BPuPwdwZqXBJjtU%2BR9NULbOpxMG%3Di2hmqgg%2B7p0rmK0hrw%40mail.gmail.com
[2]:
https://www.postgresql.org/message-id/TYAPR01MB58660B4732E7F80B322174A3F5629%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Mon, Aug 22, 2022 at 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 4:42 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 19, 2022 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 19, 2022 at 7:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Aug 19, 2022 at 2:36 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > > >
> > > > > > Here are my review comments for the v23-0005 patch:
> > > > > >
> > > > > > ======
> > > > > >
> > > > > > Commit Message says:
> > > > > > main_worker_pid is Process ID of the main apply worker, if this process is a
> > > > > > apply background worker. NULL if this process is a main apply worker or a
> > > > > > synchronization worker.
> > > > > > The new column can make it easier to distinguish main apply worker and apply
> > > > > > background worker.
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Having a column called ‘main_worker_pid’ which is defined to be NULL
> > > > > > if the process *is* the main apply worker does not make any sense to
> > > > > > me.
> > > > > >
> > > > >
> > > > > I haven't read this part of a patch but it seems to me we have
> > > > > something similar for parallel query workers. Refer 'leader_pid'
> > > > > column in pg_stat_activity.
> > > > >
> > > >
> > > > IIUC (from the patch 0005 commit message) the intention is to be able
> > > > to easily distinguish the worker types.
> > > >
> > >
> > > I think it is only to distinguish between leader apply worker and
> > > background apply workers. The tablesync worker can be distinguished
> > > based on relid field.
> > >
> >
> > Right. But that's the reason for my question in the first place - why
> > implement the patch so that the user still has to jump through hoops
> > just to know the worker type information?
> >
>
> I think it is not only to judge worker type but also to know the pid
> of each of the workers during parallel apply. Isn't it better to have
> both main apply worker pid and parallel apply worker pid as we have
> for the parallel query system?
>

OK, thanks for pointing me to that other view. Now that I see the
existing pg_stat_activity already has 'pid' and 'leader_pid' [1], it
suddenly seems more reasonable to do similar for this
pg_stat_subscription.

This background information needs to be conveyed better in the patch
0005 commit message. The current commit message said nothing about
trying to be consistent with the existing stats views; it only says
this field was added to distinguish more easily between the types of
apply workers.

------
[1] https://www.postgresql.org/docs/devel/monitoring-stats.html

Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, Aug 22, 2022 at 10:49 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>

> 04. launcher.c
>
> 04.a
>
> +       worker->main_worker_pid = is_subworker ? MyProcPid : 0;
>
> You can use InvalidPid instead of 0.
> (I thought pid should be represented by the datatype pid_t, but in some codes it is defined as int...)
>
> 04.b
>
> +       worker->main_worker_pid = 0;
>
> You can use InvalidPid instead of 0, same as above.
>
> 05. origin.c
>
>  void
> -replorigin_session_setup(RepOriginId node)
> +replorigin_session_setup(RepOriginId node, int acquired_by)
>
> IIUC the same slot can be used only when the apply main worker has already acquired the slot
> and the subworker for the same subscription tries to acquire, but it cannot understand from comments.
> How about adding comments, or an assertion that acquired_by is same as session_replication_state->acquired_by ?
> Moreover acquired_by should be compared with InvalidPid, based on above comments.
>

In general I agree, and I also suggested to use pid_t and InvalidPid
(at least for all the new code)

In practice, please be aware that InvalidPid is -1 (not 0), so
replacing any existing code (e.g. in replorigin_session_setup) that
was already checking for 0 has to be done with lots of care.
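
For example (hypothetical code, only to illustrate the trap):

	/* today, 0 in acquired_by means "nobody has acquired this origin" */
	if (curstate->acquired_by != 0)
		elog(ERROR, "origin already acquired by PID %d",
			 curstate->acquired_by);

	/*
	 * Mechanically replacing the 0 above with InvalidPid would change the
	 * behaviour, because InvalidPid is -1 while the stored "not acquired"
	 * marker would still be 0.
	 */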

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



From: "kuroda.hayato@fujitsu.com"

Dear Wang,

Following are my comments about v23-0003. Currently I do not have any comments about 0002 and 0004.

09. general

It seems that logicalrep_rel_mark_parallel_apply() is always called when relations are opened on the subscriber side,
but is it really needed? These checks are required only for streaming parallel apply,
so they may not be needed in case of streaming = 'on' or 'off'.

10. commit message

    2) There cannot be any non-immutable functions used by the subscriber-side
    replicated table. Look for functions in the following places:
    * a. Trigger functions
    * b. Column default value expressions and domain constraints
    * c. Constraint expressions
    * d. Foreign keys

"Foreign key" should not be listed here because it is not related with the mutability. I think it should be listed as
3),not d..
 

11. create_subscription.sgml

The constraint about foreign key should be described here.

11. relation.c

11.a

+       CacheRegisterSyscacheCallback(PROCOID,
+                                                                 logicalrep_relmap_reset_parallel_cb,
+                                                                 (Datum) 0);

Isn't another syscache callback needed for pg_type?
Users can add constraints via the ALTER DOMAIN command, but the added constraint may not be checked.
I checked AlterDomainAddConstraint(), and it invalidates only the relcache for pg_type.

11.b

+               /*
+                * If the column is of a DOMAIN type, determine whether
+                * that domain has any CHECK expressions that are not
+                * immutable.
+                */
+               if (get_typtype(att->atttypid) == TYPTYPE_DOMAIN)
+               {

I think the default value of the *domain* must also be checked here.
I tested as follows.

===
1. created a domain that has a default value
CREATE DOMAIN tmp INT DEFAULT 1 CHECK (VALUE > 0);

2. created a table 
CREATE TABLE foo (id tmp PRIMARY KEY);

3. checked pg_attribute and pg_class
select oid, relname, attname, atthasdef from pg_attribute, pg_class where pg_attribute.attrelid = pg_class.oid and
pg_class.relname= 'foo' and attname = 'id';
 
  oid  | relname | attname | atthasdef 
-------+---------+---------+-----------
 16394 | foo     | id      | f
(1 row)

It meant that functions might not be checked because the if-statement `if (att->atthasdef)` became false.
===

12. 015_stream.pl, 016_stream_subxact.pl, 022_twophase_cascade.pl, 023_twophase_stream.pl

-       my ($node_publisher, $node_subscriber, $appname, $is_parallel) = @_;
+       my ($node_publisher, $node_subscriber, $appname) = @_;

Why is the parameter removed? I think the test that waits for the output
from the apply background worker is meaningful.

13. 032_streaming_apply.pl

The filename seems too general because apply background workers are already tested in the above tests.
How about "streaming_apply_constraint" or something?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


From: "houzj.fnst@fujitsu.com"

On Friday, August 19, 2022 4:49 PM Amit Kapila <amit.kapila16@gmail.com>
> 
> On Thu, Aug 18, 2022 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Wed, Aug 17, 2022 at 11:58 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new patches.
> > >
> >
> > Few comments on v23-0001
> > =======================
> >
> 
> Some more comments on v23-0001
> ============================
> 1.
> static bool
>  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s) { ...
> - /* not in streaming mode */
> - if (!in_streamed_transaction)
> + /* Not in streaming mode and not in apply background worker. */ if
> + (!(in_streamed_transaction || am_apply_bgworker()))
>   return false;
> 
> This check appears a bit strange because ideally in bgworker
> in_streamed_transaction should be false. I think we should set
> in_streamed_transaction to true in apply_handle_stream_start() only when we
> are going to write to file. Is there a reason for not doing the same?

No, I removed this.

> 2.
> + {
> + /* This is the main apply worker. */
> + ApplyBgworkerInfo *wstate = apply_bgworker_find(xid);
> +
> + /*
> + * Check if we are processing this transaction using an apply
> + * background worker and if so, send the changes to that worker.
> + */
> + if (wstate)
> + {
> + /* Send STREAM ABORT message to the apply background worker. */
> + apply_bgworker_send_data(wstate, s->len, s->data);
> 
> Why at some places the patch needs to separately fetch ApplyBgworkerInfo
> whereas at other places it directly uses stream_apply_worker to pass the data
> to bgworker.
> 3. Why apply_handle_stream_abort() or apply_handle_stream_prepare()
> doesn't use apply_bgworker_active() to identify whether it needs to send the
> information to bgworker?

I think stream_apply_worker is only valid between STREAM_START and STREAM_END,
but it seems that is not clear from the code. So I added some comments and slightly
refactored the code.


> 4. In apply_handle_stream_prepare(), apply_handle_stream_abort(), and some
> other similar functions, the patch handles three cases (a) apply background
> worker, (b) sending data to bgworker, (c) handling for streamed transaction in
> apply worker. I think the code will look better if you move the respective code
> for all three cases into separate functions. Surely, if the code to deal with each
> of the cases is less then we don't need to move it to a separate function.

Refactored and simplified.

> 5.
> @@ -1088,24 +1177,78 @@ apply_handle_stream_prepare(StringInfo s) { ...
> + in_remote_transaction = false;
> +
> + /* Unlink the files with serialized changes and subxact info. */
> + stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid); } }
> 
>   in_remote_transaction = false;
> ...
> 
> We don't need to in_remote_transaction to false in multiple places.

Removed.

> 6.
> @@ -1177,36 +1311,93 @@ apply_handle_stream_start(StringInfo s) { ...
> ...
> + if (am_apply_bgworker())
>   {
> - MemoryContext oldctx;
> -
> - oldctx = MemoryContextSwitchTo(ApplyContext);
> + /*
> + * Make sure the handle apply_dispatch methods are aware we're in a
> + * remote transaction.
> + */
> + in_remote_transaction = true;
> 
> - MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
> - FileSetInit(MyLogicalRepWorker->stream_fileset);
> + /* Begin the transaction. */
> + AcceptInvalidationMessages();
> + maybe_reread_subscription();
> 
> - MemoryContextSwitchTo(oldctx);
> + StartTransactionCommand();
> + BeginTransactionBlock();
> + CommitTransactionCommand();
>   }
> ...
> 
> Why do we need to start a transaction here? Why can't it be done via
> begin_replication_step() during the first operation apply? Is it because we may
> need to define a save point in bgworker and we don't that information
> beforehand? If so, then also, can't it be handled by
> begin_replication_step() either by explicitly passing the information or
> checking it there and then starting a transaction block? In any case, please add
> a few comments to explain why this separate handling is required for
> bgworker?

The transaction block is used to define the savepoint, and I moved this
code to the place where the savepoint is defined, which looks better now.

> 7. When we are already setting bgworker status as APPLY_BGWORKER_BUSY in
> apply_bgworker_setup_dsm() then why do we need to set it again in
> apply_bgworker_start()?

Removed.

> 8. It is not clear to me how APPLY_BGWORKER_EXIT status is used. Is it required
> for the cases where bgworker exists due to some error and then apply worker
> uses it to detect that and exits? How other bgworkers would notice this, is it
> done via apply_bgworker_check_status()?

It was used to detect the unexpected exit of a bgworker, and I have changed the design
of this so that it is now similar to what we have in parallel query.

Attach the new version patch set (v24), which addresses the above comments.
Besides, I added some logic that tries to stop the bgworker at transaction end
if there are enough workers in the pool.

Best regards,
Hou zj


From: "houzj.fnst@fujitsu.com"

On Mon, Aug 22, 2022 20:50 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
> 
> Thank you for updating the patch! Followings are comments about 
> v23-0001 and v23-0005.

Thanks for your comments.

> v23-0001
> 
> 01. logical-replication.sgml
> 
> +  <para>
> +   When the streaming mode is <literal>parallel</literal>, the finish LSN of
> +   failed transactions may not be logged. In that case, it may be necessary to
> +   change the streaming mode to <literal>on</literal> and cause the same
> +   conflicts again so the finish LSN of the failed transaction will be written
> +   to the server log. For the usage of finish LSN, please refer to <link
> +   linkend="sql-altersubscription"><command>ALTER SUBSCRIPTION ...
> +   SKIP</command></link>.
> +  </para>
> 
> I was not sure about streaming='off' mode. Is there any reasons that 
> only ON mode is focused?

Added off.

> 02. protocol.sgml
> 
> +      <varlistentry>
> +       <term>Int64 (XLogRecPtr)</term>
> +       <listitem>
> +        <para>
> +         The LSN of the abort. This field is available since protocol version
> +         4.
> +        </para>
> +       </listitem>
> +      </varlistentry>
> +
> +      <varlistentry>
> +       <term>Int64 (TimestampTz)</term>
> +       <listitem>
> +        <para>
> +         Abort timestamp of the transaction. The value is in number
> +         of microseconds since PostgreSQL epoch (2000-01-01). This field is
> +         available since protocol version 4.
> +        </para>
> +       </listitem>
> +      </varlistentry>
> +
> 
> It seems that changes are in the variablelist for stream commit.
> I think these are included in the stream abort message, so it should be moved.

Fixed.

> 03. decode.c
> 
> -                       ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf-
> >origptr);
> +                       ReorderBufferForget(ctx->reorder, 
> + parsed->subxacts[i], buf-
> >origptr,
> +                                                               
> + commit_time);
>                 }
> -               ReorderBufferForget(ctx->reorder, xid, buf->origptr);
> +               ReorderBufferForget(ctx->reorder, xid, buf->origptr, 
> + commit_time);
> 
> 'commit_time' has been passed as argument 'abort_time', I think it may 
> be confusing.
> How about adding a comment above, like:
> "In case of streamed transactions, they are regarded as being aborted 
> at commit_time"

IIRC, I feel a comment above the loop might be clearer about this,
but I will think about it again.

> 04. launcher.c
> 
> 04.a
> 
> +       worker->main_worker_pid = is_subworker ? MyProcPid : 0;
> 
> You can use InvalidPid instead of 0.
> (I thought pid should be represented by the datatype pid_t, but in 
> some codes it is defined as int...)
> 
> 04.b
> 
> +       worker->main_worker_pid = 0;
> 
> You can use InvalidPid instead of 0, same as above.

Improved

> 05. origin.c
> 
>  void
> -replorigin_session_setup(RepOriginId node)
> +replorigin_session_setup(RepOriginId node, int acquired_by)
> 
> IIUC the same slot can be used only when the apply main worker has 
> already acquired the slot and the subworker for the same subscription 
> tries to acquire, but it cannot understand from comments.
> How about adding comments, or an assertion that acquired_by is same as 
> session_replication_state->acquired_by ?
> Moreover acquired_by should be compared with InvalidPid, based on 
> above comments.

I think we already try to check whether 'acquired_by' and the slot's
acquired_by are equal inside this function.

I am not sure if it's a good idea to use InvalidPid here, as we set
session_replication_state->acquired_by (an int) to 0 (instead of -1) to indicate
that no worker has acquired it.

> 06. proto.c
> 
>  void
>  logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
> -                                                         TransactionId subxid)
> +                                                         ReorderBufferTXN *txn, XLogRecPtr abort_lsn,
> +                                                         bool 
> + write_abort_lsn
> 
> I think write_abort_lsn may be not needed, because abort_lsn can be 
> used for controlling whether abort_XXX fields should be filled or not.

I think if the subscriber's version is lower than 16 (which won't handle the abort_XXX fields),
then we don't need to send the abort_XXX fields either.

> 07. worker.c
> 
> +/*
> + * The number of changes during one streaming block (only for apply
> background
> + * workers)
> + */
> +static uint32 nchanges = 0;
> 
> This variable is used only by the main apply worker, so the comment 
> seems not correct.
> How about "...(only for SUBSTREAM_PARALLEL case)"?

The previous comment seemed a bit confusing. I tried to improve it to this:
```
The number of changes sent to apply background workers during one streaming block.
```

> v23-0005
> 
> 08. monitoring.sgml
> 
> I cannot decide which option proposed in [1] is better, but followings 
> descriptions are needed in both cases.
> (In [2] I had intended to propose something like option 2)
> 
> 08.a
> 
> You can add a description that the field 'relid' will be NULL even for 
> apply background worker.
> 
> 08.b
> 
> You can add a description that fields 'received_lsn', 
> 'last_msg_send_time', 'last_msg_receipt_time', 'latest_end_lsn', 
> 'latest_end_time' will be NULL for apply background worker.

Improved

Regards,
Wang wei


From: "houzj.fnst@fujitsu.com"

On Thur, Aug 18, 2022 11:44 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are my review comments for patch v21-0001:
> 
> Note - There are some "general" comments which will result in lots of 
> smaller changes. The subsequent "detailed" review comments have some 
> overlap with these general comments but I expect some will be missed 
> so please search/replace to fix all code related to those general 
> comments.

Thanks for your comments.

> 1. GENERAL - main_worker_pid and replorigin_session_setup
> 
> Quite a few of my subsequent review comments below are related to the 
> somewhat tricky (IMO) change to the code for this area. Here is a 
> summary of some things that can be done to clean/simplify this logic.
> 
> 1a.
> Make the existing replorigin_session_setup function just be a wrapper 
> that delegates to the other function passing the acquired_by as 0.
> This is because in every case but one (in the apply bg worker main) we 
> are always passing 0, and IMO there is no need to spread the messy 
> extra param to places that do not use it.

Not sure about this. I feel an interface change should
be fine in a major release.

> 17. src/backend/replication/logical/applybgworker.c - 
> LogicalApplyBgworkerMain
> 
> + MyLogicalRepWorker->last_send_time = MyLogicalRepWorker-
> >last_recv_time =
> + MyLogicalRepWorker->reply_time = 0;
> +
> + InitializeApplyWorker();
> 
> Lots of things happen within InitializeApplyWorker(). I think this 
> call deserves at least some comment to say it does lots of common 
> initialization. And same for the other caller or this in the apply 
> main worker.

I feel we can refer to the comments above/in the function InitializeApplyWorker.

> 19.
> + toc = shm_toc_create(PG_LOGICAL_APPLY_SHM_MAGIC,
> dsm_segment_address(seg),
> + segsize);

Since toc is just the same as the input address, which I think should not be NULL,
I think it's fine to skip the check here like we did in other code.

shm_toc_create(uint64 magic, void *address, Size nbytes)
{
    shm_toc    *toc = (shm_toc *) address;

> 20. src/backend/replication/logical/applybgworker.c - 
> apply_bgworker_setup
> 
> I think this function could be refactored to be cleaner and share more 
> common logic.
> 
> SUGGESTION
> 
> /* Setup shared memory, and attempt launch. */ if 
> (apply_bgworker_setup_dsm(wstate))
> {
> bool launched;
> launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> MySubscription->oid,
> MySubscription->name,
> MyLogicalRepWorker->userid,
> InvalidOid,
> dsm_segment_handle(wstate->dsm_seg));
> if (launched)
> {
> ApplyBgworkersList = lappend(ApplyBgworkersList, wstate); 
> MemoryContextSwitchTo(oldcontext);
> return wstate;
> }
> else
> {
> dsm_detach(wstate->dsm_seg);
> wstate->dsm_seg = NULL;
> }
> }
> 
> pfree(wstate);
> MemoryContextSwitchTo(oldcontext);
> return NULL;

Not sure about this.

> 36. src/backend/replication/logical/tablesync.c - 
> process_syncing_tables
> 
> @@ -589,6 +590,9 @@ process_syncing_tables_for_apply(XLogRecPtr
> current_lsn)
>  void
>  process_syncing_tables(XLogRecPtr current_lsn)  {
> + if (am_apply_bgworker())
> + return;
> +
> 
> Perhaps should be a comment to describe why process_syncing_tables 
> should be skipped for the apply background worker?

I might refactor this function soon, so I didn't change it for now.
But I will consider it.

> 39. src/backend/replication/logical/worker.c - 
> handle_streamed_transaction
> 
> + /* Not in streaming mode and not in apply background worker. */ if 
> + (!(in_streamed_transaction || am_apply_bgworker()))
>   return false;
> IMO if you wanted to write the comment in that way then the code 
> should have matched it more closely like:
> if (!in_streamed_transaction && !am_apply_bgworker())
> 
> OTOH, if you want to keep the code as-is then the comment should be 
> worded slightly differently.

I feel both the in_streamed_transaction flag and being in a bgworker indicate
that we are in streaming mode. So it seems the original /* Not in streaming
mode */ comment should be fine.

> 44. src/backend/replication/logical/worker.c - InitializeApplyWorker
> 
> 
> +/*
> + * Initialize the databse connection, in-memory subscription and 
> +necessary
> + * config options.
> + */
>  void
> -ApplyWorkerMain(Datum main_arg)
> 44b.
> Should there be some more explanation in this comment to say that this 
> is common code for both the appl main workers and apply background 
> workers?
> 
> 44c.
> Following on from #44b, consider renaming this to something like
> CommonApplyWorkerInit() to emphasize it is called from multiple 
> places?

Not sure about this. If we change the bgworker name to parallel
apply worker in the future, it might be worth emphasizing this, so
I will consider it.

> 52.
> 
> +/* Apply background worker setup and interactions */ extern 
> +ApplyBgworkerInfo *apply_bgworker_start(TransactionId xid); extern 
> +ApplyBgworkerInfo *apply_bgworker_find(TransactionId xid); extern 
> +void apply_bgworker_wait_for(ApplyBgworkerInfo *wstate,  
> +ApplyBgworkerStatus wait_for_status); extern void 
> +apply_bgworker_send_data(ApplyBgworkerInfo *wstate, Size
> nbytes,
> + const void *data);
> +extern void apply_bgworker_free(ApplyBgworkerInfo *wstate); extern 
> +void apply_bgworker_check_status(void);
> +extern void apply_bgworker_set_status(ApplyBgworkerStatus status); 
> +extern void apply_bgworker_subxact_info_add(TransactionId 
> +current_xid); extern void apply_bgworker_savepoint_name(Oid suboid, Oid relid,
> +   char *spname, int szsp);
> 
> This big block of similarly named externs might as well be in 
> alphabetical order instead of apparently random.

I think Amit has a good idea in [2].
So I tried to reorder these based on related functionality.

The reply to your comments #4.2 for patch 0004 in [3]:
> 4.2
> 
> @@ -166,17 +175,6 @@ CREATE TRIGGER tri_tab1_unsafe  BEFORE INSERT ON 
> public.test_tab1  FOR EACH ROW EXECUTE PROCEDURE 
> trigger_func_tab1_unsafe();  ALTER TABLE test_tab1 ENABLE REPLICA 
> TRIGGER tri_tab1_unsafe;
> -
> -CREATE FUNCTION trigger_func_tab1_safe() RETURNS TRIGGER AS \$\$
> -  BEGIN
> -    RAISE NOTICE 'test for safe trigger function';
> - RETURN NEW;
> -  END
> -\$\$ language plpgsql;
> -ALTER FUNCTION trigger_func_tab1_safe IMMUTABLE; -CREATE TRIGGER 
> tri_tab1_safe -BEFORE INSERT ON public.test_tab1 -FOR EACH ROW EXECUTE 
> PROCEDURE trigger_func_tab1_safe();  });
> 
> I didn't understand why all this trigger_func_tab1_safe which was 
> added in patch 0003 is now getting removed in patch 0004. Maybe there 
> is some good reason, but it doesn't seem right to be adding code in 
> one patch and then removing it again in the next patch.

Because in 0003 we need to manually do something to let the test recover
from the constraint failure, while in 0004 it can automatically retry.

The rest of your comments have been addressed as suggested.

[1] - https://www.postgresql.org/message-id/CAHut%2BPuAxW57fowiMrn%3D3%3D53sagmehiTSW0o1Q52MpR3phUmyw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1KpuQAk_fiqVXy16WkDrKPBwA9E61VpvLfkse-o31NNVA%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAHut%2BPtCRkTT_KNaqA5Fn6_T38BXtFn4Eb3Ct-AbNko91s-cjQ%40mail.gmail.com

Best regards,
Hou zj


From: "houzj.fnst@fujitsu.com"

On Wednesday, August 24, 2022 9:47 PM houzj.fnst@fujitsu.com wrote:
> 
> On Friday, August 19, 2022 4:49 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > On Thu, Aug 18, 2022 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Wed, Aug 17, 2022 at 11:58 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > Attach the new patches.
> > > >
> > >
> > > Few comments on v23-0001
> > > =======================
> > >
> >
> > Some more comments on v23-0001
> > ============================
> > 1.
> > static bool
> >  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s) { ...
> > - /* not in streaming mode */
> > - if (!in_streamed_transaction)
> > + /* Not in streaming mode and not in apply background worker. */ if
> > + (!(in_streamed_transaction || am_apply_bgworker()))
> >   return false;
> >
> > This check appears a bit strange because ideally in bgworker
> > in_streamed_transaction should be false. I think we should set
> > in_streamed_transaction to true in apply_handle_stream_start() only
> > when we are going to write to file. Is there a reason for not doing the same?
> 
> No, I removed this.
> 
> > 2.
> > + {
> > + /* This is the main apply worker. */ ApplyBgworkerInfo *wstate =
> > + apply_bgworker_find(xid);
> > +
> > + /*
> > + * Check if we are processing this transaction using an apply
> > + * background worker and if so, send the changes to that worker.
> > + */
> > + if (wstate)
> > + {
> > + /* Send STREAM ABORT message to the apply background worker. */
> > + apply_bgworker_send_data(wstate, s->len, s->data);
> >
> > Why at some places the patch needs to separately fetch
> > ApplyBgworkerInfo whereas at other places it directly uses
> > stream_apply_worker to pass the data to bgworker.
> > 3. Why apply_handle_stream_abort() or apply_handle_stream_prepare()
> > doesn't use apply_bgworker_active() to identify whether it needs to
> > send the information to bgworker?
> 
> I think stream_apply_worker is only valid between STREAM_START and
> STREAM_END, But it seems it's not clear from the code. So I added some
> comments and slightly refactor the code.
> 
> 
> > 4. In apply_handle_stream_prepare(), apply_handle_stream_abort(), and
> > some other similar functions, the patch handles three cases (a) apply
> > background worker, (b) sending data to bgworker, (c) handling for
> > streamed transaction in apply worker. I think the code will look
> > better if you move the respective code for all three cases into
> > separate functions. Surely, if the code to deal with each of the cases is less then
> we don't need to move it to a separate function.
> 
> Refactored and simplified.
> 
> > 5.
> > @@ -1088,24 +1177,78 @@ apply_handle_stream_prepare(StringInfo s) { ...
> > + in_remote_transaction = false;
> > +
> > + /* Unlink the files with serialized changes and subxact info. */
> > + stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid); }
> > + }
> >
> >   in_remote_transaction = false;
> > ...
> >
> > We don't need to in_remote_transaction to false in multiple places.
> 
> Removed.
> 
> > 6.
> > @@ -1177,36 +1311,93 @@ apply_handle_stream_start(StringInfo s) { ...
> > ...
> > + if (am_apply_bgworker())
> >   {
> > - MemoryContext oldctx;
> > -
> > - oldctx = MemoryContextSwitchTo(ApplyContext);
> > + /*
> > + * Make sure the handle apply_dispatch methods are aware we're in a
> > + * remote transaction.
> > + */
> > + in_remote_transaction = true;
> >
> > - MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
> > - FileSetInit(MyLogicalRepWorker->stream_fileset);
> > + /* Begin the transaction. */
> > + AcceptInvalidationMessages();
> > + maybe_reread_subscription();
> >
> > - MemoryContextSwitchTo(oldctx);
> > + StartTransactionCommand();
> > + BeginTransactionBlock();
> > + CommitTransactionCommand();
> >   }
> > ...
> >
> > Why do we need to start a transaction here? Why can't it be done via
> > begin_replication_step() during the first operation apply? Is it
> > because we may need to define a save point in bgworker and we don't
> > that information beforehand? If so, then also, can't it be handled by
> > begin_replication_step() either by explicitly passing the information
> > or checking it there and then starting a transaction block? In any
> > case, please add a few comments to explain why this separate handling
> > is required for bgworker?
> 
> The transaction block is used to define the savepoint, and I moved this code to
> the place where the savepoint is defined, which looks better now.
> 
> > 7. When we are already setting bgworker status as APPLY_BGWORKER_BUSY
> > in
> > apply_bgworker_setup_dsm() then why do we need to set it again in
> > apply_bgworker_start()?
> 
> Removed.
> 
> > 8. It is not clear to me how APPLY_BGWORKER_EXIT status is used. Is it
> > required for the cases where bgworker exists due to some error and
> > then apply worker uses it to detect that and exits? How other
> > bgworkers would notice this, is it done via apply_bgworker_check_status()?
> 
> It was used to detect the unexpected exit of bgworker and I have changed the
> design of this which is now similar to what we have in parallel query.
> 
> Attach the new version patch set (v24) which addresses the above comments.
> Besides, I added some logic which tries to stop the bgworker at transaction end if
> there are enough workers in the pool.

Also attached are the results of a performance test based on the v23 patch.

This test used synchronous logical replication and compared SQL execution
times before and after applying the patch, varying
logical_decoding_work_mem.

The test was performed ten times, and the average of the middle eight was taken.

The results are as follows. The bar chart and the details of the test are attached.

RESULT - bulk insert (5kk)
----------------------------------
logical_decoding_work_mem   64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
HEAD                        46.940  46.428  46.663  46.373  46.339  46.838  50.346  50.536  50.452  50.582  47.491
patched                     33.942  33.780  30.760  30.760  29.992  30.076  30.827  33.420  33.966  34.133  31.096

Across the different logical_decoding_work_mem sizes, the patched version takes
about 30% ~ 40% less time, which looks good.

Some other tests are still in progress; I might share them later.

Best regards,
Hou zj

Attachment
On Wed, Aug 24, 2022 at 7:17 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, August 19, 2022 4:49 PM Amit Kapila <amit.kapila16@gmail.com>
> >
>
> > 8. It is not clear to me how APPLY_BGWORKER_EXIT status is used. Is it required
> > for the cases where bgworker exists due to some error and then apply worker
> > uses it to detect that and exits? How other bgworkers would notice this, is it
> > done via apply_bgworker_check_status()?
>
> It was used to detect the unexpected exit of bgworker and I have changed the design
> of this which is now similar to what we have in parallel query.
>

Thanks, this looks better.

> Attach the new version patch set (v24) which addresses the above comments.
> Besides, I added some logic which tries to stop the bgworker at transaction end
> if there are enough workers in the pool.
>

I think this deserves an explanation in worker.c under the title:
"Separate background workers" in the patch.

Review comments for v24-0001
=========================
1.
+ * cost of searhing the hash table

/searhing/searching

2.
+/*
+ * Apply background worker states.
+ */
+typedef enum ApplyBgworkerState
+{
+ APPLY_BGWORKER_BUSY, /* assigned to a transaction */
+ APPLY_BGWORKER_FINISHED /* transaction is completed */
+} ApplyBgworkerState;

Now, that there are just two states, can we think to represent them
via a flag ('available'/'in_use') or do you see a downside with that
as compared to the current approach?
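For example, something like the below in the shared struct (just a sketch; the
field name and comment are only suggestions):

    /* Set to true when the worker is assigned to a transaction. */
    bool        in_use;

Then the places that currently wait for APPLY_BGWORKER_FINISHED could instead
wait for in_use to become false.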

3.
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, int apply_leader_pid)

I have mentioned previously that we don't need anything specific to
apply worker/leader in this API, so why this change? The other idea
that occurred to me is that can we use replorigin_session_reset()
before sending the commit message to bgworker and then do the session
setup in bgworker only to handle the commit/abort/prepare message. We
also need to set it again for the leader apply worker after the leader
worker completes the wait for bgworker to finish the commit handling.
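To be a bit more concrete, I am thinking of a flow roughly like the below (only
a sketch; originid stands for whatever origin id the apply worker has already
looked up, and error handling is omitted):

    /* leader, before sending the commit message to the bgworker */
    replorigin_session_reset();
    apply_bgworker_send_data(wstate, s->len, s->data);

    /* bgworker, while handling the commit/abort/prepare message */
    replorigin_session_setup(originid);
    /* ... apply the commit ... */
    replorigin_session_reset();

    /* leader, after it has finished waiting for the bgworker */
    replorigin_session_setup(originid);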

4. Unlike parallel query, here we seem to be creating separate DSM for
each worker, and probably the difference is due to the fact that here
we don't know upfront how many workers will actually be required. If
so, can we write some comments for the same in worker.c where you have
explained the parallel bgworker stuff?

5.
/*
- * Handle streamed transactions.
+ * Handle streamed transactions for both the main apply worker and the apply
+ * background workers.

Shall we use leader apply worker in the above comment? Also, check
other places in the patch for similar changes.

6.
+ else
+ {

- /* open the spool file for this transaction */
- stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+ /* notify handle methods we're processing a remote transaction */
+ in_streamed_transaction = true;

There is a spurious line after else {. Also, the comment could be
slightly improved: "/* notify handle methods we're processing a remote
in-progress transaction */"

7. The checks in various apply_handle_stream_* functions have improved
as compared to the previous version but I think we can still improve
those. One idea could be to use a separate function to decide the
action we want to take and then based on it, the caller can take
appropriate action. Using a similar idea, we can improve the checks in
handle_streamed_transaction() as well.
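As an illustration (only a sketch; the names are placeholders), the decision
function could return something like:

    typedef enum
    {
        ACTION_APPLY_IN_LEADER,     /* apply the change directly */
        ACTION_SERIALIZE_TO_FILE,   /* write the change to the spool file */
        ACTION_SEND_TO_BGWORKER,    /* pass the change to the bgworker */
        ACTION_APPLY_IN_BGWORKER    /* we are the bgworker, apply it */
    } ApplyAction;

and then each apply_handle_stream_* function and handle_streamed_transaction()
can just switch on the returned action instead of repeating the checks.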

8.
+ else if ((winfo = apply_bgworker_find(xid)))
+ {
+ /* Send STREAM ABORT message to the apply background worker. */
+ apply_bgworker_send_data(winfo, s->len, s->data);
+
+ /*
+ * After sending the data to the apply background worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ if (subxid == xid)
+ {
+ apply_bgworker_wait_for(winfo, APPLY_BGWORKER_FINISHED);
+ apply_bgworker_free(winfo);
+ }
+ }
+ else
+ /*
+ * We are in main apply worker and the transaction has been
+ * serialized to file.
+ */
+ serialize_stream_abort(xid, subxid);

In the last else block, you can use {} to make it consistent with
other if, else checks.

9.
+void
+ApplyBgworkerMain(Datum main_arg)
+{
+ volatile ApplyBgworkerShared *shared;
+
+ dsm_handle handle;

Is there a need to keep this empty line between the above two declarations?

10.
+ /*
+ * Attach to the message queue.
+ */
+ mq = shm_toc_lookup(toc, APPLY_BGWORKER_KEY_ERROR_QUEUE, false);

Here, we should say error queue in the comments.

11.
+ /*
+ * Attach to the message queue.
+ */
+ mq = shm_toc_lookup(toc, APPLY_BGWORKER_KEY_ERROR_QUEUE, false);
+ shm_mq_set_sender(mq, MyProc);
+ error_mqh = shm_mq_attach(mq, seg, NULL);
+ pq_redirect_to_shm_mq(seg, error_mqh);
+
+ /*
+ * Now, we have initialized DSM. Attach to slot.
+ */
+ logicalrep_worker_attach(worker_slot);
+ MyParallelShared->logicalrep_worker_generation =
MyLogicalRepWorker->generation;
+ MyParallelShared->logicalrep_worker_slot_no = worker_slot;
+
+ pq_set_parallel_leader(MyLogicalRepWorker->apply_leader_pid,
+    InvalidBackendId);

Is there a reason to set parallel_leader immediately after
pq_redirect_to_shm_mq() as we are doing in parallel.c?

12.
if (pq_mq_parallel_leader_pid != 0)
+ {
  SendProcSignal(pq_mq_parallel_leader_pid,
     PROCSIG_PARALLEL_MESSAGE,
     pq_mq_parallel_leader_backend_id);

+ /*
+ * XXX maybe we can reuse the PROCSIG_PARALLEL_MESSAGE instead of
+ * introducing a new signal reason.
+ */
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_APPLY_BGWORKER_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ }

I think we don't need to send both signals. Here, we can check if this
is a parallel worker (IsParallelWorker), then send
PROCSIG_PARALLEL_MESSAGE, otherwise, send
PROCSIG_APPLY_BGWORKER_MESSAGE message. In the else part, we can have
an assert to ensure it is an apply bgworker.
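i.e. something along the lines of the below (just a sketch):

    if (pq_mq_parallel_leader_pid != 0)
    {
        if (IsParallelWorker())
            SendProcSignal(pq_mq_parallel_leader_pid,
                           PROCSIG_PARALLEL_MESSAGE,
                           pq_mq_parallel_leader_backend_id);
        else
        {
            Assert(am_apply_bgworker());
            SendProcSignal(pq_mq_parallel_leader_pid,
                           PROCSIG_APPLY_BGWORKER_MESSAGE,
                           pq_mq_parallel_leader_backend_id);
        }
    }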

-- 
With Regards,
Amit Kapila.



On Thu, Aug 11, 2022 at 12:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Since we will later consider applying non-streamed transactions in parallel, I
> > think "apply streaming worker" might not be very suitable. I think PostgreSQL
> > also has the worker "parallel worker", so for "apply parallel worker" and
> > "apply background worker", I feel that "apply background worker" will make the
> > relationship between workers more clear. ("[main] apply worker" and "apply
> > background worker")
> >
>
> But, on similar lines, we do have vacuumparallel.c for parallelizing
> index vacuum. I agree with Kuroda-San on this point that the currently
> proposed terminology doesn't sound to be very clear. The other options
> that come to my mind are "apply streaming transaction worker", "apply
> parallel worker" and file name could be applystreamworker.c,
> applyparallel.c, applyparallelworker.c, etc. I see the point why you
> are hesitant in calling it "apply parallel worker" but it is quite
> possible that even for non-streamed xacts, we will share quite some
> part of this code.

I think the "apply streaming transaction worker" is a good option
w.r.t. what we are currently doing but then in the future, if we want
to apply normal transactions in parallel then we will have to again
change the name.  So I think  "apply parallel worker" might look
better and the file name could be "applyparallelworker.c" or just
"parallelworker.c". Although "parallelworker.c" file name is a bit
generic but we already have worker.c so w.r.t that "parallelworker.c"
should just look fine.  At least that is what I think.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Aug 26, 2022 at 9:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 12:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Since we will later consider applying non-streamed transactions in parallel, I
> > > think "apply streaming worker" might not be very suitable. I think PostgreSQL
> > > also has the worker "parallel worker", so for "apply parallel worker" and
> > > "apply background worker", I feel that "apply background worker" will make the
> > > relationship between workers more clear. ("[main] apply worker" and "apply
> > > background worker")
> > >
> >
> > But, on similar lines, we do have vacuumparallel.c for parallelizing
> > index vacuum. I agree with Kuroda-San on this point that the currently
> > proposed terminology doesn't sound to be very clear. The other options
> > that come to my mind are "apply streaming transaction worker", "apply
> > parallel worker" and file name could be applystreamworker.c,
> > applyparallel.c, applyparallelworker.c, etc. I see the point why you
> > are hesitant in calling it "apply parallel worker" but it is quite
> > possible that even for non-streamed xacts, we will share quite some
> > part of this code.
>
> I think the "apply streaming transaction worker" is a good option
> w.r.t. what we are currently doing but then in the future, if we want
> to apply normal transactions in parallel then we will have to again
> change the name.  So I think  "apply parallel worker" might look
> better and the file name could be "applyparallelworker.c" or just
> "parallelworker.c". Although "parallelworker.c" file name is a bit
> generic but we already have worker.c so w.r.t that "parallelworker.c"
> should just look fine.
>

Yeah, based on that theory, we can go with parallelworker.c, but my vote
among the above is to go with applyparallelworker.c as that is clearer.
I feel worker.c is already not a very good name for where we do the
apply-related work, so it won't be advisable to go further down that
path.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, August 25, 2022 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Aug 24, 2022 at 7:17 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, August 19, 2022 4:49 PM Amit Kapila <amit.kapila16@gmail.com>
> > >
> >
> > > 8. It is not clear to me how APPLY_BGWORKER_EXIT status is used. Is it
> required
> > > for the cases where bgworker exists due to some error and then apply
> worker
> > > uses it to detect that and exits? How other bgworkers would notice this, is
> it
> > > done via apply_bgworker_check_status()?
> >
> > It was used to detect the unexpected exit of bgworker and I have changed
> the design
> > of this which is now similar to what we have in parallel query.
> >
> 
> Thanks, this looks better.
> 
> > Attach the new version patch set (v24) which addresses the above comments.
> > Besides, I added some logic which tries to stop the bgworker at transaction end
> > if there are enough workers in the pool.
> >
> 
> I think this deserves an explanation in worker.c under the title:
> "Separate background workers" in the patch.
> 
> Review comments for v24-0001

Thanks for the comments.

> =========================
> 1.
> + * cost of searhing the hash table
> 
> /searhing/searching

Fixed.

> 2.
> +/*
> + * Apply background worker states.
> + */
> +typedef enum ApplyBgworkerState
> +{
> + APPLY_BGWORKER_BUSY, /* assigned to a transaction */
> + APPLY_BGWORKER_FINISHED /* transaction is completed */
> +} ApplyBgworkerState;
> 
> Now, that there are just two states, can we think to represent them
> via a flag ('available'/'in_use') or do you see a downside with that
> as compared to the current approach?

Changed to in_use.

> 3.
> -replorigin_session_setup(RepOriginId node)
> +replorigin_session_setup(RepOriginId node, int apply_leader_pid)
> 
> I have mentioned previously that we don't need anything specific to
> apply worker/leader in this API, so why this change? The other idea
> that occurred to me is that can we use replorigin_session_reset()
> before sending the commit message to bgworker and then do the session
> setup in bgworker only to handle the commit/abort/prepare message. We
> also need to set it again for the leader apply worker after the leader
> worker completes the wait for bgworker to finish the commit handling.

I have reverted the changes related to replorigin_session_setup and used
the suggested approach. I also did some simple performance tests for this approach
and didn't see any obvious overhead, as replorigin_session_setup is invoked only
once per streaming transaction.

> 4. Unlike parallel query, here we seem to be creating separate DSM for
> each worker, and probably the difference is due to the fact that here
> we don't know upfront how many workers will actually be required. If
> so, can we write some comments for the same in worker.c where you have
> explained about parallel bgwroker stuff?

Added.

> 5.
> /*
> - * Handle streamed transactions.
> + * Handle streamed transactions for both the main apply worker and the apply
> + * background workers.
> 
> Shall we use leader apply worker in the above comment? Also, check
> other places in the patch for similar changes.

Changed.

> 6.
> + else
> + {
> 
> - /* open the spool file for this transaction */
> - stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
> + /* notify handle methods we're processing a remote transaction */
> + in_streamed_transaction = true;
> 
> There is a spurious line after else {. Also, the comment could be
> slightly improved: "/* notify handle methods we're processing a remote
> in-progress transaction */"

Changed.

> 7. The checks in various apply_handle_stream_* functions have improved
> as compared to the previous version but I think we can still improve
> those. One idea could be to use a separate function to decide the
> action we want to take and then based on it, the caller can take
> appropriate action. Using a similar idea, we can improve the checks in
> handle_streamed_transaction() as well.

Improved as suggested.

> 8.
> + else if ((winfo = apply_bgworker_find(xid)))
> + {
> + /* Send STREAM ABORT message to the apply background worker. */
> + apply_bgworker_send_data(winfo, s->len, s->data);
> +
> + /*
> + * After sending the data to the apply background worker, wait for
> + * that worker to finish. This is necessary to maintain commit
> + * order which avoids failures due to transaction dependencies and
> + * deadlocks.
> + */
> + if (subxid == xid)
> + {
> + apply_bgworker_wait_for(winfo, APPLY_BGWORKER_FINISHED);
> + apply_bgworker_free(winfo);
> + }
> + }
> + else
> + /*
> + * We are in main apply worker and the transaction has been
> + * serialized to file.
> + */
> + serialize_stream_abort(xid, subxid);
> 
> In the last else block, you can use {} to make it consistent with
> other if, else checks.
> 
> 9.
> +void
> +ApplyBgworkerMain(Datum main_arg)
> +{
> + volatile ApplyBgworkerShared *shared;
> +
> + dsm_handle handle;
> 
> Is there a need to keep this empty line between the above two declarations?

Removed.

> 10.
> + /*
> + * Attach to the message queue.
> + */
> + mq = shm_toc_lookup(toc, APPLY_BGWORKER_KEY_ERROR_QUEUE, false);
> 
> Here, we should say error queue in the comments.

Fixed.

> 11.
> + /*
> + * Attach to the message queue.
> + */
> + mq = shm_toc_lookup(toc, APPLY_BGWORKER_KEY_ERROR_QUEUE, false);
> + shm_mq_set_sender(mq, MyProc);
> + error_mqh = shm_mq_attach(mq, seg, NULL);
> + pq_redirect_to_shm_mq(seg, error_mqh);
> +
> + /*
> + * Now, we have initialized DSM. Attach to slot.
> + */
> + logicalrep_worker_attach(worker_slot);
> + MyParallelShared->logicalrep_worker_generation =
> MyLogicalRepWorker->generation;
> + MyParallelShared->logicalrep_worker_slot_no = worker_slot;
> +
> + pq_set_parallel_leader(MyLogicalRepWorker->apply_leader_pid,
> +    InvalidBackendId);
> 
> Is there a reason to set parallel_leader immediately after
> pq_redirect_to_shm_mq() as we are doing parallel.c?

Moved the code.

> 12.
> if (pq_mq_parallel_leader_pid != 0)
> + {
>   SendProcSignal(pq_mq_parallel_leader_pid,
>      PROCSIG_PARALLEL_MESSAGE,
>      pq_mq_parallel_leader_backend_id);
> 
> + /*
> + * XXX maybe we can reuse the PROCSIG_PARALLEL_MESSAGE instead of
> + * introducing a new signal reason.
> + */
> + SendProcSignal(pq_mq_parallel_leader_pid,
> +    PROCSIG_APPLY_BGWORKER_MESSAGE,
> +    pq_mq_parallel_leader_backend_id);
> + }
> 
> I think we don't need to send both signals. Here, we can check if this
> is a parallel worker (IsParallelWorker), then send
> PROCSIG_PARALLEL_MESSAGE, otherwise, send
> PROCSIG_APPLY_BGWORKER_MESSAGE message. In the else part, we can have
> an assert to ensure it is an apply bgworker.

Changed.


Attach the new version patch set which addressed the above comments
and comments from Amit[1] and Kuroda-san[2].

As discussed, I also renamed all the "apply background worker" and
related stuff to "apply parallel worker".

[1] https://www.postgresql.org/message-id/CAA4eK1%2B_oHZHoDooAR7QcYD2CeTUWNSwkqVcLWC2iQijAJC4Cg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB58666A97D40AB8919D106AD5F5709%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tues, Aug 24, 2022 16:41 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
> 
> Followings are my comments about v23-0003. Currently I do not have any 
> comments about 0002 and 0004.

Thanks for your comments.

> 09. general
> 
> It seems that logicalrep_rel_mark_parallel_apply() is always called
> when relations are opened on the subscriber-side, but is it really
> needed? These checks are required only for streaming parallel apply,
> so they may not be needed in case of streaming = 'on' or 'off'.

Improved.
This check is only performed when using apply background workers.

> 10. commit message
> 
>     2) There cannot be any non-immutable functions used by the subscriber-side
>     replicated table. Look for functions in the following places:
>     * a. Trigger functions
>     * b. Column default value expressions and domain constraints
>     * c. Constraint expressions
>     * d. Foreign keys
> 
> "Foreign key" should not be listed here because it is not related with 
> the mutability. I think it should be listed as 3), not d..

Improved.

> 11. create_subscription.sgml
> 
> The constraint about foreign key should be described here.
> 
> 11. relation.c
> 
> 11.a
> 
> +       CacheRegisterSyscacheCallback(PROCOID,
> +                                                                 logicalrep_relmap_reset_parallel_cb,
> +                                                                 
> + (Datum) 0);
> 
> Isn't another syscache callback needed for pg_type?
> Users can add constraints via the ALTER DOMAIN command, but the added
> constraint may not be checked.
> I checked AlterDomainAddConstraint(), and it invalidates only the
> relcache for pg_type.
> 
> 11.b
> 
> +               /*
> +                * If the column is of a DOMAIN type, determine whether
> +                * that domain has any CHECK expressions that are not
> +                * immutable.
> +                */
> +               if (get_typtype(att->atttypid) == TYPTYPE_DOMAIN)
> +               {
> 
> I think the default value of *domain* must be also checked here.
> I tested like followings.
> 
> ===
> 1. created a domain that has a default value CREATE DOMAIN tmp INT 
> DEFAULT 1 CHECK (VALUE > 0);
> 
> 2. created a table
> CREATE TABLE foo (id tmp PRIMARY KEY);
> 
> 3. checked pg_attribute and pg_class
> select oid, relname, attname, atthasdef from pg_attribute, pg_class 
> where pg_attribute.attrelid = pg_class.oid and pg_class.relname = 
> 'foo' and attname = 'id';
>   oid  | relname | attname | atthasdef
> -------+---------+---------+-----------
>  16394 | foo     | id      | f
> (1 row)
> 
> It meant that functions might not be checked because the if-statement
> `if (att->atthasdef)` became false.
> ===

Fixed.
In addition, to reduce duplicate validation, only the flag "parallel_apply_safe" is reset
when pg_proc or pg_type changes.

> 12. 015_stream.pl, 016_stream_subxact.pl, 022_twophase_cascade.pl, 
> 023_twophase_stream.pl
> 
> -       my ($node_publisher, $node_subscriber, $appname, $is_parallel) = @_;
> +       my ($node_publisher, $node_subscriber, $appname) = @_;
> 
> Why is the parameter removed? I think the test that waits for the output
> from the apply background worker is meaningful.

Reverted this change.
In addition, I made some modifications to the log messages checked in these test files to
ensure the streamed transactions complete as expected when using an apply background worker.

> 13. 032_streaming_apply.pl
> 
> The filename seems too general because apply background workers are 
> tested in above tests.
> How about "streaming_apply_constraint" or something?

Renamed to 032_streaming_parallel_safety.

Best regards,
Hou zj

On Mon, Aug 29, 2022 at 5:01 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, August 25, 2022 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 11.
> > + /*
> > + * Attach to the message queue.
> > + */
> > + mq = shm_toc_lookup(toc, APPLY_BGWORKER_KEY_ERROR_QUEUE, false);
> > + shm_mq_set_sender(mq, MyProc);
> > + error_mqh = shm_mq_attach(mq, seg, NULL);
> > + pq_redirect_to_shm_mq(seg, error_mqh);
> > +
> > + /*
> > + * Now, we have initialized DSM. Attach to slot.
> > + */
> > + logicalrep_worker_attach(worker_slot);
> > + MyParallelShared->logicalrep_worker_generation =
> > MyLogicalRepWorker->generation;
> > + MyParallelShared->logicalrep_worker_slot_no = worker_slot;
> > +
> > + pq_set_parallel_leader(MyLogicalRepWorker->apply_leader_pid,
> > +    InvalidBackendId);
> >
> > Is there a reason to set parallel_leader immediately after
> > pq_redirect_to_shm_mq() as we are doing parallel.c?
>
> Moved the code.
>

Sorry if I was not clear, but what I wanted was something like the below:

diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 832e99cd48..6646e00658 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -480,6 +480,9 @@ ApplyParallelWorkerMain(Datum main_arg)
        mq = shm_toc_lookup(toc, PARALLEL_APPLY_KEY_ERROR_QUEUE, false);
        shm_mq_set_sender(mq, MyProc);
        error_mqh = shm_mq_attach(mq, seg, NULL);
+       pq_redirect_to_shm_mq(seg, error_mqh);
+       pq_set_parallel_leader(MyLogicalRepWorker->apply_leader_pid,
+                                                  InvalidBackendId);

        /*
         * Primary initialization is complete. Now, we can attach to
our slot. This
@@ -490,10 +493,6 @@ ApplyParallelWorkerMain(Datum main_arg)
        MyParallelShared->logicalrep_worker_generation =
MyLogicalRepWorker->generation;
        MyParallelShared->logicalrep_worker_slot_no = worker_slot;

-       pq_redirect_to_shm_mq(seg, error_mqh);
-       pq_set_parallel_leader(MyLogicalRepWorker->apply_leader_pid,
-                                                  InvalidBackendId);
-
        MyLogicalRepWorker->last_send_time =
MyLogicalRepWorker->last_recv_time =
                MyLogicalRepWorker->reply_time = 0;


Few other comments on v25-0001*
============================
1.
+ {
+ {"max_apply_parallel_workers_per_subscription",
+ PGC_SIGHUP,
+ REPLICATION_SUBSCRIBERS,
+ gettext_noop("Maximum number of apply parallel workers per subscription."),
+ NULL,
+ },
+ &max_apply_parallel_workers_per_subscription,

Let's model this on max_parallel_workers_per_gather and name this
max_parallel_apply_workers_per_subscription.


+typedef struct ApplyParallelWorkerEntry
+{
+ TransactionId xid; /* Hash key -- must be first */
+ ApplyParallelWorkerInfo *winfo;
+} ApplyParallelWorkerEntry;
+
+/* Apply parallel workers hash table (initialized on first use). */
+static HTAB *ApplyParallelWorkersHash = NULL;
+static List *ApplyParallelWorkersFreeList = NIL;
+static List *ApplyParallelWorkersList = NIL;

Similarly, for the above, let's name them ParallelApply*. I think in
comments/doc changes it is better to refer to it as a parallel apply worker.
We can keep the filename as it is.


2.
+ * If there are enough apply parallel workers(reache half of the
+ * max_apply_parallel_workers_per_subscription)

/reache/reached. There should be a space before (.

3.
+ * The dynamic shared memory segment will contain (1) a shm_mq that can be used
+ * to transport errors (and other messages reported via elog/ereport) from the
+ * apply parallel worker to leader apply worker (2) another shm_mq that can
+ * be used to transport changes in the transaction from leader apply worker to
+ * apply parallel worker (3) necessary information to be shared among apply
+ * parallel workers to leader apply worker

I think it is better to use send instead of transport in above
paragraph. In (3), /apply parallel workers to leader apply
worker/apply parallel workers and leader apply worker

4.
handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
{
...
...
+ else if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
+ {
+ parallel_apply_send_data(winfo, s->len, s->data);


It is better to have an Assert for winfo being non-null here and in other
similar usages.
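For example, in the case above, just showing where I'd put the Assert:

    else if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
    {
        Assert(winfo);
        parallel_apply_send_data(winfo, s->len, s->data);
        ...
    }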

-- 
With Regards,
Amit Kapila.



On Tue, Aug 30, 2022 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few other comments on v25-0001*
> ============================
>

Some more comments on v25-0001*:
=============================
1.
+static void
+apply_handle_stream_abort(StringInfo s)
...
...
+ else if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
+ {
+ if (subxid == xid)
+ parallel_apply_replorigin_reset();
+
+ /* Send STREAM ABORT message to the apply parallel worker. */
+ parallel_apply_send_data(winfo, s->len, s->data);
+
+ /*
+ * After sending the data to the apply parallel worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ if (subxid == xid)
+ {
+ parallel_apply_wait_for_free(winfo);
...
...

From this code, it appears that we are waiting for rollbacks to finish
but not doing the same in the rollback to savepoint cases. Is there a
reason for the same? I think we need to wait for rollbacks to avoid
transaction dependency and deadlock issues. Consider the below case:

Consider table t1 (c1 primary key, c2, c3) has a row (1, 2, 3) on both
publisher and subscriber.

Publisher
Session-1
==========
Begin;
...
Delete from t1 where c1 = 1;

Session-2
Begin;
...
insert into t1 values(1, 4, 5); --This will wait for Session-1's
Delete to finish.

Session-1
Rollback;

Session-2
-- The wait will be finished and the insert will be successful.
Commit;

Now, assume both these transactions get streamed and if we didn't wait
for rollback/rollback to savepoint, it is possible that the insert
gets executed before and leads to a constraint violation. This won't
happen in non-parallel mode, so we should wait for rollbacks to
finish.

2. I think we don't need to wait at Rollback Prepared/Commit Prepared
because we wait for prepare to finish in *_stream_prepare function.
That will ensure all the operations in that transaction have happened
in the subscriber, so no concurrent transaction can create deadlock or
transaction dependency issues. If so, I think it is better to explain
this in the comments.

3.
+/* What action to take for the transaction. */
+typedef enum
 {
- LogicalRepMsgType command; /* 0 if invalid */
- LogicalRepRelMapEntry *rel;
+ /* The action for non-streaming transactions. */
+ TA_APPLY_IN_LEADER_WORKER,

- /* Remote node information */
- int remote_attnum; /* -1 if invalid */
- TransactionId remote_xid;
- XLogRecPtr finish_lsn;
- char    *origin_name;
-} ApplyErrorCallbackArg;
+ /* Actions for streaming transactions. */
+ TA_SERIALIZE_TO_FILE,
+ TA_APPLY_IN_PARALLEL_WORKER,
+ TA_SEND_TO_PARALLEL_WORKER
+} TransactionApplyAction;

I think each action needs explanation atop this enum typedef.
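To be clear about the level of detail I have in mind, something like the below
(the wording is only a suggestion):

    /*
     * What action to take for the transaction.
     *
     * TA_APPLY_IN_LEADER_WORKER: a non-streamed transaction, applied directly
     * by the leader apply worker.
     *
     * TA_SERIALIZE_TO_FILE: a streamed transaction with no parallel apply
     * worker assigned, so the leader serializes the changes to a file.
     *
     * TA_APPLY_IN_PARALLEL_WORKER: we are the parallel apply worker and apply
     * the changes ourselves.
     *
     * TA_SEND_TO_PARALLEL_WORKER: the leader forwards the changes to the
     * parallel apply worker assigned to this transaction.
     */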

4.
@@ -1149,24 +1315,14 @@ static void
 apply_handle_stream_start(StringInfo s)
{
...
+ else if (apply_action == TA_SERIALIZE_TO_FILE)
+ {
+ /*
+ * For the first stream start, check if there is any free apply
+ * parallel worker we can use to process this transaction.
+ */
+ if (first_segment)
+ winfo = parallel_apply_start_worker(stream_xid);

- /* open the spool file for this transaction */
- stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+ if (winfo)
+ {
+ /*
+ * If we have found a free worker, then we pass the data to that
+ * worker.
+ */
+ parallel_apply_send_data(winfo, s->len, s->data);

- /* if this is not the first segment, open existing subxact file */
- if (!first_segment)
- subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+ nchanges = 0;

- pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Cache the apply parallel worker for this transaction. */
+ stream_apply_worker = winfo;
+ }
...

This looks odd to me in the sense that even if the action is
TA_SERIALIZE_TO_FILE, we still send the information to the parallel
worker. Won't it be better if we call parallel_apply_start_worker()
for first_segment before checking apply_action with
get_transaction_apply_action(). That way we can avoid this special
case handling.
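In other words, roughly the below (only a sketch, and it assumes
get_transaction_apply_action() also returns the cached winfo):

    if (first_segment)
        parallel_apply_start_worker(stream_xid);

    apply_action = get_transaction_apply_action(stream_xid, &winfo);

    if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
    {
        parallel_apply_send_data(winfo, s->len, s->data);
        nchanges = 0;
        stream_apply_worker = winfo;
    }
    else if (apply_action == TA_SERIALIZE_TO_FILE)
    {
        stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
        if (!first_segment)
            subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
        pgstat_report_activity(STATE_RUNNING, NULL);
    }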

5.
+/*
+ * Struct for sharing information between apply leader apply worker and apply
+ * parallel workers.
+ */
+typedef struct ApplyParallelWorkerShared
+{
+ slock_t mutex;
+
+ bool in_use;
+
+ /* Logical protocol version. */
+ uint32 proto_version;
+
+ TransactionId stream_xid;

Are we using stream_xid passed by the leader in parallel worker? If
so, how? If not, then can we do without this?

6.
+void
+HandleParallelApplyMessages(void)
{
...
+ /* OK to process messages.  Reset the flag saying there are more to do. */
+ ParallelApplyMessagePending = false;

I don't understand the meaning of the second part of the comment.
Shouldn't we say: "Reset the flag saying there is nothing more to
do."? I know you have copied from the other part of the code but there
also I am not sure if it is correct.

7.
+static List *ApplyParallelWorkersFreeList = NIL;
+static List *ApplyParallelWorkersList = NIL;

Do we really need to maintain two different workers' lists? If so,
what is the advantage? I think there won't be many parallel apply
workers, so even if maintain one list and search it, there shouldn't
be any performance impact. I feel maintaining two lists for this
purpose is a bit complex and has more chances of bugs, so we should
try to avoid it if possible.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, August 30, 2022 7:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Aug 30, 2022 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > Few other comments on v25-0001*
> > ============================
> >
> 
> Some more comments on v25-0001*:
> =============================
> 1.
> +static void
> +apply_handle_stream_abort(StringInfo s)
> ...
> ...
> + else if (apply_action == TA_SEND_TO_PARALLEL_WORKER) { if (subxid ==
> + xid) parallel_apply_replorigin_reset();
> +
> + /* Send STREAM ABORT message to the apply parallel worker. */
> + parallel_apply_send_data(winfo, s->len, s->data);
> +
> + /*
> + * After sending the data to the apply parallel worker, wait for
> + * that worker to finish. This is necessary to maintain commit
> + * order which avoids failures due to transaction dependencies and
> + * deadlocks.
> + */
> + if (subxid == xid)
> + {
> + parallel_apply_wait_for_free(winfo);
> ...
> ...
> 
> From this code, it appears that we are waiting for rollbacks to finish but not
> doing the same in the rollback to savepoint cases. Is there a reason for the
> same? I think we need to wait for rollbacks to avoid transaction dependency
> and deadlock issues. Consider the below case:
> 
> Consider table t1 (c1 primary key, c2, c3) has a row (1, 2, 3) on both publisher and
> subscriber.
> 
> Publisher
> Session-1
> ==========
> Begin;
> ...
> Delete from t1 where c1 = 1;
> 
> Session-2
> Begin;
> ...
> insert into t1 values(1, 4, 5); --This will wait for Session-1's Delete to finish.
> 
> Session-1
> Rollback;
> 
> Session-2
> -- The wait will be finished and the insert will be successful.
> Commit;
> 
> Now, assume both these transactions get streamed and if we didn't wait for
> rollback/rollback to savepoint, it is possible that the insert gets executed
> before and leads to a constraint violation. This won't happen in non-parallel
> mode, so we should wait for rollbacks to finish.

Agreed and changed.

> 2. I think we don't need to wait at Rollback Prepared/Commit Prepared
> because we wait for prepare to finish in *_stream_prepare function.
> That will ensure all the operations in that transaction have happened in the
> subscriber, so no concurrent transaction can create deadlock or transaction
> dependency issues. If so, I think it is better to explain this in the comments.

Added some comments about this.

> 3.
> +/* What action to take for the transaction. */ typedef enum
>  {
> - LogicalRepMsgType command; /* 0 if invalid */
> - LogicalRepRelMapEntry *rel;
> + /* The action for non-streaming transactions. */
> + TA_APPLY_IN_LEADER_WORKER,
> 
> - /* Remote node information */
> - int remote_attnum; /* -1 if invalid */
> - TransactionId remote_xid;
> - XLogRecPtr finish_lsn;
> - char    *origin_name;
> -} ApplyErrorCallbackArg;
> + /* Actions for streaming transactions. */  TA_SERIALIZE_TO_FILE,
> +TA_APPLY_IN_PARALLEL_WORKER,  TA_SEND_TO_PARALLEL_WORKER }
> +TransactionApplyAction;
> 
> I think each action needs explanation atop this enum typedef.

Added.

> 4.
> @@ -1149,24 +1315,14 @@ static void
>  apply_handle_stream_start(StringInfo s) { ...
> + else if (apply_action == TA_SERIALIZE_TO_FILE) {
> + /*
> + * For the first stream start, check if there is any free apply
> + * parallel worker we can use to process this transaction.
> + */
> + if (first_segment)
> + winfo = parallel_apply_start_worker(stream_xid);
> 
> - /* open the spool file for this transaction */
> - stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
> + if (winfo)
> + {
> + /*
> + * If we have found a free worker, then we pass the data to that
> + * worker.
> + */
> + parallel_apply_send_data(winfo, s->len, s->data);
> 
> - /* if this is not the first segment, open existing subxact file */
> - if (!first_segment)
> - subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
> + nchanges = 0;
> 
> - pgstat_report_activity(STATE_RUNNING, NULL);
> + /* Cache the apply parallel worker for this transaction. */
> + stream_apply_worker = winfo; }
> ...
> 
> This looks odd to me in the sense that even if the action is
> TA_SERIALIZE_TO_FILE, we still send the information to the parallel
> worker. Won't it be better if we call parallel_apply_start_worker()
> for first_segment before checking apply_action with
> get_transaction_apply_action(). That way we can avoid this special
> case handling.

Changed as suggested.

> 5.
> +/*
> + * Struct for sharing information between apply leader apply worker and apply
> + * parallel workers.
> + */
> +typedef struct ApplyParallelWorkerShared
> +{
> + slock_t mutex;
> +
> + bool in_use;
> +
> + /* Logical protocol version. */
> + uint32 proto_version;
> +
> + TransactionId stream_xid;
> 
> Are we using stream_xid passed by the leader in parallel worker? If
> so, how? If not, then can we do without this?

No, it seems we don't need this. Removed.

> 6.
> +void
> +HandleParallelApplyMessages(void)
> {
> ...
> + /* OK to process messages.  Reset the flag saying there are more to do. */
> + ParallelApplyMessagePending = false;
> 
> I don't understand the meaning of the second part of the comment.
> Shouldn't we say: "Reset the flag saying there is nothing more to
> do."? I know you have copied from the other part of the code but there
> also I am not sure if it is correct.

I feel the comment here is not very helpful, so I removed this.

> 7.
> +static List *ApplyParallelWorkersFreeList = NIL;
> +static List *ApplyParallelWorkersList = NIL;
> 
> Do we really need to maintain two different workers' lists? If so,
> what is the advantage? I think there won't be many parallel apply
> workers, so even if maintain one list and search it, there shouldn't
> be any performance impact. I feel maintaining two lists for this
> purpose is a bit complex and has more chances of bugs, so we should
> try to avoid it if possible.

Agreed, I removed the ApplyParallelWorkersFreeList and reused
ApplyParallelWorkersList in other places.

Attach the new version patch set which addresses the above comments
and the comments from [1].

[1] https://www.postgresql.org/message-id/CAA4eK1%2Be8JsiC8uMZPU25xQRyxNvVS24M4%3DZy-xD18jzX%2BvrmA%40mail.gmail.com

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, August 31, 2022 5:56 PM houzj.fnst@fujitsu.com wrote:
> 
> On Tuesday, August 30, 2022 7:51 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Aug 30, 2022 at 12:12 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > Few other comments on v25-0001*
> > > ============================
> > >
> >
> > Some more comments on v25-0001*:
> > =============================
> > 1.
> > +static void
> > +apply_handle_stream_abort(StringInfo s)
> > ...
> > ...
> > + else if (apply_action == TA_SEND_TO_PARALLEL_WORKER) { if (subxid ==
> > + xid) parallel_apply_replorigin_reset();
> > +
> > + /* Send STREAM ABORT message to the apply parallel worker. */
> > + parallel_apply_send_data(winfo, s->len, s->data);
> > +
> > + /*
> > + * After sending the data to the apply parallel worker, wait for
> > + * that worker to finish. This is necessary to maintain commit
> > + * order which avoids failures due to transaction dependencies and
> > + * deadlocks.
> > + */
> > + if (subxid == xid)
> > + {
> > + parallel_apply_wait_for_free(winfo);
> > ...
> > ...
> >
> > From this code, it appears that we are waiting for rollbacks to finish
> > but not doing the same in the rollback to savepoint cases. Is there a
> > reason for the same? I think we need to wait for rollbacks to avoid
> > transaction dependency and deadlock issues. Consider the below case:
> >
> > Consider table t1 (c1 primary key, c2, c3) has a row (1, 2, 3) on both
> > publisher and subscriber.
> >
> > Publisher
> > Session-1
> > ==========
> > Begin;
> > ...
> > Delete from t1 where c1 = 1;
> >
> > Session-2
> > Begin;
> > ...
> > insert into t1 values(1, 4, 5); --This will wait for Session-1's Delete to finish.
> >
> > Session-1
> > Rollback;
> >
> > Session-2
> > -- The wait will be finished and the insert will be successful.
> > Commit;
> >
> > Now, assume both these transactions get streamed and if we didn't wait
> > for rollback/rollback to savepoint, it is possible that the insert
> > gets executed before and leads to a constraint violation. This won't
> > happen in non-parallel mode, so we should wait for rollbacks to finish.
> 
> Agreed and changed.
> 
> > 2. I think we don't need to wait at Rollback Prepared/Commit Prepared
> > because we wait for prepare to finish in *_stream_prepare function.
> > That will ensure all the operations in that transaction have happened
> > in the subscriber, so no concurrent transaction can create deadlock or
> > transaction dependency issues. If so, I think it is better to explain this in the
> comments.
> 
> Added some comments about this.
> 
> > 3.
> > +/* What action to take for the transaction. */ typedef enum
> >  {
> > - LogicalRepMsgType command; /* 0 if invalid */
> > - LogicalRepRelMapEntry *rel;
> > + /* The action for non-streaming transactions. */
> > + TA_APPLY_IN_LEADER_WORKER,
> >
> > - /* Remote node information */
> > - int remote_attnum; /* -1 if invalid */
> > - TransactionId remote_xid;
> > - XLogRecPtr finish_lsn;
> > - char    *origin_name;
> > -} ApplyErrorCallbackArg;
> > + /* Actions for streaming transactions. */  TA_SERIALIZE_TO_FILE,
> > +TA_APPLY_IN_PARALLEL_WORKER,  TA_SEND_TO_PARALLEL_WORKER }
> > +TransactionApplyAction;
> >
> > I think each action needs explanation atop this enum typedef.
> 
> Added.
> 
> > 4.
> > @@ -1149,24 +1315,14 @@ static void
> >  apply_handle_stream_start(StringInfo s) { ...
> > + else if (apply_action == TA_SERIALIZE_TO_FILE) {
> > + /*
> > + * For the first stream start, check if there is any free apply
> > + * parallel worker we can use to process this transaction.
> > + */
> > + if (first_segment)
> > + winfo = parallel_apply_start_worker(stream_xid);
> >
> > - /* open the spool file for this transaction */
> > - stream_open_file(MyLogicalRepWorker->subid, stream_xid,
> > first_segment);
> > + if (winfo)
> > + {
> > + /*
> > + * If we have found a free worker, then we pass the data to that
> > + * worker.
> > + */
> > + parallel_apply_send_data(winfo, s->len, s->data);
> >
> > - /* if this is not the first segment, open existing subxact file */
> > - if (!first_segment)
> > - subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
> > + nchanges = 0;
> >
> > - pgstat_report_activity(STATE_RUNNING, NULL);
> > + /* Cache the apply parallel worker for this transaction. */
> > + stream_apply_worker = winfo; }
> > ...
> >
> > This looks odd to me in the sense that even if the action is
> > TA_SERIALIZE_TO_FILE, we still send the information to the parallel
> > worker. Won't it be better if we call parallel_apply_start_worker()
> > for first_segment before checking apply_action with
> > get_transaction_apply_action(). That way we can avoid this special
> > case handling.
> 
> Changed as suggested.
> 
> > 5.
> > +/*
> > + * Struct for sharing information between apply leader apply worker
> > +and apply
> > + * parallel workers.
> > + */
> > +typedef struct ApplyParallelWorkerShared {  slock_t mutex;
> > +
> > + bool in_use;
> > +
> > + /* Logical protocol version. */
> > + uint32 proto_version;
> > +
> > + TransactionId stream_xid;
> >
> > Are we using stream_xid passed by the leader in parallel worker? If
> > so, how? If not, then can we do without this?
> 
> No, it seems we don't need this. Removed.
> 
> > 6.
> > +void
> > +HandleParallelApplyMessages(void)
> > {
> > ...
> > + /* OK to process messages.  Reset the flag saying there are more to
> > + do. */ ParallelApplyMessagePending = false;
> >
> > I don't understand the meaning of the second part of the comment.
> > Shouldn't we say: "Reset the flag saying there is nothing more to
> > do."? I know you have copied from the other part of the code but there
> > also I am not sure if it is correct.
> 
> I feel the comment here is not very helpful, so I removed this.
> 
> > 7.
> > +static List *ApplyParallelWorkersFreeList = NIL; static List
> > +*ApplyParallelWorkersList = NIL;
> >
> > Do we really need to maintain two different workers' lists? If so,
> > what is the advantage? I think there won't be many parallel apply
> > workers, so even if maintain one list and search it, there shouldn't
> > be any performance impact. I feel maintaining two lists for this
> > purpose is a bit complex and has more chances of bugs, so we should
> > try to avoid it if possible.
> 
> Agreed, I removed the ApplyParallelWorkersList and reused
> ApplyParallelWorkersList in other places.
> 
> Attach the new version patch set which addressed above comments and
> comments from[1].
> 
> [1]
> https://www.postgresql.org/message-id/CAA4eK1%2Be8JsiC8uMZPU25xQRy
> xNvVS24M4%3DZy-xD18jzX%2BvrmA%40mail.gmail.com

Attach a new version patch set which fixes some typos and some cosmetic things.

Best regards,
Hou zj

Attachment
On Thu, Sep 1, 2022 at 4:53 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Review of v27-0001*:
================
1. I feel the usage of in_remote_transaction and in_use flags is
slightly complex. IIUC, the patch uses in_use flag to ensure commit
ordering by waiting for it to become false before proceeding in
transaction finish commands in leader apply worker. If so, I think it
is better to name it in_parallel_apply_xact and set it to true only
when we start applying xact in parallel apply worker and set it to
false when we finish the xact in parallel apply worker. It can be
initialized to false while setting up DSM. Also, accordingly change
the function parallel_apply_wait_for_free() to
parallel_apply_wait_for_xact_finish and parallel_apply_set_idle to
parallel_apply_set_xact_finish. We can change the name of the
in_remote_transaction flag to in_use.

Please explain about these flags in the struct where they are declared.
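Something like the below is what I have in mind (just an illustration; the
comment wording is only a suggestion):

    /*
     * Is this worker slot being used for some transaction at all?  The
     * leader uses this to find a free worker in the pool.
     */
    bool        in_use;

    /*
     * Is the assigned streaming transaction still being applied in the
     * parallel apply worker?  The leader waits for this to become false
     * before processing transaction finish commands, which preserves the
     * commit order.
     */
    bool        in_parallel_apply_xact;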

2. The worker_id in ParallelApplyWorkerShared struct could have wrong
information after the worker is reused from the pool. Because we could
have removed some other worker from the ParallelApplyWorkersList which
will make the value of worker_id wrong. For error/debug messages, we
can probably use LSN if available or can oid of subscription if
required. I thought of using xid as well but I think it is better to
avoid that in messages as it can wraparound. See, if the patch uses
xid in other messages, it is better to either use it along with LSN or
try to use only LSN.

3.
elog(ERROR, "[Parallel Apply Worker #%u] unexpected message \"%c\"",
+ shared->worker_id, c);

Also, I am not sure whether the above style (use of []) of messages is
good. Did you follow the usage from some other place?

4.
apply_handle_stream_stop(StringInfo s)
{
...
+ if (apply_action == TA_APPLY_IN_PARALLEL_WORKER)
+ {
+ elog(DEBUG1, "[Parallel Apply Worker #%u] ended processing streaming chunk, "
+ "waiting on shm_mq_receive", MyParallelShared->worker_id);
...

I don't understand the relevance of "waiting on shm_mq_receive" in the
above message because AFAICS, here we are not waiting on any receive
call.

5. I suggest you please go through all the ERROR/LOG/DEBUG messages in
the patch and try to improve them based on the above comments.

6.
+ * The dynamic shared memory segment will contain (1) a shm_mq that can be used
+ * to send errors (and other messages reported via elog/ereport) from the
+ * parallel apply worker to leader apply worker (2) another shm_mq that can be
+ * used to send changes in the transaction from leader apply worker to parallel
+ * apply worker

Here, it would be better to switch (1) and (2). I feel it is better to
explain first about how the main apply information is exchanged among
workers.

7.
+ /* Try to get a free parallel apply worker. */
+ foreach(lc, ParallelApplyWorkersList)
+ {
+ ParallelApplyWorkerInfo *tmp_winfo;
+
+ tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
+
+ if (tmp_winfo->error_mq_handle == NULL)
+ {
+ /*
+ * Release the worker information and try next one if the parallel
+ * apply worker exited cleanly.
+ */
+ ParallelApplyWorkersList =
foreach_delete_current(ParallelApplyWorkersList, lc);
+ shm_mq_detach(tmp_winfo->mq_handle);
+ dsm_detach(tmp_winfo->dsm_seg);
+ pfree(tmp_winfo);
+
+ continue;
+ }
+
+ if (!tmp_winfo->in_remote_transaction)
+ {
+ winfo = tmp_winfo;
+ break;
+ }
+ }

Can we write it as if ... else if? If so, then we don't need to
continue in the first loop. And, can we add some more comments to
explain these cases?
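i.e. something like the below (just a sketch):

    foreach(lc, ParallelApplyWorkersList)
    {
        ParallelApplyWorkerInfo *tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

        if (tmp_winfo->error_mq_handle == NULL)
        {
            /*
             * The parallel apply worker exited cleanly, so release its
             * information and keep looking for a free worker.
             */
            ParallelApplyWorkersList =
                foreach_delete_current(ParallelApplyWorkersList, lc);
            shm_mq_detach(tmp_winfo->mq_handle);
            dsm_detach(tmp_winfo->dsm_seg);
            pfree(tmp_winfo);
        }
        else if (!tmp_winfo->in_remote_transaction)
        {
            /* Found an idle worker that is still attached, so reuse it. */
            winfo = tmp_winfo;
            break;
        }
    }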

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, September 2, 2022 2:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Sep 1, 2022 at 4:53 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> 
> Review of v27-0001*:

Thanks for the comments.

> ================
> 1. I feel the usage of in_remote_transaction and in_use flags is slightly complex.
> IIUC, the patch uses in_use flag to ensure commit ordering by waiting for it to
> become false before proceeding in transaction finish commands in leader
> apply worker. If so, I think it is better to name it in_parallel_apply_xact and set it
> to true only when we start applying xact in parallel apply worker and set it to
> false when we finish the xact in parallel apply worker. It can be initialized to false
> while setting up DSM. Also, accordingly change the function
> parallel_apply_wait_for_free() to parallel_apply_wait_for_xact_finish and
> parallel_apply_set_idle to parallel_apply_set_xact_finish. We can change the
> name of the in_remote_transaction flag to in_use.

Agreed. One thing I found when addressing this is that there could be a race
condition if we want to set the flag in parallel apply worker:

where the leader has already started waiting for the parallel apply worker to
finish processing the transaction(set the in_parallel_apply_xact to false)
while the child process has not yet processed the first STREAM_START and has
not set the in_parallel_apply_xact to true.

> Please explain about these flags in the struct where they are declared.
> 
> 2. The worker_id in ParallelApplyWorkerShared struct could have wrong
> information after the worker is reused from the pool. Because we could have
> removed some other worker from the ParallelApplyWorkersList which will
> make the value of worker_id wrong. For error/debug messages, we can
> probably use LSN if available or can oid of subscription if required. I thought of
> using xid as well but I think it is better to avoid that in messages as it can
> wraparound. See, if the patch uses xid in other messages, it is better to either
> use it along with LSN or try to use only LSN.
> 3.
> elog(ERROR, "[Parallel Apply Worker #%u] unexpected message \"%c\"",
> + shared->worker_id, c);
> 
> Also, I am not sure whether the above style (use of []) of messages is good. Did
> you follow the usage from some other place?
> 4.
> apply_handle_stream_stop(StringInfo s)
> {
> ...
> + if (apply_action == TA_APPLY_IN_PARALLEL_WORKER) { elog(DEBUG1,
> + "[Parallel Apply Worker #%u] ended processing streaming chunk, "
> + "waiting on shm_mq_receive", MyParallelShared->worker_id);
> ...
> 
> I don't understand the relevance of "waiting on shm_mq_receive" in the
> above message because AFAICS, here we are not waiting on any receive
> call.
> 
> 5. I suggest you please go through all the ERROR/LOG/DEBUG messages in
> the patch and try to improve them based on the above comments.

I removed the worker_id and also removed or improved some DEBUG/ERROR
messages which I think are not clear or for which we don't have a similar message in the existing code.

> 6.
> + * The dynamic shared memory segment will contain (1) a shm_mq that can be
> used
> + * to send errors (and other messages reported via elog/ereport) from the
> + * parallel apply worker to leader apply worker (2) another shm_mq that can
> be
> + * used to send changes in the transaction from leader apply worker to parallel
> + * apply worker
> 
> Here, it would be better to switch (1) and (2). I feel it is better to
> explain first about how the main apply information is exchanged among
> workers.

Exchanged.

> 7.
> + /* Try to get a free parallel apply worker. */
> + foreach(lc, ParallelApplyWorkersList)
> + {
> + ParallelApplyWorkerInfo *tmp_winfo;
> +
> + tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
> +
> + if (tmp_winfo->error_mq_handle == NULL)
> + {
> + /*
> + * Release the worker information and try next one if the parallel
> + * apply worker exited cleanly.
> + */
> + ParallelApplyWorkersList =
> foreach_delete_current(ParallelApplyWorkersList, lc);
> + shm_mq_detach(tmp_winfo->mq_handle);
> + dsm_detach(tmp_winfo->dsm_seg);
> + pfree(tmp_winfo);
> +
> + continue;
> + }
> +
> + if (!tmp_winfo->in_remote_transaction)
> + {
> + winfo = tmp_winfo;
> + break;
> + }
> + }
> 
> Can we write it as if ... else if? If so, then we don't need to
> continue in the first loop. And, can we add some more comments to
> explain these cases?

Changed.


Attach the new version patch set which addresses the above comments and
also fixes another problem when subscribing to a lower version publisher.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, September 5, 2022 8:41 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Friday, September 2, 2022 2:10 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Sep 1, 2022 at 4:53 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > Review of v27-0001*:
> 
> Thanks for the comments.
> 
> > ================
> > 1. I feel the usage of in_remote_transaction and in_use flags is slightly complex.
> > IIUC, the patch uses in_use flag to ensure commit ordering by waiting
> > for it to become false before proceeding in transaction finish
> > commands in leader apply worker. If so, I think it is better to name
> > it in_parallel_apply_xact and set it to true only when we start
> > applying xact in parallel apply worker and set it to false when we
> > finish the xact in parallel apply worker. It can be initialized to
> > false while setting up DSM. Also, accordingly change the function
> > parallel_apply_wait_for_free() to parallel_apply_wait_for_xact_finish
> > and parallel_apply_set_idle to parallel_apply_set_xact_finish. We can
> > change the name of the in_remote_transaction flag to in_use.
> 
> Agreed. One thing I found when addressing this is that there could be a race
> condition if we want to set the flag in parallel apply worker:
> 
> where the leader has already started waiting for the parallel apply worker to
> finish processing the transaction(set the in_parallel_apply_xact to false) while the
> child process has not yet processed the first STREAM_START and has not set the
> in_parallel_apply_xact to true.

Sorry, I didn't complete this sentence. I meant that it is safer to set this flag in the
leader apply worker, so I changed the code accordingly and added some comments to explain it.

...
> 
> Attach the new version patch set which addressed above comments and also
> fixed another problem while subscriber to a low version publisher.

Attached the correct patch set this time.

Best regards,
Hou zj

Attachment
On Mon, Sep 5, 2022 at 6:34 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the correct patch set this time.
>

Few comments on v28-0001*:
=======================
1.
+ /* Whether the worker is processing a transaction. */
+ bool in_use;

I think this same comment applies to in_parallel_apply_xact flag as
well. How about: "Indicates whether the worker is available to be used
for parallel apply transaction?"?

2.
+ /*
+ * Set this flag in the leader instead of the parallel apply worker to
+ * avoid the race condition where the leader has already started waiting
+ * for the parallel apply worker to finish processing the transaction(set
+ * the in_parallel_apply_xact to false) while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * in_parallel_apply_xact to true.

I think part of this comment "(set the in_parallel_apply_xact to
false)" is not necessary. It will be clear without that.

3.
+ /* Create entry for requested transaction. */
+ entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_ENTER, &found);
+ if (found)
+ elog(ERROR, "hash table corrupted");
...
...
+ hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL);

It is better to have a similar elog for the HASH_REMOVE case as well; we
normally have such an elog for HASH_REMOVE.
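
For example, something along these lines (just a sketch of the usual
pattern, not taken from the patch):

if (!hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL))
    elog(ERROR, "hash table corrupted");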

4.
* Parallel apply is not supported when subscribing to a publisher which
+     * cannot provide the abort_time, abort_lsn and the column information used
+     * to verify the parallel apply safety.


In this comment, which column information are you referring to?

5.
+ /*
+ * Set in_parallel_apply_xact to true again as we only aborted the
+ * subtransaction and the top transaction is still in progress. No
+ * need to lock here because currently only the apply leader are
+ * accessing this flag.
+ */
+ winfo->shared->in_parallel_apply_xact = true;

This theory sounds good to me but I think it is better to update/read
this flag under the spinlock, as the patch does in a few other places.
I think that will make the code easier to follow without worrying too
much about such special cases. There are a few asserts as well that
read this without the lock; it would be better to change those too.
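
For illustration, a minimal sketch of updating the flag under the
spinlock (assuming the mutex field shown elsewhere in the patch):

SpinLockAcquire(&winfo->shared->mutex);
winfo->shared->in_parallel_apply_xact = true;
SpinLockRelease(&winfo->shared->mutex);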

6.
+ * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol version
+ * with support for streaming large transactions using parallel apply
+ * workers. Introduced in PG16.

How about changing it to something like:
"LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum protocol
version where we support applying large streaming transactions in
parallel. Introduced in PG16."

7.
+ PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ bool write_abort_lsn = (data->protocol_version >=
+ LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM);

  /*
  * The abort should happen outside streaming block, even for streamed
@@ -1856,7 +1859,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
  Assert(rbtxn_is_streamed(toptxn));

  OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+ logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn, abort_lsn,
+   write_abort_lsn);

I think we need to send additional information if the client has used
the parallel streaming option. Also, let's keep sending subxid as we
were doing previously and add the additional parameters required. It
may be better to name write_abort_lsn as abort_info.

8.
+ /*
+ * Check whether the publisher sends abort_lsn and abort_time.
+ *
+ * Note that the paralle apply worker is only started when the publisher
+ * sends abort_lsn and abort_time.
+ */
+ if (am_parallel_apply_worker() ||
+ walrcv_server_version(LogRepWorkerWalRcvConn) >= 160000)
+ read_abort_lsn = true;
+
+ logicalrep_read_stream_abort(s, &abort_data, read_abort_lsn);

This check should match the check for the write operation where we
are checking the protocol version as well. There is a typo as well
in the comments (/paralle/parallel).


-- 
With Regards,
Amit Kapila.



On Thu, Sep 8, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 5, 2022 at 6:34 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the correct patch set this time.
> >
>
> Few comments on v28-0001*:
> =======================
>

Some suggestions for comments in v28-0001*
1.
+/*
+ * Entry for a hash table we use to map from xid to the parallel apply worker
+ * state.
+ */
+typedef struct ParallelApplyWorkerEntry

Let's change this comment to: "Hash table entry to map xid to the
parallel apply worker state."

2.
+/*
+ * List that stores the information of parallel apply workers that were
+ * started. Newly added worker information will be removed from the list at the
+ * end of the transaction when there are enough workers in the pool. Besides,
+ * exited workers will be removed from the list after being detected.
+ */
+static List *ParallelApplyWorkersList = NIL;

Can we change this to: "A list to maintain the active parallel apply
workers. The information for the new worker is added to the list after
successfully launching it. The list entry is removed at the end of the
transaction if there are already enough workers in the worker pool.
For more information about the worker pool, see comments atop
worker.c. We also remove the entry from the list if the worker is
exited due to some error."

Apart from this, I have added/changed a few other comments in
v28-0001*. Kindly check the attached, if you are fine with it then
please include it in the next version.

-- 
With Regards,
Amit Kapila.

Attachment
Here are my review comments for the v28-0001 patch:

(There may be some overlap with other people's review comments and/or
some fixes already made).

======

1. Commit Message

In addition, the patch extends the logical replication STREAM_ABORT message so
that abort_time and abort_lsn can also be sent which can be used to update the
replication origin in parallel apply worker when the streaming transaction is
aborted.

~

Should this also mention that, because this message extension is needed
to support parallel streaming, parallel streaming is not supported for
publications on servers < PG16?

======

2. doc/src/sgml/config.sgml

        <para>
         Specifies maximum number of logical replication workers. This includes
-        both apply workers and table synchronization workers.
+        apply leader workers, parallel apply workers, and table synchronization
+        workers.
        </para>
"apply leader workers" -> "leader apply workers"

~~~

3.

max_logical_replication_workers (integer)
    Specifies maximum number of logical replication workers. This
includes apply leader workers, parallel apply workers, and table
synchronization workers.
    Logical replication workers are taken from the pool defined by
max_worker_processes.
    The default value is 4. This parameter can only be set at server start.

~

I did not really understand why the default is 4. The default number of
tablesync workers is 2 and the default number of parallel apply workers
is 2, but what about accounting for the apply worker itself? Shouldn't
the max_logical_replication_workers default be 5 instead of 4?

======

4. src/backend/commands/subscriptioncmds.c - defGetStreamingMode

+ }
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("%s requires a Boolean value or \"parallel\"",
+ def->defname)));
+ return SUBSTREAM_OFF; /* keep compiler quiet */
+}

Some whitespace before the ereport and the return might be tidier.

======

5. src/backend/libpq/pqmq.c

+ {
+ if (IsParallelWorker())
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ else
+ {
+ Assert(IsLogicalParallelApplyWorker());
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_APPLY_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ }
+ }

This code can be simplified if you want to. For example,

{
ProcSignalReason reason;
Assert(IsParallelWorker() || IsLogicalParallelApplyWorker());
reason = IsParallelWorker() ? PROCSIG_PARALLEL_MESSAGE :
PROCSIG_PARALLEL_APPLY_MESSAGE;
SendProcSignal(pq_mq_parallel_leader_pid, reason,
   pq_mq_parallel_leader_backend_id);
}

======

6. src/backend/replication/logical/applyparallelworker.c

Is there a reason why this file is called applyparallelworker.c
instead of parallelapplyworker.c? Now this name is out of step with
names of all the new typedefs etc.

~~~

7.

+/*
+ * There are three fields in each message received by parallel apply worker:
+ * start_lsn, end_lsn and send_time. Because we have updated these statistics
+ * in leader apply worker, we could ignore these fields in parallel apply
+ * worker (see function LogicalRepApplyLoop).
+ */
+#define SIZE_STATS_MESSAGE (2 * sizeof(XLogRecPtr) + sizeof(TimestampTz))

SUGGESTION (Just added the word "the" and changed "could" -> "can")
There are three fields in each message received by the parallel apply
worker: start_lsn, end_lsn and send_time. Because we have updated
these statistics in the leader apply worker, we can ignore these
fields in the parallel apply worker (see function
LogicalRepApplyLoop).

~~~

8.

+/*
+ * List that stores the information of parallel apply workers that were
+ * started. Newly added worker information will be removed from the list at the
+ * end of the transaction when there are enough workers in the pool. Besides,
+ * exited workers will be removed from the list after being detected.
+ */
+static List *ParallelApplyWorkersList = NIL;

Perhaps this comment can give more explanation of what is meant by the
part that says "when there are enough workers in the pool".

~~~

9. src/backend/replication/logical/applyparallelworker.c -
parallel_apply_can_start

+ /*
+ * Don't start a new parallel worker if not in streaming parallel mode.
+ */
+ if (MySubscription->stream != SUBSTREAM_PARALLEL)
+ return false;

"streaming parallel mode." -> "parallel streaming mode."

~~~

10.

+ /*
+ * For streaming transactions that are being applied using parallel apply
+ * worker, we cannot decide whether to apply the change for a relation that
+ * is not in the READY state (see should_apply_changes_for_rel) as we won't
+ * know remote_final_lsn by that time. So, we don't start the new parallel
+ * apply worker in this case.
+ */
+ if (!AllTablesyncsReady())
+ return false;

"using parallel apply worker" -> "using a parallel apply worker"

~~~

11.

+ /*
+ * Do not allow parallel apply worker to be started in the parallel apply
+ * worker.
+ */
+ if (am_parallel_apply_worker())
+ return false;

I guess the comment is valid but it sounds strange.

SUGGESTION
Only leader apply workers can start parallel apply workers.

~~~

12.

+ if (am_parallel_apply_worker())
+ return false;

Maybe this code should be earlier in this function, because surely
this is a less costly test than the test for !AllTablesyncsReady()?

~~~

13. src/backend/replication/logical/applyparallelworker.c -
parallel_apply_start_worker

+/*
+ * Start a parallel apply worker that will be used for the specified xid.
+ *
+ * If a parallel apply worker is not in use then re-use it, otherwise start a
+ * fresh one. Cache the worker information in ParallelApplyWorkersHash keyed by
+ * the specified xid.
+ */

"is not in use" -> "is found but not in use" ?

~~~

14.

+ /* Failed to start a new parallel apply worker. */
+ if (winfo == NULL)
+ return;

There seem to be quite a lot of places (like this example) where
something may go wrong and the behaviour apparently just silently
falls back to using the non-parallel streaming. Maybe that is OK, but
I am just wondering how the user can ever know this has happened.
Maybe the docs can mention that this could happen and give some
description of what processes users can look for (or some other
strategy) so they can confirm that the parallel streaming is really
working as they assume it to be?

~~~

15.

+ * Set this flag in the leader instead of the parallel apply worker to
+ * avoid the race condition where the leader has already started waiting
+ * for the parallel apply worker to finish processing the transaction(set
+ * the in_parallel_apply_xact to false) while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * in_parallel_apply_xact to true.

Missing whitespace before "("

~~~

16. src/backend/replication/logical/applyparallelworker.c -
parallel_apply_find_worker

+ /* Return the cached parallel apply worker if valid. */
+ if (stream_apply_worker != NULL)
+ return stream_apply_worker;

Perhaps 'cur_stream_parallel_apply_winfo' is a better name for this var?

~~~

17. src/backend/replication/logical/applyparallelworker.c -
parallel_apply_free_worker

+/*
+ * Remove the parallel apply worker entry from the hash table. And stop the
+ * worker if there are enough workers in the pool.
+ */
+void
+parallel_apply_free_worker(ParallelApplyWorkerInfo *winfo, TransactionId xid)

I think the reason for doing the "enough workers in the pool" logic
needs some more explanation.

~~~

18.

+ if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+ {
+ logicalrep_worker_stop_by_slot(winfo->shared->logicalrep_worker_slot_no,
+    winfo->shared->logicalrep_worker_generation);
+
+ ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList, winfo);
+
+ shm_mq_detach(winfo->mq_handle);
+ shm_mq_detach(winfo->error_mq_handle);
+ dsm_detach(winfo->dsm_seg);
+ pfree(winfo);
+ }
+ else
+ winfo->in_use = false;

Maybe it is easier to remove this "else" and just unconditionally set
winfo->in_use = false BEFORE the check that frees the entire winfo.
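
For illustration, a rough sketch of that restructuring (based on the
quoted code, not the actual patch):

winfo->in_use = false;

/* Stop the worker if there are already enough workers in the pool. */
if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
{
    logicalrep_worker_stop_by_slot(winfo->shared->logicalrep_worker_slot_no,
                                   winfo->shared->logicalrep_worker_generation);

    ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList, winfo);

    shm_mq_detach(winfo->mq_handle);
    shm_mq_detach(winfo->error_mq_handle);
    dsm_detach(winfo->dsm_seg);
    pfree(winfo);
}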

~~~

19. src/backend/replication/logical/applyparallelworker.c -
LogicalParallelApplyLoop

+ ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+ "ApplyMessageContext",
+ ALLOCSET_DEFAULT_SIZES);

Should the name of this context be "ParallelApplyMessageContext"?

~~~

20. src/backend/replication/logical/applyparallelworker.c -
HandleParallelApplyMessage

+ default:
+ {
+ elog(ERROR, "unrecognized message type received from parallel apply
worker: %c (message length %d bytes)",
+ msgtype, msg->len);
+ }

"received from" -> "received by"

~~~


21. src/backend/replication/logical/applyparallelworker.c -
HandleParallelApplyMessages

+/*
+ * Handle any queued protocol messages received from parallel apply workers.
+ */
+void
+HandleParallelApplyMessages(void)

21a.
"received from" -> "received by"

~

21b.
I wonder if this comment should give some credit to the function in
parallel.c - because this seems almost a copy of all that code.

~~~

22. src/backend/replication/logical/applyparallelworker.c -
parallel_apply_set_xact_finish

+/*
+ * Set the in_parallel_apply_xact flag for the current parallel apply worker.
+ */
+void
+parallel_apply_set_xact_finish(void)

Should that "Set" really be saying "Reset" or "Clear"?

======

23. src/backend/replication/logical/launcher.c - logicalrep_worker_launch

+ nparallelapplyworkers = logicalrep_parallel_apply_worker_count(subid);
+
+ /*
+ * Return silently if the number of parallel apply workers reached the
+ * limit per subscription.
+ */
+ if (is_subworker && nparallelapplyworkers >=
max_parallel_apply_workers_per_subscription)
+ {
+ LWLockRelease(LogicalRepWorkerLock);
+ return false;
  }
I'm not sure whether it is a good idea to be so silent. How will the user
know if they should increase the GUC parameter if it never tells them
that the value is too low?

~~~

24.

  /* Now wait until it attaches. */
- WaitForReplicationWorkerAttach(worker, generation, bgw_handle);
+ return WaitForReplicationWorkerAttach(worker, generation, bgw_handle);

The comment feels a tiny bit misleading, because there is a chance
that this might not attach at all and return false if something goes
wrong.

~~~

25. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

+void
+logicalrep_worker_stop_by_slot(int slot_no, uint16 generation)
+{
+ LogicalRepWorker *worker = &LogicalRepCtx->workers[slot_no];
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+ /* Return if the generation doesn't match or the worker is not alive. */
+ if (worker->generation != generation ||
+ worker->proc == NULL)
+ return;
+
+ logicalrep_worker_stop_internal(worker);
+
+ LWLockRelease(LogicalRepWorkerLock);
+}

I think this condition should be changed and reversed; otherwise, you
might return without releasing the lock (??)

SUGGESTION

{
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

/* Stop only if the worker is alive and the generation matches. */
if (worker && worker->proc && worker->generation == generation)
logicalrep_worker_stop_internal(worker);

LWLockRelease(LogicalRepWorkerLock);
}

~~~

26 src/backend/replication/logical/launcher.c - logicalrep_worker_stop_internal

+/*
+ * Workhorse for logicalrep_worker_stop() and logicalrep_worker_detach(). Stop
+ * the worker and wait for it to die.
+ */

... and logicalrep_worker_stop_by_slot()

~~~

27. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

+ /*
+ * This is the leader apply worker; stop all the parallel apply workers
+ * previously started from here.
+ */
+ if (!isParallelApplyWorker(MyLogicalRepWorker))

27a.
The comment does not match the code. If this *is* the leader apply
worker then why do we have the condition to check that?

Maybe only needs a comment update like

SUGGESTION
If this is the leader apply worker then stop all the parallel...

~

27b.
The code also seems to assume it cannot be a tablesync worker, but it is
not checking that. I am wondering if it would be better to have yet
another macro/inline isLeaderApplyWorker() that makes sure this really
is the leader apply worker. (This review comment suggestion is repeated
later below.)

======

28. src/backend/replication/logical/worker.c - STREAMED TRANSACTIONS comment

+ * If no worker is available to handle the streamed transaction, the data is
+ * written to temporary files and then applied at once when the final commit
+ * arrives.

SUGGESTION
If streaming = true, or if streaming = parallel but there are no
parallel apply workers available to handle the streamed transaction,
the data is written to...

~~~

29. src/backend/replication/logical/worker.c - TransactionApplyAction

/*
 * What action to take for the transaction.
 *
 * TA_APPLY_IN_LEADER_WORKER means that we are in the leader apply worker and
 * changes of the transaction are applied directly in the worker.
 *
 * TA_SERIALIZE_TO_FILE means that we are in leader apply worker and changes
 * are written to temporary files and then applied when the final commit
 * arrives.
 *
 * TA_APPLY_IN_PARALLEL_WORKER means that we are in the parallel apply worker
 * and changes of the transaction are applied directly in the worker.
 *
 * TA_SEND_TO_PARALLEL_WORKER means that we are in the leader apply worker and
 * need to send the changes to the parallel apply worker.
 */
typedef enum
{
/* The action for non-streaming transactions. */
TA_APPLY_IN_LEADER_WORKER,

/* Actions for streaming transactions. */
TA_SERIALIZE_TO_FILE,
TA_APPLY_IN_PARALLEL_WORKER,
TA_SEND_TO_PARALLEL_WORKER
} TransactionApplyAction;

~

29a.
I think if you change all those enum names slightly (e.g. like below)
then they can be more self-explanatory:

TA_NOT_STREAMING_LEADER_APPLY
TA_STREAMING_LEADER_SERIALIZE
TA_STREAMING_LEADER_SEND_TO_PARALLEL
TA_STREAMING_PARALLEL_APPLY

~

29b.
 * TA_APPLY_IN_LEADER_WORKER means that we are in the leader apply worker and
 * changes of the transaction are applied directly in the worker.

Maybe that should mention this is for the non-streaming case; or, if
you change all the enum names like in 29a, then there is no need
because they are more self-explanatory.

~~~

30. src/backend/replication/logical/worker.c - should_apply_changes_for_rel

 * Note that for streaming transactions that are being applied in parallel
+ * apply worker, we disallow applying changes on a table that is not in
+ * the READY state, because we cannot decide whether to apply the change as we
+ * won't know remote_final_lsn by that time.

"applied in parallel apply worker" -> "applied in the parallel apply worker"

~~~

31.

+ errdetail("Cannot handle streamed replication transaction by parallel "
+    "apply workers until all tables are synchronized.")));

"by parallel apply workers" -> "using parallel apply workers" (?)

~~~

32. src/backend/replication/logical/worker.c - handle_streamed_transaction

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).
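
For example, a sketch of the general shape this could take (case bodies
reduced to comments here; this is illustrative only, not the actual patch):

switch (apply_action)
{
    case TA_APPLY_IN_LEADER_WORKER:
        /* Non-streaming transaction: apply the change directly. */
        break;

    case TA_SERIALIZE_TO_FILE:
        /* Write the change to the per-transaction temporary file. */
        break;

    case TA_SEND_TO_PARALLEL_WORKER:
        /* Forward the change to the assigned parallel apply worker. */
        break;

    case TA_APPLY_IN_PARALLEL_WORKER:
        /* We are the parallel apply worker: apply the change directly. */
        break;
}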

~~~

33. src/backend/replication/logical/worker.c - apply_handle_stream_prepare

(same as comment #32)

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).

~~~

34. src/backend/replication/logical/worker.c - apply_handle_stream_start

(same as comment #32)

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).

~~~

35.

+ else if (apply_action == TA_SERIALIZE_TO_FILE)
+ {
+ /*
+ * Notify handle methods we're processing a remote in-progress
+ * transaction.
+ */
+ in_streamed_transaction = true;
+
+ /*
+ * Since no parallel apply worker is available for the first
+ * stream start, serialize all the changes of the transaction.
+ *

"Since no parallel apply worker is available".

I don't think the comment is quite correct. Maybe it is doing the
serialization because the user simply did not request to use the
parallel mode at all?

~~~

36. src/backend/replication/logical/worker.c - apply_handle_stream_stop

(same as comment #32)

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).

~~~

37. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /*
+ * Check whether the publisher sends abort_lsn and abort_time.
+ *
+ * Note that the paralle apply worker is only started when the publisher
+ * sends abort_lsn and abort_time.
+ */

typo "paralle"

~~~

38.

(same as comment #32)

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).

~~~

39.

+ /*
+ * Set in_parallel_apply_xact to true again as we only aborted the
+ * subtransaction and the top transaction is still in progress. No
+ * need to lock here because currently only the apply leader are
+ * accessing this flag.
+ */

"are accessing" -> "is accessing"

~~~

40. src/backend/replication/logical/worker.c - apply_handle_stream_commit

(same as comment #32)

Now that there is an apply_action enum I felt it is better for this
code to be using a switch instead of all the if/else. Furthermore, it
might be better to put the switch case in a logical order (e.g. same
as the suggested enums value order of #29a).

~~~

41. src/backend/replication/logical/worker.c - store_flush_position

+ /* Skip if not the leader apply worker */
+ if (am_parallel_apply_worker())
+ return;
+

It might be better to implement/use a new function here so the code can
check something like !am_leader_apply_worker().

~~~

42. src/backend/replication/logical/worker.c - InitializeApplyWorker

+/*
+ * Initialize the database connection, in-memory subscription and necessary
+ * config options.
+ */

I still think this should mention that this is common initialization
code for "both leader apply workers, and parallel apply workers"

~~~

43. src/backend/replication/logical/worker.c - ApplyWorkerMain

- /* This is main apply worker */
+ /* This is leader apply worker */

"is leader" -> "is the leader"

~~~

44. src/backend/replication/logical/worker.c - IsLogicalParallelApplyWorker

+/*
+ * Is current process a logical replication parallel apply worker?
+ */
+bool
+IsLogicalParallelApplyWorker(void)
+{
+ return am_parallel_apply_worker();
+}
+

It seems a bit strange to have this function
IsLogicalParallelApplyWorker and also am_parallel_apply_worker(), which
are basically identical except that one of them is static and the other
is not.

I wonder if there should be just one function. And if you really do
need 2 names for consistency then you can just define a synonym like

#define am_parallel_apply_worker IsLogicalParallelApplyWorker

~~~

45. src/backend/replication/logical/worker.c - get_transaction_apply_action

+/*
+ * Return the action to take for the given transaction. Also return the
+ * parallel apply worker information if the action is
+ * TA_SEND_TO_PARALLEL_WORKER.
+ */
+static TransactionApplyAction
+get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo)

I think this should be slightly more clear to say that *winfo is
assigned to the destination parallel worker info (if the action is
TA_SEND_TO_PARALLEL_WORKER), otherwise *winfo is assigned NULL (see
also #46 below)

~~~

46.

+static TransactionApplyAction
+get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo)
+{
+ if (am_parallel_apply_worker())
+ return TA_APPLY_IN_PARALLEL_WORKER;
+ else if (in_remote_transaction)
+ return TA_APPLY_IN_LEADER_WORKER;
+
+ /*
+ * Check if we are processing this transaction using a parallel apply
+ * worker and if so, send the changes to that worker.
+ */
+ else if ((*winfo = parallel_apply_find_worker(xid)))
+ return TA_SEND_TO_PARALLEL_WORKER;
+ else
+ return TA_SERIALIZE_TO_FILE;
+}

The code is a bit quirky at the moment because sometimes *winfo will be
assigned NULL, sometimes it will be assigned a valid value, and
sometimes it will still be unassigned.

I suggest always assigning it either NULL or a valid value.

SUGGESTIONS
static TransactionApplyAction
get_transaction_apply_action(TransactionId xid, ParallelApplyWorkerInfo **winfo)
{
*winfo = NULL; <== add this default assignment
...

======

47. src/backend/storage/ipc/procsignal.c - procsignal_sigusr1_handler

@@ -657,6 +658,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
  if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT))
  HandleLogMemoryContextInterrupt();

+ if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
+ HandleParallelApplyMessageInterrupt();
+

I wasn't sure about the placement of this new code because those
CheckProcSignal calls don't seem to have any particular order. I think
this belongs adjacent to the PROCSIG_PARALLEL_MESSAGE check since it has
the most in common with that one.

======

48. src/backend/tcop/postgres.c

@@ -3377,6 +3377,9 @@ ProcessInterrupts(void)

  if (LogMemoryContextPending)
  ProcessLogMemoryContextInterrupt();
+
+ if (ParallelApplyMessagePending)
+ HandleParallelApplyMessages();

(like #47)

I think this belongs adjacent to the ParallelMessagePending check
since it has the most in common with that one.

======

49. src/include/replication/worker_internal.h

@@ -60,6 +64,12 @@ typedef struct LogicalRepWorker
  */
  FileSet    *stream_fileset;

+ /*
+ * PID of leader apply worker if this slot is used for a parallel apply
+ * worker, InvalidPid otherwise.
+ */
+ pid_t apply_leader_pid;
+
  /* Stats. */
  XLogRecPtr last_lsn;
  TimestampTz last_send_time;
Whitespace indent of the new member ok?


~~~

50.

+typedef struct ParallelApplyWorkerShared
+{
+ slock_t mutex;
+
+ /*
+ * Flag used to ensure commit ordering.
+ *
+ * The parallel apply worker will set it to false after handling the
+ * transaction finish commands while the apply leader will wait for it to
+ * become false before proceeding in transaction finish commands (e.g.
+ * STREAM_COMMIT/STREAM_ABORT/STREAM_PREPARE).
+ */
+ bool in_parallel_apply_xact;
+
+ /* Information from the corresponding LogicalRepWorker slot. */
+ uint16 logicalrep_worker_generation;
+
+ int logicalrep_worker_slot_no;
+} ParallelApplyWorkerShared;

Whitespace indents of the new members ok?

~~~

51.

 /* Main memory context for apply worker. Permanent during worker lifetime. */
 extern PGDLLIMPORT MemoryContext ApplyContext;
+extern PGDLLIMPORT MemoryContext ApplyMessageContext;

Maybe there should be a blank line between those externs, because the
comment applies only to the first one, right? Alternatively modify the
comment.

~~~

52. src/include/replication/worker_internal.h - am_parallel_apply_worker

I thought it might be worthwhile to also add another function like
am_leader_apply_worker(). I noticed at least one place in this patch
where it could have been called.

SUGGESTION
static inline bool
am_leader_apply_worker(void)
{
    return !isParallelApplyWorker(MyLogicalRepWorker) &&
           !am_tablesync_worker();
}

======

53. src/include/storage/procsignal.h

@@ -35,6 +35,7 @@ typedef enum
  PROCSIG_WALSND_INIT_STOPPING, /* ask walsenders to prepare for shutdown  */
  PROCSIG_BARRIER, /* global barrier interrupt  */
  PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
+ PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */

(like #47)

I think this new enum value belongs adjacent to PROCSIG_PARALLEL_MESSAGE
since it has the most in common with that one.

======

54. src/tools/pgindent/typedefs.list

Missing TransactionApplyAction?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, September 9, 2022 3:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for the v28-0001 patch:
> 
> (There may be some overlap with other people's review comments and/or
> some fixes already made).
> 

Thanks for the comments.


> 3.
> 
> max_logical_replication_workers (integer)
>     Specifies maximum number of logical replication workers. This
> includes apply leader workers, parallel apply workers, and table
> synchronization workers.
>     Logical replication workers are taken from the pool defined by
> max_worker_processes.
>     The default value is 4. This parameter can only be set at server start.
> 
> ~
> 
> I did not really understand why the default is 4. Because the  default
> tablesync workers is 2, and the default parallel workers is 2, but
> what about accounting for the apply worker? Therefore, shouldn't
> max_logical_replication_workers default be 5 instead of 4?

Parallel apply is disabled by default, so it's not necessary to increase
this global default value, as discussed in [1].

[1] https://www.postgresql.org/message-id/CAD21AoCwaU8SqjmC7UkKWNjDg3Uz4FDGurMpis3zw5SEC%2B27jQ%40mail.gmail.com


> 6. src/backend/replication/logical/applyparallelworker.c
> 
> Is there a reason why this file is called applyparallelworker.c
> instead of parallelapplyworker.c? Now this name is out of step with
> names of all the new typedefs etc.

It was suggested to keep the name consistent with "vacuumparallel.c",
but I am fine with either name. I can change this if more people think
parallelapplyworker.c is better.


> 16. src/backend/replication/logical/applyparallelworker.c -
> parallel_apply_find_worker
> 
> + /* Return the cached parallel apply worker if valid. */
> + if (stream_apply_worker != NULL)
> + return stream_apply_worker;
> 
> Perhaps 'cur_stream_parallel_apply_winfo' is a better name for this var?

This looks a bit long to me.

>   /* Now wait until it attaches. */
> - WaitForReplicationWorkerAttach(worker, generation, bgw_handle);
> + return WaitForReplicationWorkerAttach(worker, generation, bgw_handle);
> 
> The comment feels a tiny bit misleading, because there is a chance
> that this might not attach at all and return false if something goes
> wrong.

I feel it might be better to fix this via a separate patch.


> Now that there is an apply_action enum I felt it is better for this
> code to be using a switch instead of all the if/else. Furthermore, it
> might be better to put the switch case in a logical order (e.g. same
> as the suggested enums value order of #29a).

I'm not sure whether a switch statement is better than if/else here, but
if more people prefer it, I can change this.


> 23. src/backend/replication/logical/launcher.c - logicalrep_worker_launch
> 
> + nparallelapplyworkers = logicalrep_parallel_apply_worker_count(subid);
> +
> + /*
> + * Return silently if the number of parallel apply workers reached the
> + * limit per subscription.
> + */
> + if (is_subworker && nparallelapplyworkers >=
> max_parallel_apply_workers_per_subscription)
> + {
> + LWLockRelease(LogicalRepWorkerLock);
> + return false;
>   }
> I’m not sure if this is a good idea to be so silent. How will the user
> know if they should increase the GUC parameter or not if it never
> tells them that the value is too low ?

It's like what we do for table sync workers. Besides, I think the user is
likely to intentionally limit the number of parallel apply workers to
leave free workers for other purposes. And we do report a WARNING later
if there are no free worker slots: errmsg("out of logical replication worker slots").


> 41. src/backend/replication/logical/worker.c - store_flush_position
> 
> + /* Skip if not the leader apply worker */
> + if (am_parallel_apply_worker())
> + return;
> +
> 
> Code might be better to implement/use a new function so it can check
> something like !am_leader_apply_worker()

Based on the existing code, both the leader apply worker and the table
sync worker can enter this function. Using !am_leader_apply_worker()
would disallow the table sync worker from entering this function, which
might not be good.


> 47. src/backend/storage/ipc/procsignal.c - procsignal_sigusr1_handler
> 
> @@ -657,6 +658,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
>   if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT))
>   HandleLogMemoryContextInterrupt();
> 
> + if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
> + HandleParallelApplyMessageInterrupt();
> +
> 
> I wasn’t sure about the placement of this new code because those
> CheckProcSignal don’t seem to have any particular order. I think this
> belongs adjacent to the PROCSIG_PARALLEL_MESSAGE since it has the most
> in common with that one.

I'm not very sure; I just followed the way we usually add a new
SignalReason (i.e., add the new reason at the end, but before the
recovery conflict reasons). Also, parallel apply is not very similar to
parallel query in detail.


> I thought it might be worthwhile to also add another function like
> am_leader_apply_worker(). I noticed at least one place in this patch
> where it could have been called.

It seems a bit unnecessary to introduce a new macro where we can already
use am_parallel_apply_worker() for the check.


Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hou-san,

Thank you for updating the patch! Followings are comments for v28-0001.
I will dig your patch more, but I send partially to keep the activity of the thread.

===
For applyparallelworker.c

01. filename
The word ordering of the filename does not seem good,
because you defined the new worker as a "parallel apply worker".

02. global variable

```
+/* Parallel apply workers hash table (initialized on first use). */
+static HTAB *ParallelApplyWorkersHash = NULL;
+
+/*
+ * List that stores the information of parallel apply workers that were
+ * started. Newly added worker information will be removed from the list at the
+ * end of the transaction when there are enough workers in the pool. Besides,
+ * exited workers will be removed from the list after being detected.
+ */
+static List *ParallelApplyWorkersList = NIL;
```

Could you add a description of the difference between the list and the
hash table? IIUC the hash table stores the parallel workers that are
assigned to transactions, and the list stores all alive ones.


03. parallel_apply_find_worker

```
+       /* Return the cached parallel apply worker if valid. */
+       if (stream_apply_worker != NULL)
+               return stream_apply_worker;
```

This is just a question: why are the given xid and the xid assigned to
the worker not compared here? Is there a chance of finding the wrong
worker?


04. parallel_apply_start_worker

```
+/*
+ * Start a parallel apply worker that will be used for the specified xid.
+ *
+ * If a parallel apply worker is not in use then re-use it, otherwise start a
+ * fresh one. Cache the worker information in ParallelApplyWorkersHash keyed by
+ * the specified xid.
+ */
+void
+parallel_apply_start_worker(TransactionId xid)
```

"parallel_apply_start_worker" should be "start_parallel_apply_worker", I think


05. parallel_apply_stream_abort

```
        for (i = list_length(subxactlist) - 1; i >= 0; i--)
        {
            xid = list_nth_xid(subxactlist, i);
            if (xid == subxid)
            {
                found = true;
                break;
            }
        }
```

Please do not reuse the xid variable; declare and use another variable in the else block or something similar.
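
For instance (just a sketch; the local variable name is illustrative):

        for (i = list_length(subxactlist) - 1; i >= 0; i--)
        {
            TransactionId subxact_xid = list_nth_xid(subxactlist, i);

            if (subxact_xid == subxid)
            {
                found = true;
                break;
            }
        }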

06. parallel_apply_free_worker

```
+       if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+       {
```

Please add a comment like "Do we have enough workers in the pool?" or something similar.

===
For worker.c

07. general

In many places an if-else statement is used for apply_action, but I think these should be rewritten as switch statements.

08. global variable

```
-static bool in_streamed_transaction = false;
+bool in_streamed_transaction = false;
```

a.

It seems that in_streamed_transaction is used only in worker.c, so we can change it back to a static variable.

b.

That flag is set only when an apply worker spills the transaction to disk.
How about "in_streamed_transaction" -> "in_spilled_transaction"?

09.  apply_handle_stream_prepare

```
-       elog(DEBUG1, "received prepare for streamed transaction %u", prepare_data.xid);
```

I think this debug message is still useful.

10. apply_handle_stream_stop

```
+       if (apply_action == TA_APPLY_IN_PARALLEL_WORKER)
+       {
+               pgstat_report_activity(STATE_IDLEINTRANSACTION, NULL);
+       }
+       else if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
+       {
```

The ordering of the STREAM {STOP, START} messages is checked only when an apply worker spills the transaction to disk
(this is done via in_streamed_transaction).
I think checks should be added here too, like if (!stream_apply_worker) or something similar.
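
One possible shape of such a check (only a guess at the exact condition;
stream_apply_worker is the cached worker info from the quoted patch):

        if (!in_streamed_transaction && !stream_apply_worker)
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg_internal("STREAM STOP message without STREAM START")));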

11. apply_handle_stream_abort

```
+       if (in_streamed_transaction)
+               ereport(ERROR,
+                               (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                                errmsg_internal("STREAM ABORT message without STREAM STOP")));
```

I think a check using stream_apply_worker should be added here as well.

12. apply_handle_stream_commit

a.

```
    if (in_streamed_transaction)
        ereport(ERROR,
                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
```

I think a check using stream_apply_worker should be added here as well.

b. 

```
-       elog(DEBUG1, "received commit for streamed transaction %u", xid);
```

I think this debug message is still useful.

===
For launcher.c

13. logicalrep_worker_stop_by_slot

```
+       LogicalRepWorker *worker = &LogicalRepCtx->workers[slot_no];
+
+       LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+       /* Return if the generation doesn't match or the worker is not alive. */
+       if (worker->generation != generation ||
+               worker->proc == NULL)
+               return;
+
```

a.

LWLockAcquire(LogicalRepWorkerLock) is needed before reading the slot.

b. 

LWLockRelease(LogicalRepWorkerLock) is needed even if the worker is not found.



Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Fri, Sep 9, 2022 at 2:31 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, September 9, 2022 3:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> > 3.
> >
> > max_logical_replication_workers (integer)
> >     Specifies maximum number of logical replication workers. This
> > includes apply leader workers, parallel apply workers, and table
> > synchronization workers.
> >     Logical replication workers are taken from the pool defined by
> > max_worker_processes.
> >     The default value is 4. This parameter can only be set at server start.
> >
> > ~
> >
> > I did not really understand why the default is 4. Because the  default
> > tablesync workers is 2, and the default parallel workers is 2, but
> > what about accounting for the apply worker? Therefore, shouldn't
> > max_logical_replication_workers default be 5 instead of 4?
>
> The parallel apply is disabled by default, so it's not a must to increase this
> global default value as discussed[1]
>
> [1] https://www.postgresql.org/message-id/CAD21AoCwaU8SqjmC7UkKWNjDg3Uz4FDGurMpis3zw5SEC%2B27jQ%40mail.gmail.com
>

Okay, but can we document that this value should be increased when
parallel apply is enabled?

-- 
With Regards,
Amit Kapila.



On Fri, Sep 9, 2022 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 29. src/backend/replication/logical/worker.c - TransactionApplyAction
>
> /*
>  * What action to take for the transaction.
>  *
>  * TA_APPLY_IN_LEADER_WORKER means that we are in the leader apply worker and
>  * changes of the transaction are applied directly in the worker.
>  *
>  * TA_SERIALIZE_TO_FILE means that we are in leader apply worker and changes
>  * are written to temporary files and then applied when the final commit
>  * arrives.
>  *
>  * TA_APPLY_IN_PARALLEL_WORKER means that we are in the parallel apply worker
>  * and changes of the transaction are applied directly in the worker.
>  *
>  * TA_SEND_TO_PARALLEL_WORKER means that we are in the leader apply worker and
>  * need to send the changes to the parallel apply worker.
>  */
> typedef enum
> {
> /* The action for non-streaming transactions. */
> TA_APPLY_IN_LEADER_WORKER,
>
> /* Actions for streaming transactions. */
> TA_SERIALIZE_TO_FILE,
> TA_APPLY_IN_PARALLEL_WORKER,
> TA_SEND_TO_PARALLEL_WORKER
> } TransactionApplyAction;
>
> ~
>
> 29a.
> I think if you change all those enum names slightly (e.g. like below)
> then they can be more self-explanatory:
>
> TA_NOT_STREAMING_LEADER_APPLY
> TA_STREAMING_LEADER_SERIALIZE
> TA_STREAMING_LEADER_SEND_TO_PARALLEL
> TA_STREAMING_PARALLEL_APPLY
>
> ~
>

I also think we can improve the naming, but adding "streaming" in the
names makes them slightly difficult to read. As you have suggested, it
will be better to add comments for the streaming and non-streaming
cases. How about naming them as below:

typedef enum
{
    TRANS_LEADER_APPLY,
    TRANS_LEADER_SERIALIZE,
    TRANS_LEADER_SEND_TO_PARALLEL,
    TRANS_PARALLEL_APPLY
} TransApplyAction;

-- 
With Regards,
Amit Kapila.



On Mon, Sep 12, 2022 at 4:27 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Hou-san,
>
> Thank you for updating the patch! Followings are comments for v28-0001.
> I will dig your patch more, but I send partially to keep the activity of the thread.
>
> ===
> For applyparallelworker.c
>
> 01. filename
> The word-ordering of filename seems not good
> because you defined the new worker as "parallel apply worker".
>

I think in the future we may have more files for apply work (like
applyddl.c for DDL apply work), so it seems okay to name all
apply-related files in a similar way.

>
> ===
> For worker.c
>
> 07. general
>
> In many lines if-else statement is used for apply_action, but I think they should rewrite as switch-case statement.
>

Sounds reasonable to me.

> 08. global variable
>
> ```
> -static bool in_streamed_transaction = false;
> +bool in_streamed_transaction = false;
> ```
>
> a.
>
> It seems that in_streamed_transaction is used only in the worker.c, so we can change to stati variable.
>

Yeah, I don't know why it was changed in the first place.

> b.
>
> That flag is set only when an apply worker spill the transaction to the disk.
> How about "in_streamed_transaction" -> "in_spilled_transaction"?
>

Isn't this an existing variable? If so, it doesn't seem like a good
idea to change the name unless we are changing its meaning.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hou-san,

> I will dig your patch more, but I send partially to keep the activity of the thread.

More minor comments about v28.

===
About 0002 

For 015_stream.pl

14. check_parallel_log

```
+# Check the log that the streamed transaction was completed successfully
+# reported by parallel apply worker.
+sub check_parallel_log
+{
+       my ($node_subscriber, $offset, $is_parallel)= @_;
+       my $parallel_message = 'finished processing the transaction finish command';
+
+       if ($is_parallel)
+       {
+               $node_subscriber->wait_for_log(qr/$parallel_message/, $offset);
+       }
+}
```

I think check_parallel_log() should be called only when streaming = 'parallel', so the if-statement is not needed.

===
For 016_stream_subxact.pl

15. test_streaming

```
+       INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
```

"    3" should be "3".

===
About 0003

For applyparallelworker.c

16. parallel_apply_relation_check()

```
+       if (rel->parallel_apply_safe == PARALLEL_APPLY_SAFETY_UNKNOWN)
+               logicalrep_rel_mark_parallel_apply(rel);
```

I was not clear on when logicalrep_rel_mark_parallel_apply() is called here.
IIUC parallel_apply_relation_check() is called when the parallel apply worker handles changes,
but before that the relation is opened via logicalrep_rel_open() and parallel_apply_safe is set there.
If this guards against some protocol violation, we may use Assert().

===
For create_subscription.sgml

17.
The restriction about foreign keys does not seem to be documented.

===
About 0004

For 015_stream.pl

18. check_parallel_log

I heard that the removal has been reverted, but in the patch
check_parallel_log() is removed again... :-(


===
About throughout

I checked the test coverage via `make coverage`. For applyparallelworker.c and worker.c, the function coverage
is 100% for both, and the line coverage is 86.2% and 94.5% respectively. Generally it's good.
But I read the report and the following parts seem not to be tested.

In parallel_apply_start_worker():

```
        if (tmp_winfo->error_mq_handle == NULL)
        {
            /*
             * Release the worker information and try next one if the parallel
             * apply worker exited cleanly.
             */
            ParallelApplyWorkersList = foreach_delete_current(ParallelApplyWorkersList, lc);
            shm_mq_detach(tmp_winfo->mq_handle);
            dsm_detach(tmp_winfo->dsm_seg);
            pfree(tmp_winfo);
        }
```

In HandleParallelApplyMessage():

```
        case 'X':                /* Terminate, indicating clean exit */
            {
                shm_mq_detach(winfo->error_mq_handle);
                winfo->error_mq_handle = NULL;
                break;
            }
```

Does it mean that we do not test the termination of the parallel apply worker? If so, I think it should be tested.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Hi,

> > 01. filename
> > The word-ordering of filename seems not good
> > because you defined the new worker as "parallel apply worker".
> >
> 
> I think in the future we may have more files for apply work (like
> applyddl.c for DDL apply work), so it seems okay to name all apply
> related files in a similar way.

> > That flag is set only when an apply worker spill the transaction to the disk.
> > How about "in_streamed_transaction" -> "in_spilled_transaction"?
> >
> 
> Isn't this an existing variable? If so, it doesn't seem like a good
> idea to change the name unless we are changing its meaning.

What both of you said is reasonable; neither needs to be modified.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Sep 8, 2022 at 14:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 5, 2022 at 6:34 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the correct patch set this time.
> >
> 
> Few comments on v28-0001*:

Thanks for your comments.

> 1.
> + /* Whether the worker is processing a transaction. */
> + bool in_use;
> 
> I think this same comment applies to in_parallel_apply_xact flag as
> well. How about: "Indicates whether the worker is available to be used
> for parallel apply transaction?"?
> 
> 2.
> + /*
> + * Set this flag in the leader instead of the parallel apply worker to
> + * avoid the race condition where the leader has already started waiting
> + * for the parallel apply worker to finish processing the transaction(set
> + * the in_parallel_apply_xact to false) while the child process has not yet
> + * processed the first STREAM_START and has not set the
> + * in_parallel_apply_xact to true.
> 
> I think part of this comment "(set the in_parallel_apply_xact to
> false)" is not necessary. It will be clear without that.
> 
> 3.
> + /* Create entry for requested transaction. */
> + entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_ENTER, &found);
> + if (found)
> + elog(ERROR, "hash table corrupted");
> ...
> ...
> + hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL);
> 
> It is better to have a similar elog for HASH_REMOVE case as well. We
> normally seem to have such elog for HASH_REMOVE.
> 
> 4.
> * Parallel apply is not supported when subscribing to a publisher which
> +     * cannot provide the abort_time, abort_lsn and the column information
> used
> +     * to verify the parallel apply safety.
> 
> 
> In this comment, which column information are you referring to?
> 
> 5.
> + /*
> + * Set in_parallel_apply_xact to true again as we only aborted the
> + * subtransaction and the top transaction is still in progress. No
> + * need to lock here because currently only the apply leader are
> + * accessing this flag.
> + */
> + winfo->shared->in_parallel_apply_xact = true;
> 
> This theory sounds good to me but I think it is better to update/read
> this flag under spinlock as the patch is doing at a few other places.
> I think that will make the code easier to follow without worrying too
> much about such special cases. There are a few asserts as well which
> read this without lock, it would be better to change those as well.
> 
> 6.
> + * LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum
> protocol version
> + * with support for streaming large transactions using parallel apply
> + * workers. Introduced in PG16.
> 
> How about changing it to something like:
> "LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is the minimum
> protocol
> version where we support applying large streaming transactions in
> parallel. Introduced in PG16."
> 
> 7.
> + PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
> + bool write_abort_lsn = (data->protocol_version >=
> + LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM);
> 
>   /*
>   * The abort should happen outside streaming block, even for streamed
> @@ -1856,7 +1859,8 @@ pgoutput_stream_abort(struct
> LogicalDecodingContext *ctx,
>   Assert(rbtxn_is_streamed(toptxn));
> 
>   OutputPluginPrepareWrite(ctx, true);
> - logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
> + logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn, abort_lsn,
> +   write_abort_lsn);
> 
> I think we need to send additional information if the client has used
> the parallel streaming option. Also, let's keep sending subxid as we
> were doing previously and add additional parameters required. It may
> be better to name write_abort_lsn as abort_info.
> 
> 8.
> + /*
> + * Check whether the publisher sends abort_lsn and abort_time.
> + *
> + * Note that the paralle apply worker is only started when the publisher
> + * sends abort_lsn and abort_time.
> + */
> + if (am_parallel_apply_worker() ||
> + walrcv_server_version(LogRepWorkerWalRcvConn) >= 160000)
> + read_abort_lsn = true;
> +
> + logicalrep_read_stream_abort(s, &abort_data, read_abort_lsn);
> 
> This check should match with the check for the write operation where
> we are checking the protocol version as well. There is a typo as well
> in the comments (/paralle/parallel).

Improved as suggested.

Attached the new patch set.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Sep 8, 2022 at 19:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 8, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Sep 5, 2022 at 6:34 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the correct patch set this time.
> > >
> >
> > Few comments on v28-0001*:
> > =======================
> >
> 
> Some suggestions for comments in v28-0001*

Thanks for your comments and patch!

> 1.
> +/*
> + * Entry for a hash table we use to map from xid to the parallel apply worker
> + * state.
> + */
> +typedef struct ParallelApplyWorkerEntry
> 
> Let's change this comment to: "Hash table entry to map xid to the
> parallel apply worker state."
> 
> 2.
> +/*
> + * List that stores the information of parallel apply workers that were
> + * started. Newly added worker information will be removed from the list at
> the
> + * end of the transaction when there are enough workers in the pool. Besides,
> + * exited workers will be removed from the list after being detected.
> + */
> +static List *ParallelApplyWorkersList = NIL;
> 
> Can we change this to: "A list to maintain the active parallel apply
> workers. The information for the new worker is added to the list after
> successfully launching it. The list entry is removed at the end of the
> transaction if there are already enough workers in the worker pool.
> For more information about the worker pool, see comments atop
> worker.c. We also remove the entry from the list if the worker is
> exited due to some error."
> 
> Apart from this, I have added/changed a few other comments in
> v28-0001*. Kindly check the attached, if you are fine with it then
> please include it in the next version.

Improved as suggested.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Sep 9, 2022 at 15:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are my review comments for the v28-0001 patch:
> 
> (There may be some overlap with other people's review comments and/or
> some fixes already made).

Thanks for your comments.

> 5. src/backend/libpq/pqmq.c
> 
> + {
> + if (IsParallelWorker())
> + SendProcSignal(pq_mq_parallel_leader_pid,
> +    PROCSIG_PARALLEL_MESSAGE,
> +    pq_mq_parallel_leader_backend_id);
> + else
> + {
> + Assert(IsLogicalParallelApplyWorker());
> + SendProcSignal(pq_mq_parallel_leader_pid,
> +    PROCSIG_PARALLEL_APPLY_MESSAGE,
> +    pq_mq_parallel_leader_backend_id);
> + }
> + }
> 
> This code can be simplified if you want to. For example,
> 
> {
> ProcSignalReason reason;
> Assert(IsParallelWorker() || IsLogicalParallelApplyWorker());
> reason = IsParallelWorker() ? PROCSIG_PARALLEL_MESSAGE :
> PROCSIG_PARALLEL_APPLY_MESSAGE;
> SendProcSignal(pq_mq_parallel_leader_pid, reason,
>    pq_mq_parallel_leader_backend_id);
> }

Not sure this would be better.

> 14.
> 
> + /* Failed to start a new parallel apply worker. */
> + if (winfo == NULL)
> + return;
> 
> There seem to be quite a lot of places (like this example) where
> something may go wrong and the behaviour apparently will just silently
> fall-back to using the non-parallel streaming. Maybe that is OK, but I
> am just wondering how can the user ever know this has happened? Maybe
> the docs can mention that this could happen and give some description
> of what processes users can look for (or some other strategy) so they
> can just confirm that the parallel streaming is really working like
> they assume it to be?

I think the user could refer to the pg_stat_subscription view to check whether
the parallel apply worker has started.
BTW, we have documented the case where no parallel worker is available.
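
For illustration only (not from the patch; the subscription name is made up,
and apply_leader_pid is the column proposed later in this patch set, so its
exact name may change), such a check on the subscriber might look like:

```sql
-- Hypothetical check: parallel apply workers report the PID of their leader
-- apply worker in apply_leader_pid (the column added by this patch set).
SELECT subname, pid, apply_leader_pid
FROM pg_stat_subscription
WHERE subname = 'mysub';
```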

> 17. src/backend/replication/logical/applyparallelworker.c -
> parallel_apply_free_worker
> 
> +/*
> + * Remove the parallel apply worker entry from the hash table. And stop the
> + * worker if there are enough workers in the pool.
> + */
> +void
> +parallel_apply_free_worker(ParallelApplyWorkerInfo *winfo, TransactionId
> xid)
> 
> I think the reason for doing the "enough workers in the pool" logic
> needs some more explanation.

Because the parallel apply worker keeps running after it finishes a
transaction, we stop it to avoid wasting resources once the pool already has
enough workers (half of max_parallel_apply_workers_per_subscription; e.g., with
that limit set to 4, at most 2 idle workers are retained and any extra worker
is stopped at the end of its transaction).

> 19. src/backend/replication/logical/applyparallelworker.c -
> LogicalParallelApplyLoop
> 
> + ApplyMessageContext = AllocSetContextCreate(ApplyContext,
> + "ApplyMessageContext",
> + ALLOCSET_DEFAULT_SIZES);
> 
> Should the name of this context be "ParallelApplyMessageContext"?

I think it is okay to use "ApplyMessageContext" here just like "ApplyContext".
I will change this if more people have the same idea as you.

> 20. src/backend/replication/logical/applyparallelworker.c -
> HandleParallelApplyMessage
> 
> + default:
> + {
> + elog(ERROR, "unrecognized message type received from parallel apply
> worker: %c (message length %d bytes)",
> + msgtype, msg->len);
> + }
> 
> "received from" -> "received by"
> 
> ~~~
> 
> 
> 21. src/backend/replication/logical/applyparallelworker.c -
> HandleParallelApplyMessages
> 
> +/*
> + * Handle any queued protocol messages received from parallel apply workers.
> + */
> +void
> +HandleParallelApplyMessages(void)
> 
> 21a.
> "received from" -> "received by"
> 
> ~
> 
> 21b.
> I wonder if this comment should give some credit to the function in
> parallel.c - because this seems almost a copy of all that code.

Since the message is sent from the parallel apply worker to the main apply
worker, I think "from" looks a little better.

> 27. src/backend/replication/logical/launcher.c - logicalrep_worker_detach
> 
> + /*
> + * This is the leader apply worker; stop all the parallel apply workers
> + * previously started from here.
> + */
> + if (!isParallelApplyWorker(MyLogicalRepWorker))
> 
> 27a.
> The comment does not match the code. If this *is* the leader apply
> worker then why do we have the condition to check that?
> 
> Maybe only needs a comment update like
> 
> SUGGESTION
> If this is the leader apply worker then stop all the parallel...
> 
> ~
> 
> 27b.
> Code seems also assuming it cannot be a tablesync worker but it is not
> checking that. I am wondering if it will be better to have yet another
> macro/inline to do isLeaderApplyWorker() that will make sure this
> really is the leader apply worker. (This review comment suggestion is
> repeated later below).

=>27a.
Improved as suggested.

=>27b.
Changed the if-statement to 
`if (!am_parallel_apply_worker() && !am_tablesync_worker())`.

> 42. src/backend/replication/logical/worker.c - InitializeApplyWorker
> 
> +/*
> + * Initialize the database connection, in-memory subscription and necessary
> + * config options.
> + */
> 
> I still think this should mention that this is common initialization
> code for "both leader apply workers, and parallel apply workers"

I'm not sure about this. I will change this if more people have the same idea
as you.

> 44. src/backend/replication/logical/worker.c - IsLogicalParallelApplyWorker
> 
> +/*
> + * Is current process a logical replication parallel apply worker?
> + */
> +bool
> +IsLogicalParallelApplyWorker(void)
> +{
> + return am_parallel_apply_worker();
> +}
> +
> 
> It seems a bit strange to have this function
> IsLogicalParallelApplyWorker, and also am_parallel_apply_worker()
> which are basically identical except one of them is static and one is
> not.
> 
> I wonder if there should be just one function. And if you really do
> need 2 names for consistency then you can just define a synonym like
> 
> #define am_parallel_apply_worker IsLogicalParallelApplyWorker

I am not sure whether this would be better, but I can change it if more people
prefer.

> 49. src/include/replication/worker_internal.h
> 
> @@ -60,6 +64,12 @@ typedef struct LogicalRepWorker
>   */
>   FileSet    *stream_fileset;
> 
> + /*
> + * PID of leader apply worker if this slot is used for a parallel apply
> + * worker, InvalidPid otherwise.
> + */
> + pid_t apply_leader_pid;
> +
>   /* Stats. */
>   XLogRecPtr last_lsn;
>   TimestampTz last_send_time;
> Whitespace indent of the new member ok?

I will run pgindent later.

The rest of the comments are changed as suggested.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Sep 12, 2022 at 18:58 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Hou-san,
> 
> Thank you for updating the patch! Followings are comments for v28-0001.
> I will dig your patch more, but I send partially to keep the activity of the thread.

Thanks for your comments.

> ===
> For applyparallelworker.c
> 
> 01. filename
> The word-ordering of filename seems not good
> because you defined the new worker as "parallel apply worker".

As Amit said, this keeps it consistent with the format of the other file names.

> 02. global variable
> 
> ```
> +/* Parallel apply workers hash table (initialized on first use). */
> +static HTAB *ParallelApplyWorkersHash = NULL;
> +
> +/*
> + * List that stores the information of parallel apply workers that were
> + * started. Newly added worker information will be removed from the list at
> the
> + * end of the transaction when there are enough workers in the pool. Besides,
> + * exited workers will be removed from the list after being detected.
> + */
> +static List *ParallelApplyWorkersList = NIL;
> ```
> 
> Could you add descriptions about difference between the list and hash table?
> IIUC the Hash stores the parallel workers that
> are assigned to transacitons, and the list stores all alive ones.

Made some modifications to the comments above ParallelApplyWorkersList.
I think the difference between these two variables can be seen by referring to
the functions parallel_apply_start_worker and parallel_apply_free_worker.

> 03. parallel_apply_find_worker
> 
> ```
> +       /* Return the cached parallel apply worker if valid. */
> +       if (stream_apply_worker != NULL)
> +               return stream_apply_worker;
> ```
> 
> This is just a question -
> Why the given xid and the assigned xid to the worker are not checked here?
> Is there chance to find wrong worker?

I think it is okay to not check the worker's xid here.
Please refer to the comments above `stream_apply_worker`.
"stream_apply_worker" will only be returned during a stream block, which means
the xid is the same as the xid in the STREAM_START message.

> 04. parallel_apply_start_worker
> 
> ```
> +/*
> + * Start a parallel apply worker that will be used for the specified xid.
> + *
> + * If a parallel apply worker is not in use then re-use it, otherwise start a
> + * fresh one. Cache the worker information in ParallelApplyWorkersHash
> keyed by
> + * the specified xid.
> + */
> +void
> +parallel_apply_start_worker(TransactionId xid)
> ```
> 
> "parallel_apply_start_worker" should be "start_parallel_apply_worker", I think

For code readability, similar functions are named in this format:
`parallel_apply_.*_worker`.

> 05. parallel_apply_stream_abort
> 
> ```
>         for (i = list_length(subxactlist) - 1; i >= 0; i--)
>         {
>             xid = list_nth_xid(subxactlist, i);
>             if (xid == subxid)
>             {
>                 found = true;
>                 break;
>             }
>         }
> ```
> 
> Please not reuse the xid, declare and use another variable in the else block or
> something.

Added a temporary variable "xid_tmp" inside the for-statement.

> 06. parallel_apply_free_worker
> 
> ```
> +       if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> +       {
> ```
> 
> Please add a comment like: "Do we have enough workers in the pool?" or
> something.

Added the following comment according to your suggestion:
`Are there enough workers in the pool?`

> For worker.c
> 
> 07. general
> 
> In many lines if-else statement is used for apply_action, but I think they should
> rewrite as switch-case statement.

Changed.

> 08. global variable
> 
> ```
> -static bool in_streamed_transaction = false;
> +bool in_streamed_transaction = false;
> ```
> 
> a.
> 
> It seems that in_streamed_transaction is used only in the worker.c, so we can
> change to stati variable.
> 
> b.
> 
> That flag is set only when an apply worker spill the transaction to the disk.
> How about "in_streamed_transaction" -> "in_spilled_transaction"?

=>8a.
Improved.

=>8b.
I am not sure if we could rename this existing variable for this. So I kept the
name.

> 09.  apply_handle_stream_prepare
> 
> ```
> -       elog(DEBUG1, "received prepare for streamed transaction %u",
> prepare_data.xid);
> ```
> 
> I think this debug message is still useful.

Since I think it is not appropriate to log the xid here, I added back the
following message: `finished processing the transaction finish command`.

> 10. apply_handle_stream_stop
> 
> ```
> +       if (apply_action == TA_APPLY_IN_PARALLEL_WORKER)
> +       {
> +               pgstat_report_activity(STATE_IDLEINTRANSACTION, NULL);
> +       }
> +       else if (apply_action == TA_SEND_TO_PARALLEL_WORKER)
> +       {
> ```
> 
> The ordering of the STREAM {STOP, START} is checked only when an apply
> worker spill the transaction to the disk.
> (This is done via in_streamed_transaction)
> I think checks should be added here, like if (!stream_apply_worker) or
> something.
>
> 11. apply_handle_stream_abort
> 
> ```
> +       if (in_streamed_transaction)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_PROTOCOL_VIOLATION),
> +                                errmsg_internal("STREAM ABORT message without STREAM
> STOP")));
> ```
> 
> I think the check by stream_apply_worker should be added.

Because "in_streamed_transaction" is only used for non-parallel apply.
So I used stream_apply_worker to confirm the ordering of the STREAM {STOP,
START}.
BTW, I move the reset of in_streamed_transaction into the block of
`else if (apply_action == TA_SERIALIZE_TO_FILE)`.

> 12. apply_handle_stream_commit
> 
> a.
> 
> ```
>     if (in_streamed_transaction)
>         ereport(ERROR,
>                 (errcode(ERRCODE_PROTOCOL_VIOLATION),
>                  errmsg_internal("STREAM COMMIT message
> without STREAM STOP")));
> ```
> 
> I think the check by stream_apply_worker should be added.
> 
> b.
> 
> ```
> -       elog(DEBUG1, "received commit for streamed transaction %u", xid);
> ```
> 
> I think this debug message is still useful.

=>12a.
See the reply to #10 && #11.

=>12b.
See the reply to #09.

> ===
> For launcher.c
> 
> 13. logicalrep_worker_stop_by_slot
> 
> ```
> +       LogicalRepWorker *worker = &LogicalRepCtx->workers[slot_no];
> +
> +       LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> +
> +       /* Return if the generation doesn't match or the worker is not alive. */
> +       if (worker->generation != generation ||
> +               worker->proc == NULL)
> +               return;
> +
> ```
> 
> a.
> 
> LWLockAcquire(LogicalRepWorkerLock) is needed before reading slots.
> 
> b.
> 
> LWLockRelease(LogicalRepWorkerLock) is needed even if worker is not found.

Fixed.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Sep 13, 2022 at 17:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thanks for your comments.

> On Fri, Sep 9, 2022 at 2:31 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, September 9, 2022 3:02 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > >
> >
> > > 3.
> > >
> > > max_logical_replication_workers (integer)
> > >     Specifies maximum number of logical replication workers. This
> > > includes apply leader workers, parallel apply workers, and table
> > > synchronization workers.
> > >     Logical replication workers are taken from the pool defined by
> > > max_worker_processes.
> > >     The default value is 4. This parameter can only be set at server start.
> > >
> > > ~
> > >
> > > I did not really understand why the default is 4. Because the  default
> > > tablesync workers is 2, and the default parallel workers is 2, but
> > > what about accounting for the apply worker? Therefore, shouldn't
> > > max_logical_replication_workers default be 5 instead of 4?
> >
> > The parallel apply is disabled by default, so it's not a must to increase this
> > global default value as discussed[1]
> >
> > [1] https://www.postgresql.org/message-
> id/CAD21AoCwaU8SqjmC7UkKWNjDg3Uz4FDGurMpis3zw5SEC%2B27jQ%40mail
> .gmail.com
> >
> 
> Okay, but can we document to increase this value when the parallel
> apply is enabled?

Added the following sentence to the chapter [31.10. Configuration Settings]:
```
In addition, if the subscription parameter <literal>streaming</literal> is set
to <literal>parallel</literal>, please increase
<literal>max_logical_replication_workers</literal> according to the desired
number of parallel apply workers.
```
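
For illustration only (none of this is in the patch; the object names and
numbers are made up, and streaming = parallel is the new value added by this
patch set), the kind of setup that sentence is aimed at might look like:

```sql
-- Hypothetical sizing: 1 leader apply worker + 2 parallel apply workers,
-- plus some reserve for table synchronization workers.
-- max_logical_replication_workers takes effect at server start.
ALTER SYSTEM SET max_logical_replication_workers = 6;
-- GUC proposed by this patch set (parallel apply workers per subscription).
ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 2;

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel);
```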

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Sep 13, 2022 at 18:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thanks for your comments.

> On Fri, Sep 9, 2022 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > 29. src/backend/replication/logical/worker.c - TransactionApplyAction
> >
> > /*
> >  * What action to take for the transaction.
> >  *
> >  * TA_APPLY_IN_LEADER_WORKER means that we are in the leader apply
> worker and
> >  * changes of the transaction are applied directly in the worker.
> >  *
> >  * TA_SERIALIZE_TO_FILE means that we are in leader apply worker and
> changes
> >  * are written to temporary files and then applied when the final commit
> >  * arrives.
> >  *
> >  * TA_APPLY_IN_PARALLEL_WORKER means that we are in the parallel apply
> worker
> >  * and changes of the transaction are applied directly in the worker.
> >  *
> >  * TA_SEND_TO_PARALLEL_WORKER means that we are in the leader apply
> worker and
> >  * need to send the changes to the parallel apply worker.
> >  */
> > typedef enum
> > {
> > /* The action for non-streaming transactions. */
> > TA_APPLY_IN_LEADER_WORKER,
> >
> > /* Actions for streaming transactions. */
> > TA_SERIALIZE_TO_FILE,
> > TA_APPLY_IN_PARALLEL_WORKER,
> > TA_SEND_TO_PARALLEL_WORKER
> > } TransactionApplyAction;
> >
> > ~
> >
> > 29a.
> > I think if you change all those enum names slightly (e.g. like below)
> > then they can be more self-explanatory:
> >
> > TA_NOT_STREAMING_LEADER_APPLY
> > TA_STREAMING_LEADER_SERIALIZE
> > TA_STREAMING_LEADER_SEND_TO_PARALLEL
> > TA_STREAMING_PARALLEL_APPLY
> >
> > ~
> >
> 
> I also think we can improve naming but adding streaming in the names
> makes them slightly difficult to read. As you have suggested, it will
> be better to add comments for streaming and non-streaming cases. How
> about naming them as below:
> 
> typedef enum
> {
> TRANS_LEADER_APPLY
> TRANS_LEADER_SERIALIZE
> TRANS_LEADER_SEND_TO_PARALLEL
> TRANS_PARALLEL_APPLY
> } TransApplyAction;

I think your suggestion looks good.
Improved as suggested.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Sep 13, 2022 at 20:02 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Hou-san,
> 
> > I will dig your patch more, but I send partially to keep the activity of the thread.
> 
> More minor comments about v28.

Thanks for your comments.

> ===
> About 0002
> 
> For 015_stream.pl
> 
> 14. check_parallel_log
> 
> ```
> +# Check the log that the streamed transaction was completed successfully
> +# reported by parallel apply worker.
> +sub check_parallel_log
> +{
> +       my ($node_subscriber, $offset, $is_parallel)= @_;
> +       my $parallel_message = 'finished processing the transaction finish
> command';
> +
> +       if ($is_parallel)
> +       {
> +               $node_subscriber->wait_for_log(qr/$parallel_message/, $offset);
> +       }
> +}
> ```
> 
> I think check_parallel_log() should be called only when streaming = 'parallel' and
> if-statement is not needed

I wanted to make the function test_streaming look simpler, so I put the
checking of the streaming option inside the function check_parallel_log.

> For 016_stream_subxact.pl
> 
> 15. test_streaming
> 
> ```
> +       INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,
> 500) s(i);
> ```
> 
> "    3" should be "3".

Improved.

> About 0003
> 
> For applyparallelworker.c
> 
> 16. parallel_apply_relation_check()
> 
> ```
> +       if (rel->parallel_apply_safe == PARALLEL_APPLY_SAFETY_UNKNOWN)
> +               logicalrep_rel_mark_parallel_apply(rel);
> ```
> 
> I was not clear when logicalrep_rel_mark_parallel_apply() is called here.
> IIUC parallel_apply_relation_check() is called when parallel apply worker
> handles changes,
> but before that relation is opened via logicalrep_rel_open() and
> parallel_apply_safe is set here.
> If it guards some protocol violation, we may use Assert().

Compared with the flag "localrelvalid", we additionally need to reset the
"safety" flag when a function or type is changed (see function
logicalrep_relmap_init). So for these two cases, I think we just need to reset
the "safety" flag to avoid rebuilding too much of the cache (see function
logicalrep_relmap_reset_parallel_cb).

> For create_subscription.sgml
> 
> 17.
> The restriction about foreign key does not seem to be documented.

I removed the check for the foreign key.

Since foreign keys do not take effect in the subscriber's apply worker by
default, it seems that foreign keys will not hit this ERROR frequently.
If a foreign key related trigger is set to "REPLICA", then I think this flag
will be set to "unsafe" when checking the non-immutable function used by the
trigger.

BTW, I only documented this reason in the commit message and kept the foreign
key related tests.
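
Just to illustrate the case I mean (the table name and trigger name below are
made up): the apply worker applies changes with session_replication_role set
to 'replica', so the internal FK (RI) triggers do not fire during apply by
default. A hypothetical way to make them fire, which would then mark the
relation as parallel-unsafe, is:

```sql
-- Hypothetical example: force an internal FK trigger to fire during apply.
-- Internal constraint trigger names follow the "RI_ConstraintTrigger_*" pattern.
ALTER TABLE orders
    ENABLE REPLICA TRIGGER "RI_ConstraintTrigger_c_16512";
```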

> ===
> About 0004
> 
> For 015_stream.pl
> 
> 18. check_parallel_log
> 
> I heard that the removal has been reverted, but in the patch
> check_parallel_log() is removed again... :-(

Yes, I removed it.
I think it would make the test unstable: after applying patch 0004, we cannot
be sure whether the transaction is completed in a parallel apply worker. If any
unexpected error occurs, the test will fail because the log message cannot be
found, even if the transaction completed successfully.

> ===
> About throughout
> 
> I checked the test coverage via `make coverage`. About appluparallelworker.c
> and worker.c, both function coverage is 100%, and
> line coverages are 86.2 % and 94.5 %. Generally it's good.
> But I read the report and following parts seems not tested.
> 
> In parallel_apply_start_worker():
> 
> ```
>         if (tmp_winfo->error_mq_handle == NULL)
>         {
>             /*
>              * Release the worker information and try next one if
> the parallel
>              * apply worker exited cleanly.
>              */
>             ParallelApplyWorkersList =
> foreach_delete_current(ParallelApplyWorkersList, lc);
>             shm_mq_detach(tmp_winfo->mq_handle);
>             dsm_detach(tmp_winfo->dsm_seg);
>             pfree(tmp_winfo);
>         }
> ```
> 
> In HandleParallelApplyMessage():
> 
> ```
>         case 'X':                /* Terminate, indicating
> clean exit */
>             {
>                 shm_mq_detach(winfo->error_mq_handle);
>                 winfo->error_mq_handle = NULL;
>                 break;
>             }
> ```
> 
> Does it mean that we do not test the termination of parallel apply worker? If so I
> think it should be tested.

Since this is an unexpected situation that cannot be reproduced reliably, we
did not add tests for this part of the code to improve coverage.

The new patches were attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275F145878B4A44586C46CE9E499%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Thu, Sep 15, 2022 1:15 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> Attach the new patch set.
> 

Hi,

I did some performance tests for "rollback to savepoint" cases, based on v28
patch.

This test used synchronous logical replication and compared SQL execution times
before and after applying the patch. It tested different percentages of the
changes in the transaction being rolled back (using "rollback to savepoint"),
with different logical_decoding_work_mem settings.

The test was performed ten times, and the average of the middle eight was taken.

The results are as follows. The bar charts and the scripts of the test are
attached. The steps to reproduce performance test are at the beginning of
`start_pub.sh`.
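
To give an idea of the transaction shape being measured (this is only a sketch;
the real statements are in the attached scripts, and the table name and row
counts below are made up), a "rollback 10% (5kk)" run is roughly:

```sql
-- Sketch only: a large streamed transaction in which the last 10% of the
-- changes are rolled back via a savepoint before the commit.
BEGIN;
INSERT INTO large_test SELECT i, md5(i::text) FROM generate_series(1, 4500000) s(i);
SAVEPOINT sp1;
INSERT INTO large_test SELECT i, md5(i::text) FROM generate_series(4500001, 5000000) s(i);
ROLLBACK TO SAVEPOINT sp1;
COMMIT;
```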

RESULT - rollback 10% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        43.752      43.463      42.667
patched                     32.646      30.941      31.491
Compare with HEAD           -25.39%     -28.81%     -26.19%


RESULT - rollback 20% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        40.974      40.214      39.930
patched                     28.114      28.055      27.550
Compare with HEAD           -31.39%     -30.23%     -31.00%


RESULT - rollback 30% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        37.648      37.785      36.969
patched                     29.554      29.389      27.398
Compare with HEAD           -21.50%     -22.22%     -25.89%


RESULT - rollback 50% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        32.312      32.201      32.533
patched                     30.238      30.244      27.903
Compare with HEAD           -6.42%      -6.08%      -14.23%

(If "Compare with HEAD" is a positive number, it means worse than HEAD; if it is
a negative number, it means better than HEAD.)

Summary:
In general, when using "rollback to savepoint", the more data we need to roll
back, the smaller the improvement compared to HEAD. But since such cases won't
be common, this should be okay.

Regards,
Shi yu

Attachment
On Thu, Sep 15, 2022 at 10:45 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patch set.
>

Review of v29-0001*
==================
1.
+parallel_apply_find_worker(TransactionId xid)
{
...
+ entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_FIND, &found);
+ if (found)
+ {
+ /* If any workers (or the postmaster) have died, we have failed. */
+ if (entry->winfo->error_mq_handle == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("lost connection to parallel apply worker")));
...
}

I think the above comment is incorrect because if the postmaster would
have died then you wouldn't have found the entry in the hash table.
How about something like: "We can't proceed if the parallel streaming
worker has already exited."

2.
+/*
+ * Find the previously assigned worker for the given transaction, if any.
+ */
+ParallelApplyWorkerInfo *
+parallel_apply_find_worker(TransactionId xid)

No need to use word 'previously' in the above sentence.

3.
+ * We need one key to register the location of the header, and we need
+ * another key to track the location of the message queue.
+ */
+ shm_toc_initialize_estimator(&e);
+ shm_toc_estimate_chunk(&e, sizeof(ParallelApplyWorkerShared));
+ shm_toc_estimate_chunk(&e, queue_size);
+ shm_toc_estimate_chunk(&e, error_queue_size);
+
+ shm_toc_estimate_keys(&e, 3);

Overall, three keys are used but the comment indicates two. You forgot
to mention the error_queue.

4.
+ if (launched)
+ ParallelApplyWorkersList = lappend(ParallelApplyWorkersList, winfo);
+ else
+ {
+ shm_mq_detach(winfo->mq_handle);
+ shm_mq_detach(winfo->error_mq_handle);
+ dsm_detach(winfo->dsm_seg);
+ pfree(winfo);
+
+ winfo = NULL;
+ }

A. The code used in the else part to free worker info is the same as
what is used in parallel_apply_free_worker. Can we move this to a
separate function say parallel_apply_free_worker_info()?
B. I think it will be better if you use {} for if branch to make it
look consistent with else branch.

5.
+ * case define a named savepoint, so that we are able to commit/rollback it
+ * separately later.
+ */
+void
+parallel_apply_subxact_info_add(TransactionId current_xid)

I don't see the need of commit in the above message. So, we can
slightly modify it to: "... so that we are able to rollback to it
separately later."

6.
+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ xid = list_nth_xid(subxactlist, i);
...
...

+/*
+ * Return the TransactionId value contained in the n'th element of the
+ * specified list.
+ */
+static inline TransactionId
+list_nth_xid(const List *list, int n)
+{
+ Assert(IsA(list, XidList));
+ return lfirst_xid(list_nth_cell(list, n));
+}

I am not really sure that we need a new list function to use for this
place. Can't we directly use lfirst_xid(list_nth_cell) instead?

7.
+void
+parallel_apply_replorigin_setup(void)
+{
+ RepOriginId originid;
+ char originname[NAMEDATALEN];
+ bool started_tx = false;
+
+ /* This function might be called inside or outside of transaction. */
+ if (!IsTransactionState())
+ {
+ StartTransactionCommand();
+ started_tx = true;
+ }

Is there a place in the patch where this function will be called
without having an active transaction state? If so, then this coding is
fine but if not, then I suggest keeping an assert for transaction
state here. The same thing applies to
parallel_apply_replorigin_reset() as well.

8.
+ *
+ * If write_abort_lsn is true, send the abort_lsn and abort_time fields,
+ * otherwise don't.
  */
 void
 logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
-   TransactionId subxid)
+   TransactionId subxid, XLogRecPtr abort_lsn,
+   TimestampTz abort_time, bool abort_info)

In the comment, the name of the variable needs to be updated.

9.
+TransactionId stream_xid = InvalidTransactionId;

-static TransactionId stream_xid = InvalidTransactionId;
...
...
+void
+parallel_apply_subxact_info_add(TransactionId current_xid)
+{
+ if (current_xid != stream_xid &&
+ !list_member_xid(subxactlist, current_xid))

It seems you have changed the scope of stream_xid to use it in
parallel_apply_subxact_info_add(). Won't it be better to pass it as a
parameter (say top_xid)?

10.
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -20,6 +20,7 @@
 #include <sys/time.h>

 #include "access/xlog.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_type.h"
 #include "common/connect.h"
 #include "funcapi.h"
@@ -443,9 +444,14 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
  options->proto.logical.proto_version);

- if (options->proto.logical.streaming &&
- PQserverVersion(conn->streamConn) >= 140000)
- appendStringInfoString(&cmd, ", streaming 'on'");
+ if (options->proto.logical.streaming != SUBSTREAM_OFF)
+ {
+ if (PQserverVersion(conn->streamConn) >= 160000 &&
+ options->proto.logical.streaming == SUBSTREAM_PARALLEL)
+ appendStringInfoString(&cmd, ", streaming 'parallel'");
+ else if (PQserverVersion(conn->streamConn) >= 140000)
+ appendStringInfoString(&cmd, ", streaming 'on'");
+ }

It doesn't seem like a good idea to expose subscription options here.
Can we think of having char *streaming_option instead of the current
streaming parameter which is filled by the caller and used here
directly?

11. The error message used in pgoutput_startup() seems to be better
than the current messages used in that function but it is better to be
consistent with other messages. There is a discussion in the email
thread [1] on improving those messages, so kindly suggest there.

12. In addition to the above, I have changed/added a few comments in
the attached patch.

[1] - https://www.postgresql.org/message-id/20220914.111507.13049297635620898.horikyota.ntt%40gmail.com

-- 
With Regards,
Amit Kapila.

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thu, Sep 15, 2022 at 19:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 15, 2022 at 10:45 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > Attach the new patch set.
> >
> 
> Review of v29-0001*

Thanks for your comments and patch!

> ==================
> 1.
> +parallel_apply_find_worker(TransactionId xid)
> {
> ...
> + entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_FIND, &found);
> + if (found)
> + {
> + /* If any workers (or the postmaster) have died, we have failed. */
> + if (entry->winfo->error_mq_handle == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("lost connection to parallel apply worker")));
> ...
> }
> 
> I think the above comment is incorrect because if the postmaster would
> have died then you wouldn't have found the entry in the hash table.
> How about something like: "We can't proceed if the parallel streaming
> worker has already exited."

Fixed.

> 2.
> +/*
> + * Find the previously assigned worker for the given transaction, if any.
> + */
> +ParallelApplyWorkerInfo *
> +parallel_apply_find_worker(TransactionId xid)
> 
> No need to use word 'previously' in the above sentence.

Improved.

> 3.
> + * We need one key to register the location of the header, and we need
> + * another key to track the location of the message queue.
> + */
> + shm_toc_initialize_estimator(&e);
> + shm_toc_estimate_chunk(&e, sizeof(ParallelApplyWorkerShared));
> + shm_toc_estimate_chunk(&e, queue_size);
> + shm_toc_estimate_chunk(&e, error_queue_size);
> +
> + shm_toc_estimate_keys(&e, 3);
> 
> Overall, three keys are used but the comment indicates two. You forgot
> to mention about error_queue.

Fixed.

> 4.
> + if (launched)
> + ParallelApplyWorkersList = lappend(ParallelApplyWorkersList, winfo);
> + else
> + {
> + shm_mq_detach(winfo->mq_handle);
> + shm_mq_detach(winfo->error_mq_handle);
> + dsm_detach(winfo->dsm_seg);
> + pfree(winfo);
> +
> + winfo = NULL;
> + }
> 
> A. The code used in the else part to free worker info is the same as
> what is used in parallel_apply_free_worker. Can we move this to a
> separate function say parallel_apply_free_worker_info()?
> B. I think it will be better if you use {} for if branch to make it
> look consistent with else branch.

Improved.

> 5.
> + * case define a named savepoint, so that we are able to commit/rollback it
> + * separately later.
> + */
> +void
> +parallel_apply_subxact_info_add(TransactionId current_xid)
> 
> I don't see the need of commit in the above message. So, we can
> slightly modify it to: "... so that we are able to rollback to it
> separately later."

Improved.

> 6.
> + for (i = list_length(subxactlist) - 1; i >= 0; i--)
> + {
> + xid = list_nth_xid(subxactlist, i);
> ...
> ...
> 
> +/*
> + * Return the TransactionId value contained in the n'th element of the
> + * specified list.
> + */
> +static inline TransactionId
> +list_nth_xid(const List *list, int n)
> +{
> + Assert(IsA(list, XidList));
> + return lfirst_xid(list_nth_cell(list, n));
> +}
> 
> I am not really sure that we need a new list function to use for this
> place. Can't we directly use lfirst_xid(list_nth_cell) instead?

Improved.
 
> 7.
> +void
> +parallel_apply_replorigin_setup(void)
> +{
> + RepOriginId originid;
> + char originname[NAMEDATALEN];
> + bool started_tx = false;
> +
> + /* This function might be called inside or outside of transaction. */
> + if (!IsTransactionState())
> + {
> + StartTransactionCommand();
> + started_tx = true;
> + }
> 
> Is there a place in the patch where this function will be called
> without having an active transaction state? If so, then this coding is
> fine but if not, then I suggest keeping an assert for transaction
> state here. The same thing applies to
> parallel_apply_replorigin_reset() as well.

When using parallel apply, only the parallel apply worker is in a transaction,
while the leader apply worker is not. So when invoking the function
parallel_apply_replorigin_setup() in the leader apply worker, we need to start
a transaction block.

> 8.
> + *
> + * If write_abort_lsn is true, send the abort_lsn and abort_time fields,
> + * otherwise don't.
>   */
>  void
>  logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
> -   TransactionId subxid)
> +   TransactionId subxid, XLogRecPtr abort_lsn,
> +   TimestampTz abort_time, bool abort_info)
> 
> In the comment, the name of the variable needs to be updated.

Fixed.

> 9.
> +TransactionId stream_xid = InvalidTransactionId;
> 
> -static TransactionId stream_xid = InvalidTransactionId;
> ...
> ...
> +void
> +parallel_apply_subxact_info_add(TransactionId current_xid)
> +{
> + if (current_xid != stream_xid &&
> + !list_member_xid(subxactlist, current_xid))
> 
> It seems you have changed the scope of stream_xid to use it in
> parallel_apply_subxact_info_add(). Won't it be better to pass it as a
> parameter (say top_xid)?

Improved.

> 10.
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -20,6 +20,7 @@
>  #include <sys/time.h>
> 
>  #include "access/xlog.h"
> +#include "catalog/pg_subscription.h"
>  #include "catalog/pg_type.h"
>  #include "common/connect.h"
>  #include "funcapi.h"
> @@ -443,9 +444,14 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>   appendStringInfo(&cmd, "proto_version '%u'",
>   options->proto.logical.proto_version);
> 
> - if (options->proto.logical.streaming &&
> - PQserverVersion(conn->streamConn) >= 140000)
> - appendStringInfoString(&cmd, ", streaming 'on'");
> + if (options->proto.logical.streaming != SUBSTREAM_OFF)
> + {
> + if (PQserverVersion(conn->streamConn) >= 160000 &&
> + options->proto.logical.streaming == SUBSTREAM_PARALLEL)
> + appendStringInfoString(&cmd, ", streaming 'parallel'");
> + else if (PQserverVersion(conn->streamConn) >= 140000)
> + appendStringInfoString(&cmd, ", streaming 'on'");
> + }
> 
> It doesn't seem like a good idea to expose subscription options here.
> Can we think of having char *streaming_option instead of the current
> streaming parameter which is filled by the caller and used here
> directly?

Improved.

> 11. The error message used in pgoutput_startup() seems to be better
> than the current messages used in that function but it is better to be
> consistent with other messages. There is a discussion in the email
> thread [1] on improving those messages, so kindly suggest there.

Okay, I will try to modify the two messages and share them in the thread you
mentioned.

> 12. In addition to the above, I have changed/added a few comments in
> the attached patch.

Improved as suggested.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Mon, Sept 19, 2022 11:26 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> 
> Improved as suggested.
> 

Thanks for updating the patch. Here are some comments on 0001 patch.

1.
+        case TRANS_LEADER_SERIALIZE:
 
-        oldctx = MemoryContextSwitchTo(ApplyContext);
+            /*
+             * Notify handle methods we're processing a remote in-progress
+             * transaction.
+             */
+            in_streamed_transaction = true;
 
-        MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
-        FileSetInit(MyLogicalRepWorker->stream_fileset);
+            /*
+             * Since no parallel apply worker is used for the first stream
+             * start, serialize all the changes of the transaction.
+             *
+             * Start a transaction on stream start, this transaction will be


It seems that the following comment can be removed after using switch case.
+             * Since no parallel apply worker is used for the first stream
+             * start, serialize all the changes of the transaction.

2.
+    switch (apply_action)
+    {
+        case TRANS_LEADER_SERIALIZE:
+            if (!in_streamed_transaction)
+                ereport(ERROR,
+                        (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                         errmsg_internal("STREAM STOP message without STREAM START")));

In apply_handle_stream_stop(), I think we can move this check to the beginning
of this function, to be consistent with other functions.

3. I think some of the changes in the 0005 patch can be merged into the 0001
patch, so that the 0005 patch only contains the changes for the new column
'apply_leader_pid'.

4.
+ * ParallelApplyWorkersList. After successfully, launching a new worker it's
+ * information is added to the ParallelApplyWorkersList. Once the worker

Should `it's` be `its` ?

Regards
Shi yu

FYI -

The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/include/replication/logicalproto.h:246
error: src/include/replication/logicalproto.h: patch does not apply

------
[1] https://github.com/postgres/postgres/commit/bfcf1b34805f70df48eedeec237230d0cc1154a6

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Sep 20, 2022 at 11:41 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote:
> On Mon, Sept 19, 2022 11:26 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com>
> wrote:
> >
> >
> > Improved as suggested.
> >
> 
> Thanks for updating the patch. Here are some comments on 0001 patch.

Thanks for your comments.

> 1.
> +        case TRANS_LEADER_SERIALIZE:
> 
> -        oldctx = MemoryContextSwitchTo(ApplyContext);
> +            /*
> +             * Notify handle methods we're processing a remote in-
> progress
> +             * transaction.
> +             */
> +            in_streamed_transaction = true;
> 
> -        MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
> -        FileSetInit(MyLogicalRepWorker->stream_fileset);
> +            /*
> +             * Since no parallel apply worker is used for the first
> stream
> +             * start, serialize all the changes of the transaction.
> +             *
> +             * Start a transaction on stream start, this transaction will
> be
> 
> 
> It seems that the following comment can be removed after using switch case.
> +             * Since no parallel apply worker is used for the first
> stream
> +             * start, serialize all the changes of the transaction.

Removed.

> 2.
> +    switch (apply_action)
> +    {
> +        case TRANS_LEADER_SERIALIZE:
> +            if (!in_streamed_transaction)
> +                ereport(ERROR,
> +
>     (errcode(ERRCODE_PROTOCOL_VIOLATION),
> +                         errmsg_internal("STREAM STOP
> message without STREAM START")));
> 
> In apply_handle_stream_stop(), I think we can move this check to the beginning
> of
> this function, to be consistent to other functions.

Improved as suggested.

> 3. I think the some of the changes in 0005 patch can be merged to 0001 patch,
> 0005 patch can only contain the changes about new column 'apply_leader_pid'.

Merged changes not related to 'apply_leader_pid' into patch 0001.

> 4.
> + * ParallelApplyWorkersList. After successfully, launching a new worker it's
> + * information is added to the ParallelApplyWorkersList. Once the worker
> 
> Should `it's` be `its` ?

Fixed.

Attach the new patch set.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
> FYI -
> 
> The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].
> 
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-
> parall.patch
> error: patch failed: src/include/replication/logicalproto.h:246
> error: src/include/replication/logicalproto.h: patch does not apply

Thanks for your kind reminder.

I rebased the patch set and attached them in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275298521AE1BBEF5A055EE9E4F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Sep 21, 2022 at 10:09 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> Attach the new patch set.

Because of the changes in HEAD (a932824), the patch set could not be applied
cleanly, so I rebased it.

Attach the new patch set.

Regards,
Wang wei

Attachment
Here are some review comments for patch v30-0001.

======

1. Commit message

In addition, the patch extends the logical replication STREAM_ABORT message so
that abort_time and abort_lsn can also be sent which can be used to update the
replication origin in parallel apply worker when the streaming transaction is
aborted. Because this message extension is needed to support parallel
streaming, meaning that parallel streaming is not supported for publications on
servers < PG16.

"meaning that parallel streaming is not supported" -> "parallel
streaming is not supported"

======

2. doc/src/sgml/logical-replication.sgml

@@ -1611,8 +1622,12 @@ CONTEXT:  processing remote data for
replication origin "pg_16395" during "INSER
    to the subscriber, plus some reserve for table synchronization.
    <varname>max_logical_replication_workers</varname> must be set to at least
    the number of subscriptions, again plus some reserve for the table
-   synchronization.  Additionally the <varname>max_worker_processes</varname>
-   may need to be adjusted to accommodate for replication workers, at least
+   synchronization. In addition, if the subscription parameter
+   <literal>streaming</literal> is set to <literal>parallel</literal>, please
+   increase <literal>max_logical_replication_workers</literal> according to
+   the desired number of parallel apply workers.  Additionally the
+   <varname>max_worker_processes</varname> may need to be adjusted to
+   accommodate for replication workers, at least
    (<varname>max_logical_replication_workers</varname>
    + <literal>1</literal>).  Note that some extensions and parallel queries
    also take worker slots from <varname>max_worker_processes</varname>.

IMO it looks a bit strange to have "In addition" followed by "Additionally".

Also, "to accommodate for replication workers"? seems like a typo (but
it is not caused by your patch)

BEFORE
In addition, if the subscription parameter streaming is set to
parallel, please increase max_logical_replication_workers according to
the desired number of parallel apply workers.

AFTER (???)
If the subscription parameter streaming is set to parallel,
max_logical_replication_workers should be increased according to the
desired number of parallel apply workers.

======

3. .../replication/logical/applyparallelworker.c - parallel_apply_can_start

+/*
+ * Returns true, if it is allowed to start a parallel apply worker, false,
+ * otherwise.
+ */
+static bool
+parallel_apply_can_start(TransactionId xid)

Seems a slightly complicated comment for a simple boolean function.

SUGGESTION
Returns true/false if it is OK to start a parallel apply worker.

======

4. .../replication/logical/applyparallelworker.c - parallel_apply_free_worker

+ winfo->in_use = false;
+
+ /* Are there enough workers in the pool? */
+ if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+ {

I felt the comment/logic about "enough" needs a bit more description.
At least it should say to refer to the more detailed explanation atop
worker.c

======

5. .../replication/logical/applyparallelworker.c - parallel_apply_setup_dsm

+ /*
+ * Estimate how much shared memory we need.
+ *
+ * Because the TOC machinery may choose to insert padding of oddly-sized
+ * requests, we must estimate each chunk separately.
+ *
+ * We need one key to register the location of the header, and we need two
+ * other keys to track of the locations of the message queue and the error
+ * message queue.
+ */

"track of" -> "keep track of" ?

======

6. src/backend/replication/logical/launcher.c  - logicalrep_worker_detach

 logicalrep_worker_detach(void)
 {
+ /* Stop the parallel apply workers. */
+ if (!am_parallel_apply_worker() && !am_tablesync_worker())
+ {
+ List    *workers;
+ ListCell   *lc;

The condition is not very obvious. This is why I previously suggested
adding another macro/function like 'isLeaderApplyWorker'. In the
absence of that, then I think the comment needs to be more
descriptive.

SUGGESTION
If this is the leader apply worker then stop the parallel apply workers.

======

7. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

 void
 logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
-   TransactionId subxid)
+   TransactionId subxid, XLogRecPtr abort_lsn,
+   TimestampTz abort_time, bool abort_info)
 {
  pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_ABORT);

@@ -1175,19 +1179,40 @@ logicalrep_write_stream_abort(StringInfo out,
TransactionId xid,
  /* transaction ID */
  pq_sendint32(out, xid);
  pq_sendint32(out, subxid);
+
+ if (abort_info)
+ {
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, abort_time);
+ }


The new param name 'abort_info' seems misleading.

Maybe a name like 'write_abort_info' is better?

~~~

8. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

+logicalrep_read_stream_abort(StringInfo in,
+ LogicalRepStreamAbortData *abort_data,
+ bool read_abort_lsn)
 {
- Assert(xid && subxid);
+ Assert(abort_data);
+
+ abort_data->xid = pq_getmsgint(in, 4);
+ abort_data->subxid = pq_getmsgint(in, 4);

- *xid = pq_getmsgint(in, 4);
- *subxid = pq_getmsgint(in, 4);
+ if (read_abort_lsn)
+ {
+ abort_data->abort_lsn = pq_getmsgint64(in);
+ abort_data->abort_time = pq_getmsgint64(in);
+ }

This name 'read_abort_lsn' is inconsistent with the 'abort_info' of
the logicalrep_write_stream_abort.

I suggest change these to 'read_abort_info/write_abort_info'

======

9. src/backend/replication/logical/worker.c - file header comment

+ * information is added to the ParallelApplyWorkersList. Once the worker
+ * finishes applying the transaction, we mark it available for use. Now,
+ * before starting a new worker to apply the streaming transaction, we check
+ * the list and use any worker, if available. Note that we maintain a maximum

9a.
"available for use." -> "available for re-use."

~

9b.
"we check the list and use any worker, if available" -> "we check the
list for any available worker"

~~~

10. src/backend/replication/logical/worker.c - handle_streamed_transaction

+ /* write the change to the current file */
+ stream_write_change(action, s);
+ return true;

Uppercase the comment.

~~~

11. src/backend/replication/logical/worker.c - apply_handle_stream_abort

+static void
+apply_handle_stream_abort(StringInfo s)
+{
+ TransactionId xid;
+ TransactionId subxid;
+ LogicalRepStreamAbortData abort_data;
+ bool read_abort_lsn = false;
+ ParallelApplyWorkerInfo *winfo = NULL;
+ TransApplyAction apply_action;

The variable 'read_abort_lsn' name ought to be changed to match
consistently the parameter name.

======

12. src/backend/replication/pgoutput/pgoutput.c - pgoutput_stream_abort

@@ -1843,6 +1850,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
    XLogRecPtr abort_lsn)
 {
  ReorderBufferTXN *toptxn;
+ PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ bool abort_info = (data->streaming == SUBSTREAM_PARALLEL);

The variable 'abort_info' name ought to be changed to be
'write_abort_info' (as suggested above) to match consistently the
parameter name.

======

13. src/include/replication/worker_internal.h

+ /*
+ * Indicates whether the worker is available to be used for parallel apply
+ * transaction?
+ */
+ bool in_use;

This comment seems backward for this member's name.

SUGGESTION (something like...)
Indicates whether this ParallelApplyWorkerInfo is currently being used
by a parallel apply worker processing a transaction. (If this flag is
false then it means the ParallelApplyWorkerInfo is available for
re-use by another parallel apply worker.)


------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Wed, Sep 21, 2022 at 2:55 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> ======
>
> 3. .../replication/logical/applyparallelworker.c - parallel_apply_can_start
>
> +/*
> + * Returns true, if it is allowed to start a parallel apply worker, false,
> + * otherwise.
> + */
> +static bool
> +parallel_apply_can_start(TransactionId xid)
>
> Seems a slightly complicated comment for a simple boolean function.
>
> SUGGESTION
> Returns true/false if it is OK to start a parallel apply worker.
>

I think this is the style followed at some other places as well. So,
we can leave it.

>
> 6. src/backend/replication/logical/launcher.c  - logicalrep_worker_detach
>
>  logicalrep_worker_detach(void)
>  {
> + /* Stop the parallel apply workers. */
> + if (!am_parallel_apply_worker() && !am_tablesync_worker())
> + {
> + List    *workers;
> + ListCell   *lc;
>
> The condition is not very obvious. This is why I previously suggested
> adding another macro/function like 'isLeaderApplyWorker'.
>

How about having a function am_leader_apply_worker() { ...
return OidIsValid(MyLogicalRepWorker->relid) &&
(MyLogicalRepWorker->apply_leader_pid == InvalidPid) ...}?

>
> 13. src/include/replication/worker_internal.h
>
> + /*
> + * Indicates whether the worker is available to be used for parallel apply
> + * transaction?
> + */
> + bool in_use;
>
> This comment seems backward for this member's name.
>
> SUGGESTION (something like...)
> Indicates whether this ParallelApplyWorkerInfo is currently being used
> by a parallel apply worker processing a transaction. (If this flag is
> false then it means the ParallelApplyWorkerInfo is available for
> re-use by another parallel apply worker.)
>

I am not sure if this is an improvement over the current. The current
comment appears reasonable to me as it is easy to follow.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Sep 21, 2022 at 17:25 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for patch v30-0001.

Thanks for your comments.

> ======
> 
> 1. Commit message
> 
> In addition, the patch extends the logical replication STREAM_ABORT message
> so
> that abort_time and abort_lsn can also be sent which can be used to update the
> replication origin in parallel apply worker when the streaming transaction is
> aborted. Because this message extension is needed to support parallel
> streaming, meaning that parallel streaming is not supported for publications on
> servers < PG16.
> 
> "meaning that parallel streaming is not supported" -> "parallel
> streaming is not supported"

Improved as suggested.

> ======
> 
> 2. doc/src/sgml/logical-replication.sgml
> 
> @@ -1611,8 +1622,12 @@ CONTEXT:  processing remote data for
> replication origin "pg_16395" during "INSER
>     to the subscriber, plus some reserve for table synchronization.
>     <varname>max_logical_replication_workers</varname> must be set to at
> least
>     the number of subscriptions, again plus some reserve for the table
> -   synchronization.  Additionally the
> <varname>max_worker_processes</varname>
> -   may need to be adjusted to accommodate for replication workers, at least
> +   synchronization. In addition, if the subscription parameter
> +   <literal>streaming</literal> is set to <literal>parallel</literal>, please
> +   increase <literal>max_logical_replication_workers</literal> according to
> +   the desired number of parallel apply workers.  Additionally the
> +   <varname>max_worker_processes</varname> may need to be adjusted to
> +   accommodate for replication workers, at least
>     (<varname>max_logical_replication_workers</varname>
>     + <literal>1</literal>).  Note that some extensions and parallel queries
>     also take worker slots from <varname>max_worker_processes</varname>.
> 
> IMO it looks a bit strange to have "In addition" followed by "Additionally".
> 
> Also, "to accommodate for replication workers"? seems like a typo (but
> it is not caused by your patch)
> 
> BEFORE
> In addition, if the subscription parameter streaming is set to
> parallel, please increase max_logical_replication_workers according to
> the desired number of parallel apply workers.
> 
> AFTER (???)
> If the subscription parameter streaming is set to parallel,
> max_logical_replication_workers should be increased according to the
> desired number of parallel apply workers.

=> Reword
Improved as suggested.

=> typo?
Sorry, I am not sure. Do you mean
s/replication workers/workers for subscriptions/  or something else?
I think we should improve it in a new thread.

> ======
> 
> 4. .../replication/logical/applyparallelworker.c - parallel_apply_free_worker
> 
> + winfo->in_use = false;
> +
> + /* Are there enough workers in the pool? */
> + if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> + {
> 
> I felt the comment/logic about "enough" needs a bit more description.
> At least it should say to refer to the more detailed explanation atop
> worker.c

Added related comment atop this function.

> ======
> 
> 5. .../replication/logical/applyparallelworker.c - parallel_apply_setup_dsm
> 
> + /*
> + * Estimate how much shared memory we need.
> + *
> + * Because the TOC machinery may choose to insert padding of oddly-sized
> + * requests, we must estimate each chunk separately.
> + *
> + * We need one key to register the location of the header, and we need two
> + * other keys to track of the locations of the message queue and the error
> + * message queue.
> + */
> 
> "track of" -> "keep track of" ?

Improved.

> ======
> 
> 6. src/backend/replication/logical/launcher.c  - logicalrep_worker_detach
> 
>  logicalrep_worker_detach(void)
>  {
> + /* Stop the parallel apply workers. */
> + if (!am_parallel_apply_worker() && !am_tablesync_worker())
> + {
> + List    *workers;
> + ListCell   *lc;
> 
> The condition is not very obvious. This is why I previously suggested
> adding another macro/function like 'isLeaderApplyWorker'. In the
> absence of that, then I think the comment needs to be more
> descriptive.
> 
> SUGGESTION
> If this is the leader apply worker then stop the parallel apply workers.

Added the new function am_leader_apply_worker.
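
It is just a thin wrapper over the existing checks, something like this (a
sketch; the actual patch code may differ slightly):

```c
/* Am I the leader apply worker? (illustrative sketch) */
static inline bool
am_leader_apply_worker(void)
{
    return (!am_parallel_apply_worker() && !am_tablesync_worker());
}
```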

> ======
> 
> 7. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort
> 
>  void
>  logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
> -   TransactionId subxid)
> +   TransactionId subxid, XLogRecPtr abort_lsn,
> +   TimestampTz abort_time, bool abort_info)
>  {
>   pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_ABORT);
> 
> @@ -1175,19 +1179,40 @@ logicalrep_write_stream_abort(StringInfo out,
> TransactionId xid,
>   /* transaction ID */
>   pq_sendint32(out, xid);
>   pq_sendint32(out, subxid);
> +
> + if (abort_info)
> + {
> + pq_sendint64(out, abort_lsn);
> + pq_sendint64(out, abort_time);
> + }
> 
> 
> The new param name 'abort_info' seems misleading.
> 
> Maybe a name like 'write_abort_info' is better?

Improved as suggested.

> ~~~
> 
> 8. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort
> 
> +logicalrep_read_stream_abort(StringInfo in,
> + LogicalRepStreamAbortData *abort_data,
> + bool read_abort_lsn)
>  {
> - Assert(xid && subxid);
> + Assert(abort_data);
> +
> + abort_data->xid = pq_getmsgint(in, 4);
> + abort_data->subxid = pq_getmsgint(in, 4);
> 
> - *xid = pq_getmsgint(in, 4);
> - *subxid = pq_getmsgint(in, 4);
> + if (read_abort_lsn)
> + {
> + abort_data->abort_lsn = pq_getmsgint64(in);
> + abort_data->abort_time = pq_getmsgint64(in);
> + }
> 
> This name 'read_abort_lsn' is inconsistent with the 'abort_info' of
> the logicalrep_write_stream_abort.
> 
> I suggest change these to 'read_abort_info/write_abort_info'

Improved as suggested.

> ======
> 
> 9. src/backend/replication/logical/worker.c - file header comment
> 
> + * information is added to the ParallelApplyWorkersList. Once the worker
> + * finishes applying the transaction, we mark it available for use. Now,
> + * before starting a new worker to apply the streaming transaction, we check
> + * the list and use any worker, if available. Note that we maintain a maximum
> 
> 9a.
> "available for use." -> "available for re-use."
> 
> ~
> 
> 9b.
> "we check the list and use any worker, if available" -> "we check the
> list for any available worker"

Improved as suggested.

> ~~~
> 
> 10. src/backend/replication/logical/worker.c - handle_streamed_transaction
> 
> + /* write the change to the current file */
> + stream_write_change(action, s);
> + return true;
> 
> Uppercase the comment.

Improved as suggested.

> ~~~
> 
> 11. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> +static void
> +apply_handle_stream_abort(StringInfo s)
> +{
> + TransactionId xid;
> + TransactionId subxid;
> + LogicalRepStreamAbortData abort_data;
> + bool read_abort_lsn = false;
> + ParallelApplyWorkerInfo *winfo = NULL;
> + TransApplyAction apply_action;
> 
> The variable 'read_abort_lsn' name ought to be changed to match
> consistently the parameter name.

Improved as suggested.

> ======
> 
> 12. src/backend/replication/pgoutput/pgoutput.c - pgoutput_stream_abort
> 
> @@ -1843,6 +1850,8 @@ pgoutput_stream_abort(struct
> LogicalDecodingContext *ctx,
>     XLogRecPtr abort_lsn)
>  {
>   ReorderBufferTXN *toptxn;
> + PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
> + bool abort_info = (data->streaming == SUBSTREAM_PARALLEL);
> 
> The variable 'abort_info' name ought to be changed to be
> 'write_abort_info' (as suggested above) to match consistently the
> parameter name.

Improved as suggested.

Attach the new patch set.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

Thanks for updating the patch! Followings are comments for v33-0001.

===
libpqwalreceiver.c

01. inclusion

```
+#include "catalog/pg_subscription.h"
```

We don't have to include it because the analysis of parameters is done at caller.

===
launcher.c

02. logicalrep_worker_launch()

```
+       /*
+        * Return silently if the number of parallel apply workers reached the
+        * limit per subscription.
+        */
+       if (is_subworker && nparallelapplyworkers >= max_parallel_apply_workers_per_subscription)
```

a. 
I felt that it might be kind if we output some debug messages.

b.
The if statement seems to be more than 80 characters. You can move to new line around "nparallelapplyworkers >= ...".


===
applyparallelworker.c

03. declaration

```
+/*
+ * Is there a message pending in parallel apply worker which we need to
+ * receive?
+ */
+volatile bool ParallelApplyMessagePending = false;
```

I checked other flags that are set by signal handlers, their datatype seemed to be sig_atomic_t.
Is there any reasons that you use normal bool? It should be changed if not.

04. HandleParallelApplyMessages()

```
+               if (winfo->error_mq_handle == NULL)
+                       continue;
```

a.
I was not sure when the cell should be cleaned. Currently we clean up
ParallelApplyWorkersList() only in the parallel_apply_start_worker(), but we
have chances to remove such a cell like HandleParallelApplyMessages() or
HandleParallelApplyMessage(). How do you think?

b.
Comments should be added even if we keep this, like "exited worker, skipped".

```
+               else
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                        errmsg("lost connection to the leader apply worker")));
```

c.
This function is called on the leader apply worker, so the hint should be "lost connection to the parallel apply
worker".

05. parallel_apply_setup_worker()

``
+       if (launched)
+       {
+               ParallelApplyWorkersList = lappend(ParallelApplyWorkersList, winfo);
+       }
```

{} should be removed.


06. parallel_apply_wait_for_xact_finish()

```
+               /* If any workers have died, we have failed. */
```

This function checked only about a parallel apply worker, so the comment should be "if worker has..."?

===
worker.c

07. handle_streamed_transaction()

```
+ * For non-streamed transactions, returns false;
```

"returns false;" -> "returns false"

apply_handle_commit_prepared(), apply_handle_abort_prepared()

These functions are not expected to be called by a parallel worker,
so I think Assert() should be added.

08. UpdateWorkerStats()

```
-static void
+void
 UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
```

This function is called only in worker.c, should be static.

09. subscription_change_cb()

```
-static void
+void
 subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
```

This function is called only in worker.c, should be static.

10. InitializeApplyWorker()

```
+/*
+ * Initialize the database connection, in-memory subscription and necessary
+ * config options.
+ */
 void
-ApplyWorkerMain(Datum main_arg)
+InitializeApplyWorker(void)
```

Some comments should be added to note that this is a common part of the leader and parallel apply workers.

===
logicalrepworker.h

11. declaration

```
extern PGDLLIMPORT volatile bool ParallelApplyMessagePending;
```

Please refer above comment.

===
guc_tables.c

12. ConfigureNamesInt

```
+       {
+               {"max_parallel_apply_workers_per_subscription",
+                       PGC_SIGHUP,
+                       REPLICATION_SUBSCRIBERS,
+                       gettext_noop("Maximum number of parallel apply workers per subscription."),
+                       NULL,
+               },
+               &max_parallel_apply_workers_per_subscription,
+               2, 0, MAX_BACKENDS,
+               NULL, NULL, NULL
+       },
```

This parameter can be changed by pg_ctl reload, so the following corner case may be occurred.
Should we add a assign hook to handle this? Or, can we ignore it?

1. set max_parallel_apply_workers_per_subscription to 4.
2. start replicating two streaming transactions.
3. commit transactions
=== Two parallel workers will remain ===
4. change max_parallel_apply_workers_per_subscription to 3
5. We expect that only one worker will remain, but two parallel workers remain.
   They will not be stopped until another streamed transaction is started and committed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, September 22, 2022 4:08 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Thanks for updating the patch! Followings are comments for v33-0001.

Thanks for the comments.

> 04. HandleParallelApplyMessages()
> 
> ```
> +               if (winfo->error_mq_handle == NULL)
> +                       continue;
> ```
> 
> a.
> I was not sure when the cell should be cleaned. Currently we clean up
> ParallelApplyWorkersList() only in the parallel_apply_start_worker(), but we
> have chances to remove such a cell like HandleParallelApplyMessages() or
> HandleParallelApplyMessage(). How do you think?

The HandleParallelApplyXxx functions are signal callback functions, so I think
it is unsafe to clean up list cells there, because those cells may already be
in use before entering these callbacks.


> 
> 05. parallel_apply_setup_worker()
> 
> ``
> +       if (launched)
> +       {
> +               ParallelApplyWorkersList = lappend(ParallelApplyWorkersList,
> winfo);
> +       }
> ```
> 
> {} should be removed.

I think this style is fine; it was also suggested earlier to keep it consistent
with the else {} part.


> 
> 06. parallel_apply_wait_for_xact_finish()
> 
> ```
> +               /* If any workers have died, we have failed. */
> ```
> 
> This function checked only about a parallel apply worker, so the comment
> should be "if worker has..."?

The comment seems clear to me as it's a general one.

Best regards,
Hou zj


On Thu, Sep 22, 2022 at 1:37 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> ===
> applyparallelworker.c
>
> 03. declaration
>
> ```
> +/*
> + * Is there a message pending in parallel apply worker which we need to
> + * receive?
> + */
> +volatile bool ParallelApplyMessagePending = false;
> ```
>
> I checked other flags that are set by signal handlers, their datatype seemed to be sig_atomic_t.
> Is there any reasons that you use normal bool? It should be changed if not.
>

It follows the logic similar to ParallelMessagePending. Do you see any
problem with it?

> 04. HandleParallelApplyMessages()
>
> ```
> +               if (winfo->error_mq_handle == NULL)
> +                       continue;
> ```
>
> a.
> I was not sure when the cell should be cleaned. Currently we clean up
> ParallelApplyWorkersList() only in the parallel_apply_start_worker(),
> but we have chances to remove such a cell like HandleParallelApplyMessages() or
> HandleParallelApplyMessage(). How do you think?
>

Note that HandleParallelApply* are invoked during interrupt handling,
so it may not be advisable to remove it there.

>
> 12. ConfigureNamesInt
>
> ```
> +       {
> +               {"max_parallel_apply_workers_per_subscription",
> +                       PGC_SIGHUP,
> +                       REPLICATION_SUBSCRIBERS,
> +                       gettext_noop("Maximum number of parallel apply workers per subscription."),
> +                       NULL,
> +               },
> +               &max_parallel_apply_workers_per_subscription,
> +               2, 0, MAX_BACKENDS,
> +               NULL, NULL, NULL
> +       },
> ```
>
> This parameter can be changed by pg_ctl reload, so the following corner case may be occurred.
> Should we add a assign hook to handle this? Or, can we ignore it?
>

I think we can ignore this as it will eventually start respecting the threshold.

-- 
With Regards,
Amit Kapila.



On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>

Few comments on v33-0001
=======================
1.
+ else if (data->streaming == SUBSTREAM_PARALLEL &&
+ data->protocol_version < LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("requested proto_version=%d does not support
streaming=parallel mode, need %d or higher",
+ data->protocol_version, LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)));

I think we can improve this error message as: "requested
proto_version=%d does not support parallel streaming mode, need %d or
higher".

2.
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3184,7 +3184,7 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
       </para>
       <para>
        OID of the relation that the worker is synchronizing; null for the
-       main apply worker
+       main apply worker and the apply parallel worker
       </para></entry>
      </row>

This and other changes in monitoring.sgml refers the workers as "apply
parallel worker". Isn't it better to use parallel apply worker as we
are using at other places in the patch? But, I have another question,
do we really need to display entries for parallel apply workers in
pg_stat_subscription if it doesn't have any meaningful information? I
think we can easily avoid it in pg_stat_get_subscription by checking
apply_leader_pid.

3.
ApplyWorkerMain()
{
...
...
+
+ if (server_version >= 160000 &&
+ MySubscription->stream == SUBSTREAM_PARALLEL)
+ options.proto.logical.streaming = pstrdup("parallel");

After deciding here whether the parallel streaming mode is enabled or
not, we recheck the same thing in apply_handle_stream_abort() and
parallel_apply_can_start(). In parallel_apply_can_start(), we do it
via two different checks. How about storing this information say in
structure MyLogicalRepWorker in ApplyWorkerMain() and then use it at
other places?

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Hi Amit,

> > I checked other flags that are set by signal handlers, their datatype seemed to
> be sig_atomic_t.
> > Is there any reasons that you use normal bool? It should be changed if not.
> >
> 
> It follows the logic similar to ParallelMessagePending. Do you see any
> problem with it?

Hmm, one consideration is:
what will happen if the signal handler HandleParallelApplyMessageInterrupt() is fired during
"ParallelApplyMessagePending = false;"?

IIUC, sig_atomic_t is needed to avoid writing to the same data at the same time.

According to the C99 standard (I grepped the draft version [1]), the behavior seems to be undefined
if the signal handler accesses data that is not declared as "volatile sig_atomic_t".
...But I'm not sure whether this is really problematic in the current system, sorry...

```
If the signal occurs other than as the result of calling the abort or raise function,
the behavior is undefined if the signal handler refers to any object with static storage duration other than by
assigning a value to an object declared as volatile sig_atomic_t,
or the signal handler calls any function in the standard library other than the abort function,
the _Exit function, or the signal function with the first argument equal to the signal number corresponding to the
signal that caused the invocation of the handler.
```
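
In other words, the safe shape the standard has in mind is roughly the following
(a simplified sketch, not the actual patch code):

```c
/* A flag written from a signal handler must be volatile sig_atomic_t. */
static volatile sig_atomic_t ParallelApplyMessagePending = false;

/* Signal handler: only set flags and the latch; the real work happens later. */
static void
HandleParallelApplyMessageInterrupt(SIGNAL_ARGS)
{
    ParallelApplyMessagePending = true;
    SetLatch(MyLatch);
}
```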

> > a.
> > I was not sure when the cell should be cleaned. Currently we clean up
> ParallelApplyWorkersList() only in the parallel_apply_start_worker(),
> > but we have chances to remove such a cell like HandleParallelApplyMessages()
> or HandleParallelApplyMessage(). How do you think?
> >
> 
> Note that HandleParallelApply* are invoked during interrupt handling,
> so it may not be advisable to remove it there.
> 
> >
> > 12. ConfigureNamesInt
> >
> > ```
> > +       {
> > +               {"max_parallel_apply_workers_per_subscription",
> > +                       PGC_SIGHUP,
> > +                       REPLICATION_SUBSCRIBERS,
> > +                       gettext_noop("Maximum number of parallel apply
> workers per subscription."),
> > +                       NULL,
> > +               },
> > +               &max_parallel_apply_workers_per_subscription,
> > +               2, 0, MAX_BACKENDS,
> > +               NULL, NULL, NULL
> > +       },
> > ```
> >
> > This parameter can be changed by pg_ctl reload, so the following corner case
> may be occurred.
> > Should we add a assign hook to handle this? Or, can we ignore it?
> >
> 
> I think we can ignore this as it will eventually start respecting the threshold.

What both of you said sounds reasonable to me.

[1]: https://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Thu, Sep 22, 2022 at 4:50 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Hi Amit,
>
> > > I checked other flags that are set by signal handlers, their datatype seemed to
> > be sig_atomic_t.
> > > Is there any reasons that you use normal bool? It should be changed if not.
> > >
> >
> > It follows the logic similar to ParallelMessagePending. Do you see any
> > problem with it?
>
> Hmm, one consideration is:
> what will happen if the signal handler HandleParallelApplyMessageInterrupt() is fired during
> "ParallelApplyMessagePending = false;"?
>
> IIUC, sig_atomic_t is needed to avoid writing to the same data at the same time.
>

But we do call HOLD_INTERRUPTS() before we do
"ParallelApplyMessagePending = false;", so that should not happen.
However, I think it would be better to use sig_atomic_t here for the
sake of consistency.
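
That is, the consuming side is shaped roughly like this (a simplified sketch of
the HandleParallelMessages()-style pattern, not the patch code):

```c
/* Reached from CHECK_FOR_INTERRUPTS(), not from the signal handler itself. */
HOLD_INTERRUPTS();

ParallelApplyMessagePending = false;

/* ... drain the error queues of the attached parallel apply workers ... */

RESUME_INTERRUPTS();
```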

I think you can start a separate thread to check if we can change
ParallelMessagePending to make it consistent with other such
variables.

-- 
With Regards,
Amit Kapila.



On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
>
> Few comments on v33-0001
> =======================
>

Some more comments on v33-0001
=============================
1.
+ /* Information from the corresponding LogicalRepWorker slot. */
+ uint16 logicalrep_worker_generation;
+
+ int logicalrep_worker_slot_no;
+} ParallelApplyWorkerShared;

Both these variables are read/changed by leader/parallel workers
without using any lock (mutex). It seems currently there is no problem
because of the way the patch is using in_parallel_apply_xact but I
think it won't be a good idea to rely on it. I suggest using mutex to
operate on these variables and also check if the slot_no is in a valid
range after reading it in parallel_apply_free_worker, otherwise error
out using elog.

2.
 static void
 apply_handle_stream_stop(StringInfo s)
 {
- if (!in_streamed_transaction)
+ ParallelApplyWorkerInfo *winfo = NULL;
+ TransApplyAction apply_action;
+
+ if (!am_parallel_apply_worker() &&
+ (!in_streamed_transaction && !stream_apply_worker))
  ereport(ERROR,
  (errcode(ERRCODE_PROTOCOL_VIOLATION),
  errmsg_internal("STREAM STOP message without STREAM START")));

This check won't be able to detect missing stream start messages for
parallel apply workers apart from the first pair of start/stop. I
thought of adding in_remote_transaction check along with
am_parallel_apply_worker() to detect the same but that also won't work
because the parallel worker doesn't reset it at the stop message.
Another possibility is to introduce yet another variable for this but
that doesn't seem worth it. I would like to keep this check simple.
Can you think of any better way?

3. I think we can skip sending start/stop messages from the leader to
the parallel worker because unlike apply worker it will process only
one transaction-at-a-time. However, it is not clear whether that is
worth the effort because it is sent after logical_decoding_work_mem
changes. For now, I have added a comment for this in the attached
patch but let me know if I am missing something or if I am wrong.

4.
postgres=# select pid, leader_pid, application_name, backend_type from
pg_stat_activity;
  pid  | leader_pid | application_name |         backend_type
-------+------------+------------------+------------------------------
 27624 |            |                  | logical replication launcher
 17336 |            | psql             | client backend
 26312 |            |                  | logical replication worker
 26376 |            | psql             | client backend
 14004 |            |                  | logical replication worker

Here, the second worker entry is for the parallel worker. Isn't it
better if we distinguish this by keeping type as a logical replication
parallel worker? I think for this you need to change bgw_type in
logicalrep_worker_launch().
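
e.g. something along these lines in logicalrep_worker_launch() (untested; the
flag name is only illustrative):

```c
/* Give parallel apply workers a distinct backend type (illustrative sketch). */
if (is_parallel_apply_worker)
    snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication parallel worker");
else
    snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication worker");
```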

5. Can we name parallel_apply_subxact_info_add() as
parallel_apply_start_subtrans()?

Apart from the above, I have added/edited a few comments and made a
few other cosmetic changes in the attached.

-- 
With Regards,
Amit Kapila.

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Sep 22, 2022 at 16:08 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
> 
> Thanks for updating the patch! Followings are comments for v33-0001.

Thanks for your comments.

> ===
> libpqwalreceiver.c
> 
> 01. inclusion
> 
> ```
> +#include "catalog/pg_subscription.h"
> ```
> 
> We don't have to include it because the analysis of parameters is done at caller.
> 
> ===
> launcher.c

Improved.

> 02. logicalrep_worker_launch()
> 
> ```
> +       /*
> +        * Return silently if the number of parallel apply workers reached the
> +        * limit per subscription.
> +        */
> +       if (is_subworker && nparallelapplyworkers >=
> max_parallel_apply_workers_per_subscription)
> ```
> 
> a.
> I felt that it might be kind if we output some debug messages.
> 
> b.
> The if statement seems to be more than 80 characters. You can move to new
> line around "nparallelapplyworkers >= ...".

Improved.

> ===
> applyparallelworker.c
> 
> 03. declaration
> 
> ```
> +/*
> + * Is there a message pending in parallel apply worker which we need to
> + * receive?
> + */
> +volatile bool ParallelApplyMessagePending = false;
> ```
> 
> I checked other flags that are set by signal handlers, their datatype seemed to
> be sig_atomic_t.
> Is there any reasons that you use normal bool? It should be changed if not.

Improved.

> 04. HandleParallelApplyMessages()
> 
> ```
> +               if (winfo->error_mq_handle == NULL)
> +                       continue;
> ```
> 
> a.
> I was not sure when the cell should be cleaned. Currently we clean up
> ParallelApplyWorkersList() only in the parallel_apply_start_worker(),
> but we have chances to remove such a cell like HandleParallelApplyMessages()
> or HandleParallelApplyMessage(). How do you think?
> 
> b.
> Comments should be added even if we keep this, like "exited worker, skipped".
> 
> ```
> +               else
> +                       ereport(ERROR,
> +
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                        errmsg("lost connection to the leader apply worker")));
> ```
> 
> c.
> This function is called on the leader apply worker, so the hint should be "lost
> connection to the parallel apply worker".

=>b.
Added the following comment according to your suggestion.
`Skip if worker has exited`

=>c.
Fixed.

> ===
> worker.c
> 
> 07. handle_streamed_transaction()
> 
> ```
> + * For non-streamed transactions, returns false;
> ```
> 
> "returns false;" -> "returns false"

Improved. I changed the semicolon to a period.

> apply_handle_commit_prepared(), apply_handle_abort_prepared()
> 
> These functions are not expected to be called by a parallel worker,
> so I think Assert() should be added.

I am not sure if this modification is necessary since we do not modify the
non-streamed transaction related messages like "COMMIT PREPARED" or "ROLLBACK
PREPARED".

> 08. UpdateWorkerStats()
> 
> ```
> -static void
> +void
>  UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
> ```
> 
> This function is called only in worker.c, should be static.
> 
> 09. subscription_change_cb()
> 
> ```
> -static void
> +void
>  subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
> ```
> 
> This function is called only in worker.c, should be static.

Improved.

> 10. InitializeApplyWorker()
> 
> ```
> +/*
> + * Initialize the database connection, in-memory subscription and necessary
> + * config options.
> + */
>  void
> -ApplyWorkerMain(Datum main_arg)
> +InitializeApplyWorker(void)
> ```
> 
> Some comments should be added to note that this is a common part of the leader
> and parallel apply workers.

Added the following comment:
`The common initialization for leader apply worker and parallel apply worker.`

> ===
> logicalrepworker.h
> 
> 11. declaration
> 
> ```
> extern PGDLLIMPORT volatile bool ParallelApplyMessagePending;
> ```
> 
> Please refer above comment.
> 
> ===
> guc_tables.c

Improved.

Also rebased the patch set based on the changes in HEAD (26f7802).

Attach the new patch set.

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Sep 22, 2022 at 18:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments on v33-0001
> =======================

Thanks for your comments.

> 1.
> + else if (data->streaming == SUBSTREAM_PARALLEL &&
> + data->protocol_version <
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("requested proto_version=%d does not support
> streaming=parallel mode, need %d or higher",
> + data->protocol_version,
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)));
> 
> I think we can improve this error message as: "requested
> proto_version=%d does not support parallel streaming mode, need %d or
> higher".

Improved.

> 2.
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -3184,7 +3184,7 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>        </para>
>        <para>
>         OID of the relation that the worker is synchronizing; null for the
> -       main apply worker
> +       main apply worker and the apply parallel worker
>        </para></entry>
>       </row>
> 
> This and other changes in monitoring.sgml refers the workers as "apply
> parallel worker". Isn't it better to use parallel apply worker as we
> are using at other places in the patch? But, I have another question,
> do we really need to display entries for parallel apply workers in
> pg_stat_subscription if it doesn't have any meaningful information? I
> think we can easily avoid it in pg_stat_get_subscription by checking
> apply_leader_pid.

Makes sense. Improved as suggested.
Parallel apply worker related information is not displayed in this view after
applying the 0001 patch, but entries for parallel apply workers are displayed
after applying the 0005 patch.

> 3.
> ApplyWorkerMain()
> {
> ...
> ...
> +
> + if (server_version >= 160000 &&
> + MySubscription->stream == SUBSTREAM_PARALLEL)
> + options.proto.logical.streaming = pstrdup("parallel");
> 
> After deciding here whether the parallel streaming mode is enabled or
> not, we recheck the same thing in apply_handle_stream_abort() and
> parallel_apply_can_start(). In parallel_apply_can_start(), we do it
> via two different checks. How about storing this information say in
> structure MyLogicalRepWorker in ApplyWorkerMain() and then use it at
> other places?

Improved as suggested.
Added a new flag "in_parallel_apply" to structure MyLogicalRepWorker.

Because the patch set could not be applied cleanly, I rebased it and am sharing
it for review.
I have not addressed the comments you posted in [1]. I will share a new patch
set when I finish them.

The new patches were attached in [2].

[1] - https://www.postgresql.org/message-id/CAA4eK1KjGNA8T8O77rRhkv6bRT6OsdQaEy--2hNrJFCc80bN0A%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB6275F4A7CA186412E2FF2ED29E529%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang, 

Thanks for updating the patch! ...but cfbot says that it fails [1].
I think the header <signal.h> should be included, as in miscadmin.h.

[1]: https://cirrus-ci.com/task/5909508684775424

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Mon, Sep 26, 2022 at 8:41 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Thur, Sep 22, 2022 at 18:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 3.
> > ApplyWorkerMain()
> > {
> > ...
> > ...
> > +
> > + if (server_version >= 160000 &&
> > + MySubscription->stream == SUBSTREAM_PARALLEL)
> > + options.proto.logical.streaming = pstrdup("parallel");
> >
> > After deciding here whether the parallel streaming mode is enabled or
> > not, we recheck the same thing in apply_handle_stream_abort() and
> > parallel_apply_can_start(). In parallel_apply_can_start(), we do it
> > via two different checks. How about storing this information say in
> > structure MyLogicalRepWorker in ApplyWorkerMain() and then use it at
> > other places?
>
> Improved as suggested.
> Added a new flag "in_parallel_apply" to structure MyLogicalRepWorker.
>

Can we name the variable in_parallel_apply as parallel_apply and set
it in logicalrep_worker_launch() instead of in
ParallelApplyWorkerMain()?

Few other comments:
==================
1.
+ if (is_subworker &&
+ nparallelapplyworkers >= max_parallel_apply_workers_per_subscription)
+ {
+ LWLockRelease(LogicalRepWorkerLock);
+
+ ereport(DEBUG1,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("out of parallel apply workers"),
+ errhint("You might need to increase
max_parallel_apply_workers_per_subscription.")));

I think it is better to keep the level of this as LOG. Similar
messages at other places use WARNING or LOG. Here, I prefer LOG
because the system can still proceed without blocking anything.

2.
+/* Reset replication origin tracking. */
+void
+parallel_apply_replorigin_reset(void)
+{
+ bool started_tx = false;
+
+ /* This function might be called inside or outside of transaction. */
+ if (!IsTransactionState())
+ {
+ StartTransactionCommand();
+ started_tx = true;
+ }

Why do we need a transaction in this function?

3. Few suggestions to improve in the patch:
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index 1623c9e2fa..d9c519dfab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1264,6 +1264,10 @@ apply_handle_stream_prepare(StringInfo s)
  case TRANS_LEADER_SEND_TO_PARALLEL:
  Assert(winfo);

+ /*
+ * The origin can be active only in one process. See
+ * apply_handle_stream_commit.
+ */
  parallel_apply_replorigin_reset();

  /* Send STREAM PREPARE message to the parallel apply worker. */
@@ -1623,12 +1627,7 @@ apply_handle_stream_abort(StringInfo s)
  (errcode(ERRCODE_PROTOCOL_VIOLATION),
  errmsg_internal("STREAM ABORT message without STREAM STOP")));

- /*
- * Check whether the publisher sends abort_lsn and abort_time.
- *
- * Note that the parallel apply worker is only started when the publisher
- * sends abort_lsn and abort_time.
- */
+ /* We receive abort information only when we can apply in parallel. */
  if (MyLogicalRepWorker->in_parallel_apply)
  read_abort_info = true;

@@ -1656,7 +1655,13 @@ apply_handle_stream_abort(StringInfo s)
  Assert(winfo);

  if (subxid == xid)
+ {
+ /*
+ * The origin can be active only in one process. See
+ * apply_handle_stream_commit.
+ */
  parallel_apply_replorigin_reset();
+ }

  /* Send STREAM ABORT message to the parallel apply worker. */
  parallel_apply_send_data(winfo, s->len, s->data);
@@ -1858,6 +1863,12 @@ apply_handle_stream_commit(StringInfo s)
  case TRANS_LEADER_SEND_TO_PARALLEL:
  Assert(winfo);

+ /*
+ * We need to reset the replication origin before sending the commit
+ * message and set it up again after confirming that parallel worker
+ * has processed the message. This is required because origin can be
+ * active only in one process at-a-time.
+ */
  parallel_apply_replorigin_reset();

  /* Send STREAM COMMIT message to the parallel apply worker. */
diff --git a/src/include/replication/worker_internal.h
b/src/include/replication/worker_internal.h
index 4cbfb43492..2bd9664f86 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -70,11 +70,7 @@ typedef struct LogicalRepWorker
  */
  pid_t apply_leader_pid;

- /*
- * Indicates whether to use parallel apply workers.
- *
- * Determined based on streaming parameter and publisher version.
- */
+ /* Indicates whether apply can be performed parallelly. */
  bool in_parallel_apply;


-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

Followings are comments for your patchset.

====
0001


01. launcher.c - logicalrep_worker_stop_internal()

```
+
+       Assert(LWLockHeldByMe(LogicalRepWorkerLock));
+
```

I think it should be Assert(LWLockHeldByMeInMode(LogicalRepWorkerLock, LW_SHARED))
because the lock is released once and acquired again as LW_SHARED.
If a future caller acquires the lock as LW_EXCLUSIVE and calls logicalrep_worker_stop_internal(),
its lock may become weaker after calling it.

02. launcher.c - apply_handle_stream_start()

```
+                       /*
+                        * Notify handle methods we're processing a remote in-progress
+                        * transaction.
+                        */
+                       in_streamed_transaction = true;
 
-               MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
-               FileSetInit(MyLogicalRepWorker->stream_fileset);
+                       /*
+                        * Start a transaction on stream start, this transaction will be
+                        * committed on the stream stop unless it is a tablesync worker in
+                        * which case it will be committed after processing all the
+                        * messages. We need the transaction for handling the buffile,
+                        * used for serializing the streaming data and subxact info.
+                        */
+                       begin_replication_step();
```

Previously in_streamed_transaction was set after the begin_replication_step(),
but the ordering is modified. Maybe we don't have to modify it if there is no particular reason.

03. launcher.c - apply_handle_stream_stop()

```
+                       /* Commit the per-stream transaction */
+                       CommitTransactionCommand();
+
+                       /* Reset per-stream context */
+                       MemoryContextReset(LogicalStreamingContext);
+
+                       pgstat_report_activity(STATE_IDLE, NULL);
+
+                       in_streamed_transaction = false;
```

Previously in_streamed_transaction was set after the MemoryContextReset(), but the ordering is modified.
Maybe we don't have to modify it if there is no particular reason.

04. applyparallelworker.c - LogicalParallelApplyLoop()

```
+               shmq_res = shm_mq_receive(mqh, &len, &data, false);
...
+               if (ConfigReloadPending)
+               {
+                       ConfigReloadPending = false;
+                       ProcessConfigFile(PGC_SIGHUP);
+               }
```


Here the parallel apply worker waits to receive messages and after dispatching it ProcessConfigFile() is called.
It means that .conf will be not read until the parallel apply worker receives new messages and apply them.

It may be problematic when users change log_min_message to debugXXX for debugging but the streamed transaction rarely
come.
They expected that detailed description appears on the log from next streaming chunk, but it does not.

This does not occur in leader worker when it waits messages from publisher, because it uses libpqrcv_receive(), which
works asynchronously.

I 'm not sure whether it should be documented that the evaluation of GUCs may be delayed, how do you think?

===
0004

05. logical-replication.sgml

```
...
In that case, it may be necessary to change the streaming mode to on or off and cause
the same conflicts again so the finish LSN of the failed transaction will be written to the server log.
 ...
```

Above sentence is added by 0001, but it is not modified by 0004.
Such transactions will be retried as streaming=on mode, so some descriptions related with it should be added.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Saturday, September 24, 2022 7:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> >
> > Few comments on v33-0001
> > =======================
> >
> 
> Some more comments on v33-0001
> =============================
> 1.
> + /* Information from the corresponding LogicalRepWorker slot. */
> + uint16 logicalrep_worker_generation;
> +
> + int logicalrep_worker_slot_no;
> +} ParallelApplyWorkerShared;
> 
> Both these variables are read/changed by leader/parallel workers without
> using any lock (mutex). It seems currently there is no problem because of the
> way the patch is using in_parallel_apply_xact but I think it won't be a good idea
> to rely on it. I suggest using mutex to operate on these variables and also check
> if the slot_no is in a valid range after reading it in parallel_apply_free_worker,
> otherwise error out using elog.

Changed.

> 2.
>  static void
>  apply_handle_stream_stop(StringInfo s)
>  {
> - if (!in_streamed_transaction)
> + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> +
> + if (!am_parallel_apply_worker() &&
> + (!in_streamed_transaction && !stream_apply_worker))
>   ereport(ERROR,
>   (errcode(ERRCODE_PROTOCOL_VIOLATION),
>   errmsg_internal("STREAM STOP message without STREAM START")));
> 
> This check won't be able to detect missing stream start messages for parallel
> apply workers apart from the first pair of start/stop. I thought of adding
> in_remote_transaction check along with
> am_parallel_apply_worker() to detect the same but that also won't work
> because the parallel worker doesn't reset it at the stop message.
> Another possibility is to introduce yet another variable for this but that doesn't
> seem worth it. I would like to keep this check simple.
> Can you think of any better way?

I feel we can reuse the in_streamed_transaction flag in the parallel apply worker
to simplify the check there. I tried to set this flag in the parallel apply worker
when the stream starts and reset it at stream stop, so that we can directly check
this flag for duplicate stream start messages and other related things.
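
Roughly like this in the stream start handling (a simplified sketch, not the
final patch code):

```c
/* Detect a duplicate STREAM START in the parallel apply worker as well. */
if (in_streamed_transaction)
    ereport(ERROR,
            (errcode(ERRCODE_PROTOCOL_VIOLATION),
             errmsg_internal("duplicate STREAM START message")));

/* Set at stream start; reset again when STREAM STOP is processed. */
in_streamed_transaction = true;
```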

> 3. I think we can skip sending start/stop messages from the leader to the
> parallel worker because unlike apply worker it will process only one
> transaction-at-a-time. However, it is not clear whether that is worth the effort
> because it is sent after logical_decoding_work_mem changes. For now, I have
> added a comment for this in the attached patch but let me know if I am missing
> something or if I am wrong.

I think the suggested comments look good.

> 4.
> postgres=# select pid, leader_pid, application_name, backend_type from
> pg_stat_activity;
>   pid  | leader_pid | application_name |         backend_type
> -------+------------+------------------+------------------------------
>  27624 |            |                  | logical replication launcher
>  17336 |            | psql             | client backend
>  26312 |            |                  | logical replication worker
>  26376 |            | psql             | client backend
>  14004 |            |                  | logical replication worker
> 
> Here, the second worker entry is for the parallel worker. Isn't it better if we
> distinguish this by keeping type as a logical replication parallel worker? I think
> for this you need to change bgw_type in logicalrep_worker_launch().

Changed.

> 5. Can we name parallel_apply_subxact_info_add() as
> parallel_apply_start_subtrans()?
> 
> Apart from the above, I have added/edited a few comments and made a few
> other cosmetic changes in the attached.

Changed.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, September 26, 2022 6:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Sep 26, 2022 at 8:41 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Thur, Sep 22, 2022 at 18:12 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > > 3.
> > > ApplyWorkerMain()
> > > {
> > > ...
> > > ...
> > > +
> > > + if (server_version >= 160000 &&
> > > + MySubscription->stream == SUBSTREAM_PARALLEL)
> > > + options.proto.logical.streaming = pstrdup("parallel");
> > >
> > > After deciding here whether the parallel streaming mode is enabled
> > > or not, we recheck the same thing in apply_handle_stream_abort() and
> > > parallel_apply_can_start(). In parallel_apply_can_start(), we do it
> > > via two different checks. How about storing this information say in
> > > structure MyLogicalRepWorker in ApplyWorkerMain() and then use it at
> > > other places?
> >
> > Improved as suggested.
> > Added a new flag "in_parallel_apply" to structure MyLogicalRepWorker.
> >
> 
> Can we name the variable in_parallel_apply as parallel_apply and set it in
> logicalrep_worker_launch() instead of in ParallelApplyWorkerMain()?

Changed.

> Few other comments:
> ==================
> 1.
> + if (is_subworker &&
> + nparallelapplyworkers >= max_parallel_apply_workers_per_subscription)
> + {
> + LWLockRelease(LogicalRepWorkerLock);
> +
> + ereport(DEBUG1,
> + (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
> + errmsg("out of parallel apply workers"), errhint("You might need to
> + increase
> max_parallel_apply_workers_per_subscription.")));
> 
> I think it is better to keep the level of this as LOG. Similar messages at other
> places use WARNING or LOG. Here, I prefer LOG because the system can still
> proceed without blocking anything.

Changed.

> 2.
> +/* Reset replication origin tracking. */ void
> +parallel_apply_replorigin_reset(void)
> +{
> + bool started_tx = false;
> +
> + /* This function might be called inside or outside of transaction. */
> + if (!IsTransactionState()) { StartTransactionCommand(); started_tx =
> + true; }
> 
> Why do we need a transaction in this function?

I think we don't need it, so I removed it in the new version of the patch.

> 3. Few suggestions to improve in the patch:
> diff --git a/src/backend/replication/logical/worker.c
> b/src/backend/replication/logical/worker.c
> index 1623c9e2fa..d9c519dfab 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -1264,6 +1264,10 @@ apply_handle_stream_prepare(StringInfo s)
>   case TRANS_LEADER_SEND_TO_PARALLEL:
>   Assert(winfo);
> 
> + /*
> + * The origin can be active only in one process. See
> + * apply_handle_stream_commit.
> + */
>   parallel_apply_replorigin_reset();
> 
>   /* Send STREAM PREPARE message to the parallel apply worker. */ @@
> -1623,12 +1627,7 @@ apply_handle_stream_abort(StringInfo s)
>   (errcode(ERRCODE_PROTOCOL_VIOLATION),
>   errmsg_internal("STREAM ABORT message without STREAM STOP")));
> 
> - /*
> - * Check whether the publisher sends abort_lsn and abort_time.
> - *
> - * Note that the parallel apply worker is only started when the publisher
> - * sends abort_lsn and abort_time.
> - */
> + /* We receive abort information only when we can apply in parallel. */
>   if (MyLogicalRepWorker->in_parallel_apply)
>   read_abort_info = true;
> 
> @@ -1656,7 +1655,13 @@ apply_handle_stream_abort(StringInfo s)
>   Assert(winfo);
> 
>   if (subxid == xid)
> + {
> + /*
> + * The origin can be active only in one process. See
> + * apply_handle_stream_commit.
> + */
>   parallel_apply_replorigin_reset();
> + }
> 
>   /* Send STREAM ABORT message to the parallel apply worker. */
>   parallel_apply_send_data(winfo, s->len, s->data); @@ -1858,6 +1863,12 @@
> apply_handle_stream_commit(StringInfo s)
>   case TRANS_LEADER_SEND_TO_PARALLEL:
>   Assert(winfo);
> 
> + /*
> + * We need to reset the replication origin before sending the commit
> + * message and set it up again after confirming that parallel worker
> + * has processed the message. This is required because origin can be
> + * active only in one process at-a-time.
> + */
>   parallel_apply_replorigin_reset();
> 
>   /* Send STREAM COMMIT message to the parallel apply worker. */ diff --git
> a/src/include/replication/worker_internal.h
> b/src/include/replication/worker_internal.h
> index 4cbfb43492..2bd9664f86 100644
> --- a/src/include/replication/worker_internal.h
> +++ b/src/include/replication/worker_internal.h
> @@ -70,11 +70,7 @@ typedef struct LogicalRepWorker
>   */
>   pid_t apply_leader_pid;
> 
> - /*
> - * Indicates whether to use parallel apply workers.
> - *
> - * Determined based on streaming parameter and publisher version.
> - */
> + /* Indicates whether apply can be performed parallelly. */
>   bool in_parallel_apply;
> 

Merged, thanks.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, September 27, 2022 2:32 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com>
> 
> Dear Wang,
> 
> Followings are comments for your patchset.

Thanks for the comments.

> ====
> 0001
> 
> 
> 01. launcher.c - logicalrep_worker_stop_internal()
> 
> ```
> +
> +       Assert(LWLockHeldByMe(LogicalRepWorkerLock));
> +
> ```

Changed.

> I think it should be Assert(LWLockHeldByMeInMode(LogicalRepWorkerLock,
> LW_SHARED)) because the lock is released once and acquired again as
> LW_SHARED.
> If a future caller acquires the lock as LW_EXCLUSIVE and calls
> logicalrep_worker_stop_internal(),
> its lock may become weaker after calling it.
> 
> 02. launcher.c - apply_handle_stream_start()
> 
> ```
> +                       /*
> +                        * Notify handle methods we're processing a remote
> in-progress
> +                        * transaction.
> +                        */
> +                       in_streamed_transaction = true;
> 
> -               MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
> -               FileSetInit(MyLogicalRepWorker->stream_fileset);
> +                       /*
> +                        * Start a transaction on stream start, this transaction
> will be
> +                        * committed on the stream stop unless it is a
> tablesync worker in
> +                        * which case it will be committed after processing all
> the
> +                        * messages. We need the transaction for handling the
> buffile,
> +                        * used for serializing the streaming data and subxact
> info.
> +                        */
> +                       begin_replication_step();
> ```
> 
> Previously in_streamed_transaction was set after the begin_replication_step(),
> but the ordering is modified. Maybe we don't have to modify it if there is no
> particular reason.
> 
> 03. launcher.c - apply_handle_stream_stop()
> 
> ```
> +                       /* Commit the per-stream transaction */
> +                       CommitTransactionCommand();
> +
> +                       /* Reset per-stream context */
> +                       MemoryContextReset(LogicalStreamingContext);
> +
> +                       pgstat_report_activity(STATE_IDLE, NULL);
> +
> +                       in_streamed_transaction = false;
> ```
> 
> Previously in_streamed_transaction was set after the MemoryContextReset(),
> but the ordering is modified.
> Maybe we don't have to modify it if there is no particular reason.

I adjusted the position of this due to some other improvements this time.

> 
> 04. applyparallelworker.c - LogicalParallelApplyLoop()
> 
> ```
> +               shmq_res = shm_mq_receive(mqh, &len, &data, false);
> ...
> +               if (ConfigReloadPending)
> +               {
> +                       ConfigReloadPending = false;
> +                       ProcessConfigFile(PGC_SIGHUP);
> +               }
> ```
> 
> 
> Here the parallel apply worker waits to receive messages and after dispatching
> it ProcessConfigFile() is called.
> It means that .conf will be not read until the parallel apply worker receives new
> messages and apply them.
> 
> It may be problematic when users change log_min_message to debugXXX for
> debugging but the streamed transaction rarely come.
> They expected that detailed description appears on the log from next
> streaming chunk, but it does not.
> 
> This does not occur in leader worker when it waits messages from publisher,
> because it uses libpqrcv_receive(), which works asynchronously.
> 
> I 'm not sure whether it should be documented that the evaluation of GUCs may
> be delayed, how do you think?

I changed the shm_mq_receive to asynchronous mode which is also consistent with
what we did for Gather node when reading data from parallel query workers.

> 
> ===
> 0004
> 
> 05. logical-replication.sgml
> 
> ```
> ...
> In that case, it may be necessary to change the streaming mode to on or off and
> cause the same conflicts again so the finish LSN of the failed transaction will be
> written to the server log.
>  ...
> ```
> 
> Above sentence is added by 0001, but it is not modified by 0004.
> Such transactions will be retried as streaming=on mode, so some descriptions
> related with it should be added.

Added.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hou,

Thanks for updating the patch. I will review it soon, but first let me reply to your comment.

> > 04. applyparallelworker.c - LogicalParallelApplyLoop()
> >
> > ```
> > +               shmq_res = shm_mq_receive(mqh, &len, &data, false);
> > ...
> > +               if (ConfigReloadPending)
> > +               {
> > +                       ConfigReloadPending = false;
> > +                       ProcessConfigFile(PGC_SIGHUP);
> > +               }
> > ```
> >
> >
> > Here the parallel apply worker waits to receive messages and after dispatching
> > it ProcessConfigFile() is called.
> > It means that .conf will be not read until the parallel apply worker receives new
> > messages and apply them.
> >
> > It may be problematic when users change log_min_message to debugXXX for
> > debugging but the streamed transaction rarely come.
> > They expected that detailed description appears on the log from next
> > streaming chunk, but it does not.
> >
> > This does not occur in leader worker when it waits messages from publisher,
> > because it uses libpqrcv_receive(), which works asynchronously.
> >
> > I 'm not sure whether it should be documented that the evaluation of GUCs may
> > be delayed, how do you think?
> 
> I changed the shm_mq_receive to asynchronous mode which is also consistent
> with
> what we did for Gather node when reading data from parallel query workers.

I checked your implementation, but it seems that the parallel apply worker will not sleep
even if there are no messages or signals. That might be very inefficient.

In the gather node - gather_readnext() - the same approach is used, but I think there is a
premise there that the wait time is short because it is related to only one gather node.
For a parallel apply worker, however, we cannot predict the wait time because it depends on
the streamed transactions. If such transactions rarely come, parallel apply workers may
consume a lot of CPU time.

I think we should wait for a short time, or until the leader notifies us, if shmq_res == SHM_MQ_WOULD_BLOCK.
How do you think?
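
For example, something like this inside the receive loop (an untested sketch;
the timeout and wait event name are only illustrative):

```c
shmq_res = shm_mq_receive(mqh, &len, &data, true);  /* nowait */

if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
    int         rc;

    /* Sleep until the leader sets our latch, or a short timeout expires. */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   1000L,
                   WAIT_EVENT_LOGICAL_APPLY_MAIN);

    if (rc & WL_LATCH_SET)
    {
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

    continue;
}
```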


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Here are my review comments for the v35-0001 patch:

======

1. Commit message

Currently, for large transactions, the publisher sends the data in multiple
streams (changes divided into chunks depending upon logical_decoding_work_mem),
and then on the subscriber-side, the apply worker writes the changes into
temporary files and once it receives the commit, it reads from the file and
applies the entire transaction.

~

There is a mix of plural and singular.

"reads from the file" -> "reads from those files" ?

~~~

2.

This preserves commit ordering and avoids
writing to and reading from file in most cases. We still need to spill if there
is no worker available.

2a.
"file" => "files"

2b.
"in most cases. We still need to spill" -> "in most cases, although we
still need to spill"

======

3. GENERAL

(this comment was written after I wrote all the other ones below so
there might be some unintended overlaps...)

I found the mixed use of the same member names having different
meanings to be quite confusing.

e.g.1
PGOutputData 'streaming' is now a single char internal representation of
the subscription parameter streaming mode ('f','t','p')
- bool streaming;
+ char streaming;

e.g.2
WalRcvStreamOptions 'streaming' is a C string version of the
subscription streaming mode ("on", "parallel")
- bool streaming; /* Streaming of large transactions */
+ char    *streaming; /* Streaming of large transactions */

e.g.3
SubOpts 'streaming' is again like the first example - a single char
for the mode.
- bool streaming;
+ char streaming;


IMO everything would become much simpler if you did:

3a.
Rename "char streaming;" -> "char streaming_mode;"

3b.
Re-designed the "char *streaming;" code to also use the single char
notation, then also call that member 'streaming_mode'. Then everything
will be consistent.


======

doc/src/sgml/config.sgml

4. - max_parallel_apply_workers_per_subscription

+     <varlistentry
id="guc-max-parallel-apply-workers-per-subscription"
xreflabel="max_parallel_apply_workers_per_subscription">
+      <term><varname>max_parallel_apply_workers_per_subscription</varname>
(<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_parallel_apply_workers_per_subscription</varname>
configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Maximum number of parallel apply workers per subscription. This
+        parameter controls the amount of parallelism for streaming of
+        in-progress transactions with subscription parameter
+        <literal>streaming = parallel</literal>.
+       </para>
+       <para>
+        The parallel apply workers are taken from the pool defined by
+        <varname>max_logical_replication_workers</varname>.
+       </para>
+       <para>
+        The default value is 2. This parameter can only be set in the
+        <filename>postgresql.conf</filename> file or on the server command
+        line.
+       </para>
+      </listitem>
+     </varlistentry>

I felt that maybe this should also xref to the
doc/src/sgml/logical-replication.sgml section where it says
"max_logical_replication_workers should be increased according to the
desired number of parallel apply workers."

=====

5. doc/src/sgml/protocol.sgml

+      <para>
+       Version <literal>4</literal> is supported only for server version 16
+       and above, and it allows applying streams of large in-progress
+       transactions in parallel.
+      </para>

SUGGESTION
... and it allows streams of large in-progress transactions to be
applied in parallel.

======

6. doc/src/sgml/ref/create_subscription.sgml

+         <para>
+          If set to <literal>parallel</literal>, incoming changes are directly
+          applied via one of the parallel apply workers, if available. If no
+          parallel worker is free to handle streaming transactions then the
+          changes are written to temporary files and applied after the
+          transaction is committed. Note that if an error happens when
+          applying changes in a parallel worker, the finish LSN of the
+          remote transaction might not be reported in the server log.
          </para>

6a.
"parallel worker is free" -> "parallel apply worker is free"

~

6b.
"Note that if an error happens when applying changes in a parallel
worker," --> "Note that if an error happens in a parallel apply
worker,"

======

7. src/backend/access/transam/xact.c - RecordTransactionAbort


+ /*
+ * Are we using the replication origins feature?  Or, in other words, are
+ * we replaying remote actions?
+ */
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+   replorigin_session_origin != DoNotReplicateId);

"Or, in other words," -> "In other words,"

======

src/backend/replication/logical/applyparallelworker.c

8. - file header comment

+ * Refer to the comments in file header of logical/worker.c to see more
+ * information about parallel apply worker.

8a.
"in file header" -> "in the file header"

~

8b.
"about parallel apply worker." -> "about parallel apply workers."

~~~

9. - parallel_apply_can_start

+/*
+ * Returns true, if it is allowed to start a parallel apply worker, false,
+ * otherwise.
+ */
+static bool
+parallel_apply_can_start(TransactionId xid)

(The commas are strange)

SUGGESTION
Returns true if it is OK to start a parallel apply worker, false otherwise.

or just SUGGESTION
Returns true if it is OK to start a parallel apply worker.

~~~

10.

+ /*
+ * Don't start a new parallel worker if not in parallel streaming mode or
+ * the publisher does not support parallel apply.
+ */
+ if (!MyLogicalRepWorker->parallel_apply)
+ return false;

10a.
SUGGESTION
Don't start a new parallel apply worker if the subscription is not
using parallel streaming mode, or if the publisher does not support
parallel apply.

~

10b.
IMO it might be better to call this flag 'parallel_apply_enabled' or
something similar.
(see also review comment #55b.)

~~~

11. - parallel_apply_start_worker

+ /* Try to start a new parallel apply worker. */
+ if (winfo == NULL)
+ winfo = parallel_apply_setup_worker();
+
+ /* Failed to start a new parallel apply worker. */
+ if (winfo == NULL)
+ return;

IMO it might be cleaner to write that code like below. And then the 2nd
comment is not really adding anything, so it can be removed too.

SUGGESTION
if (winfo == NULL)
{
/* Try to start a new parallel apply worker. */
winfo = parallel_apply_setup_worker();

if (winfo == NULL)
return;
}

~~~

12. - parallel_apply_free_worker

+ SpinLockAcquire(&winfo->shared->mutex);
+ slot_no = winfo->shared->logicalrep_worker_slot_no;
+ generation = winfo->shared->logicalrep_worker_generation;
+ SpinLockRelease(&winfo->shared->mutex);

I know there are not many places doing this, but do you think it might
be worth introducing some new set/get function to encapsulate the
set/get of the generation/slot so it does the mutex spin-locks in
common code?
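
For example, something like this (just an untested sketch; the function
name is invented):

static void
parallel_apply_get_worker_slot(ParallelApplyWorkerShared *shared,
							   int *slot_no, uint16 *generation)
{
	/* Read the slot number and generation together under the mutex. */
	SpinLockAcquire(&shared->mutex);
	*slot_no = shared->logicalrep_worker_slot_no;
	*generation = shared->logicalrep_worker_generation;
	SpinLockRelease(&shared->mutex);
}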

~~~

13. - LogicalParallelApplyLoop

+ /*
+ * Init the ApplyMessageContext which we clean up after each replication
+ * protocol message.
+ */
+ ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+ "ApplyMessageContext",
+ ALLOCSET_DEFAULT_SIZES);

Because this is in the parallel apply worker should the name (e.g. the
2nd param) be changed to "ParallelApplyMessageContext"?

~~~

14.

+ else if (shmq_res == SHM_MQ_DETACHED)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("lost connection to the leader apply worker")));
+ }
+ /* SHM_MQ_WOULD_BLOCK is purposefully ignored */

Instead of that comment sort of floating in space I wonder if this
code would be better written as a switch, so then you can write this
comment in the 'default' case.

OR, maybe the "else if (shmq_res == SHM_MQ_DETACHED)" should be changed to
SUGGESTION
else if (shmq_res != SHM_MQ_WOULD_BLOCK)

OR, just having an empty code block would be better than just a code
comment all by itself.
SUGGESTION
else
{
/* SHM_MQ_WOULD_BLOCK is purposefully ignored */
}
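
e.g. for the switch idea, the code could look something like this (sketch
only):

switch (shmq_res)
{
	case SHM_MQ_SUCCESS:
		/* process the received message */
		break;

	case SHM_MQ_DETACHED:
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("lost connection to the leader apply worker")));
		break;

	default:
		/* SHM_MQ_WOULD_BLOCK is purposefully ignored */
		break;
}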

~~~

15. - ParallelApplyWorkerMain

+ /*
+ * Allocate the origin name in long-lived context for error context
+ * message.
+ */
+ snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);

15a.
"in long-lived" -> "in a long-lived"

~

15b.
Please watch my other thread [1] where I am hoping to push a patch that
will replace these snprintf's with a common function to do the same.
If/when my patch is pushed then this code needs to be changed to call
that new function.

~~~

16. - HandleParallelApplyMessages

+ res = shm_mq_receive(winfo->error_mq_handle, &nbytes,
+ &data, true);

Seems to have unnecessary wrapping.

~~~

17. - parallel_apply_setup_dsm

+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a fixed worker info
+ * (ParallelApplyWorkerShared), a message queue, and an error queue.
+ *
+ * Returns true on success, false on failure.
+ */
+static bool
+parallel_apply_setup_dsm(ParallelApplyWorkerInfo *winfo)

"fixed worker info" -> "fixed size worker info" ?

~~~

18.

+ * We need one key to register the location of the header, and we need two
+ * other keys to track the locations of the message queue and the error
+ * message queue.

"and we need two other" -> "and two other"

~~~

19. - parallel_apply_wait_for_xact_finish

+void
+parallel_apply_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
+{
+ for (;;)
+ {
+ if (!parallel_apply_get_in_xact(winfo->shared))
+ break;

Should that condition have a comment? All the others do.

~~~

20. - parallel_apply_savepoint_name

The only callers that I could find are from
parallel_apply_start_subtrans and parallel_apply_stream_abort so...

20a.
Why is there an extern in worker_internal.h?

~

20b.
Why is this not declared static?

~~~

21.
The callers of parallel_apply_start_subtrans are both allocating a
name buffer sized like:
char spname[MAXPGPATH];

Is that right?

I thought that PG names were limited by NAMEDATALEN.

~~~

22. - parallel_apply_replorigin_setup

+ snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);

Please watch my other thread [1] where I am hoping to push a patch that
will replace these snprintf's with a common function to do the same.
If/when my patch is pushed then this code needs to be changed to call
that new function.

======

src/backend/replication/logical/launcher.c

23. - GUCs

@@ -54,6 +54,7 @@

 int max_logical_replication_workers = 4;
 int max_sync_workers_per_subscription = 2;
+int max_parallel_apply_workers_per_subscription = 2;

Please watch my other thread [2] where I am hoping to push a patch to
clean up some of these GUC C variable declarations. It is not really
recommended to assign default values to the C variable like this -
they are kind of misleading because they will be overwritten by the
GUC default value when the GUC mechanism starts up.
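
i.e. (sketch only) just declare the C variable, and let the default live
only in the GUC table entry:

/* boot value (2) comes from the GUC definition, not from here */
int			max_parallel_apply_workers_per_subscription;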

~~~

24. - logicalrep_worker_launch

+ /* Sanity check: we don't support table sync in subworker. */
+ Assert(!(is_subworker && OidIsValid(relid)));

IMO "we don't support" makes it sound like this is something that
maybe is intended for the future. In fact, I think just this
combination is not possible so it is just a plain sanity check. I
think might be better just say like below

/* Sanity check - tablesync worker cannot be a subworker */

~~~

25.

+ worker->parallel_apply = is_subworker;

It seems kind of strange to assign one boolean to another when they have
completely different names. I wondered if 'is_subworker' should be
called 'is_parallel_apply_worker'?

~~~

26.

  if (OidIsValid(relid))
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u sync %u", subid, relid);
+ else if (is_subworker)
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication parallel apply worker for subscription %u", subid);
  else
  snprintf(bgw.bgw_name, BGW_MAXLEN,
  "logical replication worker for subscription %u", subid);

I think that *last* text should now be changed like below:

BEFORE
"logical replication worker for subscription %u"
AFTER
"logical replication apply worker for subscription %u"

~~~

27. - logicalrep_worker_stop_internal

+/*
+ * Workhorse for logicalrep_worker_stop(), logicalrep_worker_detach() and
+ * logicalrep_worker_stop_by_slot(). Stop the worker and wait for it to die.
+ */
+static void
+logicalrep_worker_stop_internal(LogicalRepWorker *worker)

IMO it would be better to define this static function *before* all the
callers of it.

~~~

28. - logicalrep_worker_detach

+ /* Stop the parallel apply workers. */
+ if (am_leader_apply_worker())
+ {

Should that comment rather say something like below?

/* If this is the leader apply worker then stop all of its parallel
apply workers. */

~~~

29. - pg_stat_get_subscription

+ /* Skip if this is parallel apply worker */
+ if (worker.apply_leader_pid != InvalidPid)
+ continue;

29a.
"is parallel apply" -> "is a parallel apply"

~

29b.
IMO this condition should be using your macro isParallelApplyWorker(worker).

======

30. src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

+ *
+ * If read_abort_info is true, try to read the abort_lsn and abort_time fields,
+ * otherwise don't.
  */
 void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
- TransactionId *subxid)
+logicalrep_read_stream_abort(StringInfo in,
+ LogicalRepStreamAbortData *abort_data,
+ bool read_abort_info)

"try to read" -> "read"

======

31. src/backend/replication/logical/tablesync.c - process_syncing_tables

 process_syncing_tables(XLogRecPtr current_lsn)
 {
+ if (am_parallel_apply_worker())
+ return;
+

Maybe should have some comment here like:

/* Skip for parallel apply workers. */

======

src/backend/replication/logical/worker.c

32. - file header comment

+ * the list for any available worker. Note that we maintain a maximum of half
+ * the max_parallel_apply_workers_per_subscription workers in the pool and
+ * after that, we simply exit the worker after applying the transaction. This
+ * worker pool threshold is a bit arbitrary and we can provide a guc for this
+ * in the future if required.

IMO that sentence beginning with "This worker pool" should be written
as an XXX-style comment.

Also "guc" -> "GUC variable"

e.g.

* the list for any available worker. Note that we maintain a maximum of half
* the max_parallel_apply_workers_per_subscription workers in the pool and
* after that, we simply exit the worker after applying the transaction.
*
* XXX This worker pool threshold is a bit arbitrary and we can provide a GUC
* variable for this in the future if required.

~~~

33.

 * we cannot count how many workers will be started. It may be possible to
 * allocate enough shared memory in one segment based on the maximum number of
 * parallel apply workers
(max_parallel_apply_workers_per_subscription), but this
 * may waste some memory if no process is actually started.

 "may waste some memory" -> "would waste memory"

~~~

34.

+ * In case, no worker is available to handle the streamed transaction, we
+ * follow approach 2.

SUGGESTION
If no parallel apply worker is available to handle the streamed
transaction we follow approach 2.

~~~

35. - TransApplyAction

+ * TRANS_LEADER_SERIALIZE means that we are in leader apply worker and changes
+ * are written to temporary files and then applied when the final commit
+ * arrives.

"in leader apply" -> "in the leader apply"

~~~

36. - should_apply_changes_for_rel

 should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 {
  if (am_tablesync_worker())
  return MyLogicalRepWorker->relid == rel->localreloid;
+ else if (am_parallel_apply_worker())
+ {
+ if (rel->state != SUBREL_STATE_READY)
+ ereport(ERROR,
+ (errmsg("logical replication apply workers for subscription \"%s\"
will restart",
+ MySubscription->name),
+ errdetail("Cannot handle streamed replication transaction using parallel "
+    "apply workers until all tables are synchronized.")));
+
+ return true;
+ }
  else
  return (rel->state == SUBREL_STATE_READY ||
  (rel->state == SUBREL_STATE_SYNCDONE &&
@@ -427,43 +519,87 @@ end_replication_step(void)

This function can be made tidier just by removing all the 'else' ...

SUGGESTION
if (am_tablesync_worker())
return ...
if (am_parallel_apply_worker())
{
...
return true;
}

Assert(am_leader_apply_worker());
return ...

~~~

37. - handle_streamed_transaction

+ /*
+ * XXX The publisher side doesn't always send relation/type update
+ * messages after the streaming transaction, so also update the
+ * relation/type in leader apply worker here. See function
+ * cleanup_rel_sync_cache.
+ */
+ if (action == LOGICAL_REP_MSG_RELATION ||
+ action == LOGICAL_REP_MSG_TYPE)
+ return false;
+ return true;

37a.
"so also update the relation/type in leader apply worker here"

Is that comment worded correctly? There is nothing being updated "here".

~

37b.
That code is the same as:

return (action != LOGICAL_REP_MSG_RELATION && action != LOGICAL_REP_MSG_TYPE);

~~~

38. - apply_handle_commit_prepared

+ *
+ * Note that we don't need to wait here if the transaction was prepared in a
+ * parallel apply worker. Because we have already waited for the prepare to
+ * finish in apply_handle_stream_prepare() which will ensure all the operations
+ * in that transaction have happened in the subscriber and no concurrent
+ * transaction can create deadlock or transaction dependency issues.
  */
 static void
 apply_handle_commit_prepared(StringInfo s)

"worker. Because" -> "worker because"

~~~

39. - apply_handle_rollback_prepared

+ *
+ * Note that we don't need to wait here if the transaction was prepared in a
+ * parallel apply worker. Because we have already waited for the prepare to
+ * finish in apply_handle_stream_prepare() which will ensure all the operations
+ * in that transaction have happened in the subscriber and no concurrent
+ * transaction can create deadlock or transaction dependency issues.
  */
 static void
 apply_handle_rollback_prepared(StringInfo s)

See previous review comment #38 above.

~~~

40. - apply_handle_stream_prepare

+ case TRANS_LEADER_SERIALIZE:

- /* Mark the transaction as prepared. */
- apply_handle_prepare_internal(&prepare_data);
+ /*
+ * The transaction has been serialized to file, so replay all the
+ * spooled operations.
+ */

Spurious blank line after the 'case'.

FYI - this same blank line is also in all the other switch/case blocks
that look like this one, so if you fix this one then please check all
those other places too...

~~~

41. - apply_handle_stream_start

+ *
+ * XXX We can avoid sending pair of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one
+ * transaction-at-a-time. However, it is not clear whether that is worth the
+ * effort because it is sent after logical_decoding_work_mem changes.
  */
 static void
 apply_handle_stream_start(StringInfo s)

"sending pair" -> "sending pairs"

~~~

42.

- /* notify handle methods we're processing a remote transaction */
+ /* Notify handle methods we're processing a remote transaction. */
  in_streamed_transaction = true;

Changing this comment seemed unrelated to this patch, so maybe don't do this.

~~~

43.

  /*
- * Initialize the worker's stream_fileset if we haven't yet. This will be
- * used for the entire duration of the worker so create it in a permanent
- * context. We create this on the very first streaming message from any
- * transaction and then use it for this and other streaming transactions.
- * Now, we could create a fileset at the start of the worker as well but
- * then we won't be sure that it will ever be used.
+ * For the first stream start, check if there is any free parallel apply
+ * worker we can use to process this transaction.
  */
- if (MyLogicalRepWorker->stream_fileset == NULL)
+ if (first_segment)
+ parallel_apply_start_worker(stream_xid);

This comment update seems misleading. The
parallel_apply_start_worker() isn't just checking if there is a free
worker. All that free worker logic stuff is *inside* the
parallel_apply_start_worker() function, so maybe there is no need to
mention it here at the caller.

~~~

44.

+ case TRANS_PARALLEL_APPLY:
+ break;

Should this include a comment explaining why there is nothing to do?

~~~

39. - apply_handle_stream_abort

+ /* We receive abort information only when we can apply in parallel. */
+ if (MyLogicalRepWorker->parallel_apply)
+ read_abort_info = true;

44a.
SUGGESTION
We receive abort information only when the publisher can support parallel apply.

~

44b.
Why not remove the assignment in the declaration, and just write this code as:
read_abort_info = MyLogicalRepWorker->parallel_apply;

~~~

45.

+ /*
+ * We are in leader apply worker and the transaction has been
+ * serialized to file.
+ */
+ serialize_stream_abort(xid, subxid);

"in leader apply worker" -> "in the leader apply worker"

~~~

46. - store_flush_position

/* Skip if not the leader apply worker */
if (am_parallel_apply_worker())
return;

I previously wrote something about this and Hou-san gave a reason [3]
why not to change the condition.

But the comment still does not match the code, because a tablesync
worker would get past here.

Maybe the comment is wrong?

~~~

47. - InitializeApplyWorker

+/*
+ * The common initialization for leader apply worker and parallel apply worker.
+ *
+ * Initialize the database connection, in-memory subscription and necessary
+ * config options.
+ */
 void
-ApplyWorkerMain(Datum main_arg)
+InitializeApplyWorker(void)

"The common initialization" -> "Common initialization"

~~~

48. - ApplyWorkerMain

+/* Logical Replication Apply worker entry point */
+void
+ApplyWorkerMain(Datum main_arg)

"Apply worker" -> "apply worker"

~~~

49.

+ /*
+ * We don't currently need any ResourceOwner in a walreceiver process, but
+ * if we did, we could call CreateAuxProcessResourceOwner here.
+ */

I think this comment should have "XXX" prefix.

~~~

50.

+ if (server_version >= 160000 &&
+ MySubscription->stream == SUBSTREAM_PARALLEL)
+ {
+ options.proto.logical.streaming = pstrdup("parallel");
+ MyLogicalRepWorker->parallel_apply = true;
+ }
+ else if (server_version >= 140000 &&
+ MySubscription->stream != SUBSTREAM_OFF)
+ options.proto.logical.streaming = pstrdup("on");
+ else
+ options.proto.logical.streaming = NULL;

IMO it might make more sense for these conditions to be checking the
'options.proto.logical.proto_version' here instead of checking the
hardwired server versions. Also, I suggest it may be better (for clarity)
to always assign the parallel_apply member.

SUGGESTION

if (options.proto.logical.proto_version >=
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM &&
MySubscription->stream == SUBSTREAM_PARALLEL)
{
options.proto.logical.streaming = pstrdup("parallel");
MyLogicalRepWorker->parallel_apply = true;
}
else if (options.proto.logical.proto_version >=
LOGICALREP_PROTO_STREAM_VERSION_NUM &&
MySubscription->stream != SUBSTREAM_OFF)
{
options.proto.logical.streaming = pstrdup("on");
MyLogicalRepWorker->parallel_apply = false;
}
else
{
options.proto.logical.streaming = NULL;
MyLogicalRepWorker->parallel_apply = false;
}

~~~

51. - clear_subscription_skip_lsn

- if (likely(XLogRecPtrIsInvalid(myskiplsn)))
+ if (likely(XLogRecPtrIsInvalid(myskiplsn)) ||
+ am_parallel_apply_worker())
  return;

Unnecessary wrapping.

~~~

52. - get_transaction_apply_action

+static TransApplyAction
+get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo)
+{
+ *winfo = NULL;
+
+ if (am_parallel_apply_worker())
+ {
+ return TRANS_PARALLEL_APPLY;
+ }
+ else if (in_remote_transaction)
+ {
+ return TRANS_LEADER_APPLY;
+ }
+
+ /*
+ * Check if we are processing this transaction using a parallel apply
+ * worker and if so, send the changes to that worker.
+ */
+ else if ((*winfo = parallel_apply_find_worker(xid)))
+ {
+ return TRANS_LEADER_SEND_TO_PARALLEL;
+ }
+ else
+ {
+ return TRANS_LEADER_SERIALIZE;
+ }
+}

52a.
All these if/else and code blocks seem excessive. It can be simplified
as follows:

SUGGESTION

static TransApplyAction
get_transaction_apply_action(TransactionId xid, ParallelApplyWorkerInfo **winfo)
{
*winfo = NULL;

if (am_parallel_apply_worker())
return TRANS_PARALLEL_APPLY;

if (in_remote_transaction)
return TRANS_LEADER_APPLY;

/*
* Check if we are processing this transaction using a parallel apply
* worker and if so, send the changes to that worker.
*/
if ((*winfo = parallel_apply_find_worker(xid)))
return TRANS_LEADER_SEND_TO_PARALLEL;

return TRANS_LEADER_SERIALIZE;
}

~

52b.
Can a tablesync worker ever get here? It might be better to add
Assert(!am_tablesync_worker()); at the top of this function?

======

src/backend/replication/pgoutput/pgoutput.c

53. - pgoutput_startup

  ereport(ERROR,
  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("requested proto_version=%d does not support streaming, need
%d or higher",
  data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+ else if (data->streaming == SUBSTREAM_PARALLEL &&
+ data->protocol_version < LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("requested proto_version=%d does not support parallel
streaming mode, need %d or higher",
+ data->protocol_version, LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM)));

The previous error message just says "streamimg", not "streaming mode"
so for consistency better to remove that word "mode" IMO.

~~~

54. - pgoutput_stream_abort

- logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+ logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid,
abort_lsn, txn->xact_time.abort_time, write_abort_info);
+

Wrapping is needed here.

======

src/include/replication/worker_internal.h

55. - LogicalRepWorker

+ /* Indicates whether apply can be performed parallelly. */
+ bool parallel_apply;
+

55a.
"parallelly" - ?? is there a better way to phrase this? IMO that is an
uncommon word.

~

55b.
IMO this member should be named slightly differently to give a
better feel for what it really means.

Maybe something like one of:
"parallel_apply_ok"
"parallel_apply_enabled"
"use_parallel_apply"
etc?

~~~

56. - ParallelApplyWorkerInfo

+ /*
+ * Indicates whether the worker is available to be used for parallel apply
+ * transaction?
+ */
+ bool in_use;

As previously posted [4], this member comment is describing the
opposite of the member name. (e.g. the comment would be correct if the
member was called 'is_available', but it isn't)

SUGGESTION
True if the worker is being used to process a parallel apply
transaction. False indicates this worker is available for re-use.

~~~

57. - am_leader_apply_worker

+static inline bool
+am_leader_apply_worker(void)
+{
+ return (!OidIsValid(MyLogicalRepWorker->relid) &&
+ !isParallelApplyWorker(MyLogicalRepWorker));
+}

I wondered if it would be tidier/easier to define this function like
below. The others are inline functions anyhow so it should end up as
the same thing, right?

static inline bool
am_leader_apply_worker(void)
{
return (!am_tablesync_worker() && !am_parallel_apply_worker);
}

======

58.

--- fail - streaming must be boolean
+-- fail - streaming must be boolean or 'parallel'
 CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, streaming = foo);

I think there are already tests that explicitly create/set the
subscription parameter streaming = on/off/parallel.

But what about when there is no value explicitly specified? Shouldn't
there also be tests like below to check that *implied* boolean true
still works for this enum?

CREATE SUBSCRIPTION ... WITH (streaming)
ALTER SUBSCRIPTION ... SET (streaming)

------
[1] My patch snprintfs -

https://www.postgresql.org/message-id/flat/CAHut%2BPsB9hEEU-JHqTUBL3bv--vesUvThYr1-95ZyG5PkF9PQQ%40mail.gmail.com#17abe65e826f48d3d5a1cf5b83ce5271
[2] My patch GUC C vars -

https://www.postgresql.org/message-id/flat/CAHut%2BPsWxJgmrAvPsw9smFVAvAoyWstO7ttAkAq8NKDhsVNa3Q%40mail.gmail.com#1526a180383a3374ae4d701f25799926
[3] Houz reply comment #41 -
https://www.postgresql.org/message-id/OS0PR01MB5716E7E5798625AE9437CD6F94439%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[4] Previous review comment #13 -
https://www.postgresql.org/message-id/CAHut%2BPuVjRgGr4saN7qwq0oB8DANHVR7UfDiciB1Q3cYN54F6A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Tue, Sep 27, 2022 at 9:26 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, September 24, 2022 7:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > >
> > > Few comments on v33-0001
> > > =======================
> > >
> >
> > Some more comments on v33-0001
> > =============================
> > 1.
> > + /* Information from the corresponding LogicalRepWorker slot. */
> > + uint16 logicalrep_worker_generation;
> > +
> > + int logicalrep_worker_slot_no;
> > +} ParallelApplyWorkerShared;
> >
> > Both these variables are read/changed by leader/parallel workers without
> > using any lock (mutex). It seems currently there is no problem because of the
> > way the patch is using in_parallel_apply_xact but I think it won't be a good idea
> > to rely on it. I suggest using mutex to operate on these variables and also check
> > if the slot_no is in a valid range after reading it in parallel_apply_free_worker,
> > otherwise error out using elog.
>
> Changed.
>
> > 2.
> >  static void
> >  apply_handle_stream_stop(StringInfo s)
> >  {
> > - if (!in_streamed_transaction)
> > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > +
> > + if (!am_parallel_apply_worker() &&
> > + (!in_streamed_transaction && !stream_apply_worker))
> >   ereport(ERROR,
> >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> >   errmsg_internal("STREAM STOP message without STREAM START")));
> >
> > This check won't be able to detect missing stream start messages for parallel
> > apply workers apart from the first pair of start/stop. I thought of adding
> > in_remote_transaction check along with
> > am_parallel_apply_worker() to detect the same but that also won't work
> > because the parallel worker doesn't reset it at the stop message.
> > Another possibility is to introduce yet another variable for this but that doesn't
> > seem worth it. I would like to keep this check simple.
> > Can you think of any better way?
>
> I feel we can reuse the in_streamed_transaction in parallel apply worker to
> simplify the check there. I tried to set this flag in parallel apply worker
> when stream starts and reset it when stream stop so that we can directly check
> this flag for duplicate stream start message and other related things.
>
> > 3. I think we can skip sending start/stop messages from the leader to the
> > parallel worker because unlike apply worker it will process only one
> > transaction-at-a-time. However, it is not clear whether that is worth the effort
> > because it is sent after logical_decoding_work_mem changes. For now, I have
> > added a comment for this in the attached patch but let me if I am missing
> > something or if I am wrong.
>
> I the suggested comments look good.
>
> > 4.
> > postgres=# select pid, leader_pid, application_name, backend_type from
> > pg_stat_activity;
> >   pid  | leader_pid | application_name |         backend_type
> > -------+------------+------------------+------------------------------
> >  27624 |            |                  | logical replication launcher
> >  17336 |            | psql             | client backend
> >  26312 |            |                  | logical replication worker
> >  26376 |            | psql             | client backend
> >  14004 |            |                  | logical replication worker
> >
> > Here, the second worker entry is for the parallel worker. Isn't it better if we
> > distinguish this by keeping type as a logical replication parallel worker? I think
> > for this you need to change bgw_type in logicalrep_worker_launch().
>
> Changed.
>
> > 5. Can we name parallel_apply_subxact_info_add() as
> > parallel_apply_start_subtrans()?
> >
> > Apart from the above, I have added/edited a few comments and made a few
> > other cosmetic changes in the attached.
>

While looking at v35 patch, I realized that there are some cases where
the logical replication gets stuck depending on partitioned table
structure. For instance, there are following tables, publication, and
subscription:

* On publisher
create table p (c int) partition by list (c);
create table c1 partition of p for values in (1);
create table c2 (c int);
create publication test_pub for table p, c1, c2 with
(publish_via_partition_root = 'true');

* On subscriber
create table p (c int) partition by list (c);
create table c1 partition of p for values In (2);
create table c2 partition of p for values In (1);
create subscription test_sub connection 'port=5551 dbname=postgres'
publication test_pub with (streaming = 'parallel', copy_data =
'false');

Note that while both the publisher and the subscriber have tables with
the same names, the partition structure is different and rows go to a
different table on the subscriber (e.g., row c=1 will go to the c2 table
on the subscriber). If two concurrent transactions are executed as
follows, the apply worker (i.e., the leader apply worker) waits for a
lock on c2 held by its parallel apply worker:

* TX-1
BEGIN;
INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are streamed

    * TX-2
    BEGIN;
    TRUNCATE c2; --- wait for a lock on c2

* TX-1
INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
COMMIT;

This might not be a common case in practice but it could mean that
there is a restriction on how partitioned tables should be structured
on the publisher and the subscriber when using streaming = 'parallel'.
When this happens, since the logical replication cannot move forward,
the users need to disable parallel-apply mode or increase
logical_decoding_work_mem. We could describe this limitation in the
docs, but it would be hard for users to detect the problematic table
structure.

BTW, when the leader apply worker waits for a lock on c2 in the above
example, the parallel apply worker is in a busy-loop, which should be
fixed.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Thu, Sep 29, 2022 at 3:20 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Hou,
>
> Thanks for updating patch. I will review yours soon, but I reply to your comment.
>
> > > 04. applyparallelworker.c - LogicalParallelApplyLoop()
> > >
> > > ```
> > > +               shmq_res = shm_mq_receive(mqh, &len, &data, false);
> > > ...
> > > +               if (ConfigReloadPending)
> > > +               {
> > > +                       ConfigReloadPending = false;
> > > +                       ProcessConfigFile(PGC_SIGHUP);
> > > +               }
> > > ```
> > >
> > >
> > > Here the parallel apply worker waits to receive messages and after dispatching
> > > it ProcessConfigFile() is called.
> > > It means that .conf will be not read until the parallel apply worker receives new
> > > messages and apply them.
> > >
> > > It may be problematic when users change log_min_message to debugXXX for
> > > debugging but the streamed transaction rarely come.
> > > They expected that detailed description appears on the log from next
> > > streaming chunk, but it does not.
> > >
> > > This does not occur in leader worker when it waits messages from publisher,
> > > because it uses libpqrcv_receive(), which works asynchronously.
> > >
> > > I 'm not sure whether it should be documented that the evaluation of GUCs may
> > > be delayed, how do you think?
> >
> > I changed the shm_mq_receive to asynchronous mode which is also consistent
> > with
> > what we did for Gather node when reading data from parallel query workers.
>
> I checked your implementation, but it seemed that the parallel apply worker will not sleep
> even if there are no messages or signals. It might be very inefficient.
>
> In gather node - gather_readnext(), the same way is used, but I think there is a premise
> that the wait-time is short because it is related with only one gather node.
> In terms of parallel apply worker, however, we cannot predict the wait-time because
> it is related with the streamed transactions. If such transactions rarely come, parallel apply workers may spend a lot
> of CPU time.
 
>
> I think we should wait during short time or until leader notifies, if shmq_res == SHM_MQ_WOULD_BLOCK.
> How do you think?
>

Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
using it for the same case at some other place in the code? We can use
the same nap time as we are using in the leader apply worker.
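
Something like the below (just an untested sketch; the wait event and the
1000ms nap time are placeholders):

else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
	int			rc;

	/* Sleep until something sets our latch or the nap time elapses. */
	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				   1000L,	/* same nap time as the leader apply worker */
				   WAIT_EVENT_LOGICAL_APPLY_MAIN);

	if (rc & WL_LATCH_SET)
	{
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}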

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Amit,

> Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
> using it for the same case at some other place in the code? We can use
> the same nap time as we are using in the leader apply worker.

I'm not sure whether such a short nap time is needed or not, because
unlike the leader apply worker, parallel apply workers do not have a
timeout like wal_receiver_timeout, so they do not have to check so
frequently or send feedback to the publisher. But basically I agree that
we can use the same logic as the leader.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Fri, Sep 30, 2022 at 1:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for the v35-0001 patch:
>
> ======
>
> 3. GENERAL
>
> (this comment was written after I wrote all the other ones below so
> there might be some unintended overlaps...)
>
> I found the mixed use of the same member names having different
> meanings to be quite confusing.
>
> e.g.1
> PGOutputData 'streaming' is now a single char internal representation
> the subscription parameter streaming mode ('f','t','p')
> - bool streaming;
> + char streaming;
>
> e.g.2
> WalRcvStreamOptions 'streaming' is a C string version of the
> subscription streaming mode ("on", "parallel")
> - bool streaming; /* Streaming of large transactions */
> + char    *streaming; /* Streaming of large transactions */
>
> e.g.3
> SubOpts 'streaming' is again like the first example - a single char
> for the mode.
> - bool streaming;
> + char streaming;
>
>
> IMO everything would become much simpler if you did:
>
> 3a.
> Rename "char streaming;" -> "char streaming_mode;"
>
> 3b.
> Re-designed the "char *streaming;" code to also use the single char
> notation, then also call that member 'streaming_mode'. Then everything
> will be consistent.
>

Won't this impact the previous version publisher which already uses
on/off? We may need to maintain multiple values which would be
confusing.

>
> 9. - parallel_apply_can_start
>
> +/*
> + * Returns true, if it is allowed to start a parallel apply worker, false,
> + * otherwise.
> + */
> +static bool
> +parallel_apply_can_start(TransactionId xid)
>
> (The commas are strange)
>
> SUGGESTION
> Returns true if it is OK to start a parallel apply worker, false otherwise.
>

+1 for this.
>
> 28. - logicalrep_worker_detach
>
> + /* Stop the parallel apply workers. */
> + if (am_leader_apply_worker())
> + {
>
> Should that comment rather say like below?
>
> /* If this is the leader apply worker then stop all of its parallel
> apply workers. */
>

I think this would be just saying what is apparent from the code, so
not sure if it is an improvement.

>
> 38. - apply_handle_commit_prepared
>
> + *
> + * Note that we don't need to wait here if the transaction was prepared in a
> + * parallel apply worker. Because we have already waited for the prepare to
> + * finish in apply_handle_stream_prepare() which will ensure all the operations
> + * in that transaction have happened in the subscriber and no concurrent
> + * transaction can create deadlock or transaction dependency issues.
>   */
>  static void
>  apply_handle_commit_prepared(StringInfo s)
>
> "worker. Because" -> "worker because"
>

I think this will make this line too long. Can we think of breaking it
in some way?

>
> 43.
>
>   /*
> - * Initialize the worker's stream_fileset if we haven't yet. This will be
> - * used for the entire duration of the worker so create it in a permanent
> - * context. We create this on the very first streaming message from any
> - * transaction and then use it for this and other streaming transactions.
> - * Now, we could create a fileset at the start of the worker as well but
> - * then we won't be sure that it will ever be used.
> + * For the first stream start, check if there is any free parallel apply
> + * worker we can use to process this transaction.
>   */
> - if (MyLogicalRepWorker->stream_fileset == NULL)
> + if (first_segment)
> + parallel_apply_start_worker(stream_xid);
>
> This comment update seems misleading. The
> parallel_apply_start_worker() isn't just checking if there is a free
> worker. All that free worker logic stuff is *inside* the
> parallel_apply_start_worker() function, so maybe no need to mention
> about it here at the caller.
>

It will be good to have some comments here instead of completely removing it.

>
> 39. - apply_handle_stream_abort
>
> + /* We receive abort information only when we can apply in parallel. */
> + if (MyLogicalRepWorker->parallel_apply)
> + read_abort_info = true;
>
> 44a.
> SUGGESTION
> We receive abort information only when the publisher can support parallel apply.
>

The existing comment seems better to me in this case.

>
> 55. - LogicalRepWorker
>
> + /* Indicates whether apply can be performed parallelly. */
> + bool parallel_apply;
> +
>
> 55a.
> "parallelly" - ?? is there a better way to phrase this? IMO that is an
> uncommon word.
>

How about ".. can be performed in parallel."?

> ~
>
> 55b.
> IMO this member name should be named slightly different to give a
> better feel for what it really means.
>
> Maybe something like one of:
> "parallel_apply_ok"
> "parallel_apply_enabled"
> "use_parallel_apply"
> etc?
>

The extra word doesn't seem to be useful here.

> 58.
>
> --- fail - streaming must be boolean
> +-- fail - streaming must be boolean or 'parallel'
>  CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = foo);
>
> I think there are tests already for explicitly create/set the
> subscription parameter streaming = on/off/parallel
>
> But what about when there is no value explicitly specified? Shouldn't
> there also be tests like below to check that *implied* boolean true
> still works for this enum?
>
> CREATE SUBSCRIPTION ... WITH (streaming)
> ALTER SUBSCRIPTION ... SET (streaming)
>

I think before adding new tests for this, please check if we have any
similar tests for other boolean options.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> Sent: Thursday, October 6, 2022 4:07 PM
> To: Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; Wang, Wei/王 威
> <wangw.fnst@fujitsu.com>; Peter Smith <smithpb2250@gmail.com>; Dilip
> Kumar <dilipbalaut@gmail.com>; Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com>;
> PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
> Subject: Re: Perform streaming logical transactions by background workers and
> parallel apply
> 
> On Tue, Sep 27, 2022 at 9:26 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Saturday, September 24, 2022 7:40 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> > > > <wangw.fnst@fujitsu.com> wrote:
> > > > >
> > > >
> > > > Few comments on v33-0001
> > > > =======================
> > > >
> > >
> > > Some more comments on v33-0001
> > > =============================
> > > 1.
> > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > + uint16 logicalrep_worker_generation;
> > > +
> > > + int logicalrep_worker_slot_no;
> > > +} ParallelApplyWorkerShared;
> > >
> > > Both these variables are read/changed by leader/parallel workers without
> > > using any lock (mutex). It seems currently there is no problem because of
> the
> > > way the patch is using in_parallel_apply_xact but I think it won't be a good
> idea
> > > to rely on it. I suggest using mutex to operate on these variables and also
> check
> > > if the slot_no is in a valid range after reading it in parallel_apply_free_worker,
> > > otherwise error out using elog.
> >
> > Changed.
> >
> > > 2.
> > >  static void
> > >  apply_handle_stream_stop(StringInfo s)
> > >  {
> > > - if (!in_streamed_transaction)
> > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > +
> > > + if (!am_parallel_apply_worker() &&
> > > + (!in_streamed_transaction && !stream_apply_worker))
> > >   ereport(ERROR,
> > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > >
> > > This check won't be able to detect missing stream start messages for parallel
> > > apply workers apart from the first pair of start/stop. I thought of adding
> > > in_remote_transaction check along with
> > > am_parallel_apply_worker() to detect the same but that also won't work
> > > because the parallel worker doesn't reset it at the stop message.
> > > Another possibility is to introduce yet another variable for this but that
> doesn't
> > > seem worth it. I would like to keep this check simple.
> > > Can you think of any better way?
> >
> > I feel we can reuse the in_streamed_transaction in parallel apply worker to
> > simplify the check there. I tried to set this flag in parallel apply worker
> > when stream starts and reset it when stream stop so that we can directly check
> > this flag for duplicate stream start message and other related things.
> >
> > > 3. I think we can skip sending start/stop messages from the leader to the
> > > parallel worker because unlike apply worker it will process only one
> > > transaction-at-a-time. However, it is not clear whether that is worth the
> effort
> > > because it is sent after logical_decoding_work_mem changes. For now, I have
> > > added a comment for this in the attached patch but let me if I am missing
> > > something or if I am wrong.
> >
> > I the suggested comments look good.
> >
> > > 4.
> > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > pg_stat_activity;
> > >   pid  | leader_pid | application_name |         backend_type
> > > -------+------------+------------------+------------------------------
> > >  27624 |            |                  | logical replication launcher
> > >  17336 |            | psql             | client backend
> > >  26312 |            |                  | logical replication worker
> > >  26376 |            | psql             | client backend
> > >  14004 |            |                  | logical replication worker
> > >
> > > Here, the second worker entry is for the parallel worker. Isn't it better if we
> > > distinguish this by keeping type as a logical replication parallel worker? I
> think
> > > for this you need to change bgw_type in logicalrep_worker_launch().
> >
> > Changed.
> >
> > > 5. Can we name parallel_apply_subxact_info_add() as
> > > parallel_apply_start_subtrans()?
> > >
> > > Apart from the above, I have added/edited a few comments and made a few
> > > other cosmetic changes in the attached.
> >
> 
> While looking at v35 patch, I realized that there are some cases where
> the logical replication gets stuck depending on partitioned table
> structure. For instance, there are following tables, publication, and
> subscription:
> 
> * On publisher
> create table p (c int) partition by list (c);
> create table c1 partition of p for values in (1);
> create table c2 (c int);
> create publication test_pub for table p, c1, c2 with
> (publish_via_partition_root = 'true');
> 
> * On subscriber
> create table p (c int) partition by list (c);
> create table c1 partition of p for values In (2);
> create table c2 partition of p for values In (1);
> create subscription test_sub connection 'port=5551 dbname=postgres'
> publication test_pub with (streaming = 'parallel', copy_data =
> 'false');
> 
> Note that while both the publisher and the subscriber have the same
> name tables the partition structure is different and rows go to a
> different table on the subscriber (eg, row c=1 will go to c2 table on
> the subscriber). If two current transactions are executed as follows,
> the apply worker (ig, the leader apply worker) waits for a lock on c2
> held by its parallel apply worker:
> 
> * TX-1
> BEGIN;
> INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are
> streamed
> 
>     * TX-2
>     BEGIN;
>     TRUNCATE c2; --- wait for a lock on c2
> 
> * TX-1
> INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
> COMMIT;
> 
> This might not be a common case in practice but it could mean that
> there is a restriction on how partitioned tables should be structured
> on the publisher and the subscriber when using streaming = 'parallel'.
> When this happens, since the logical replication cannot move forward
> the users need to disable parallel-apply mode or increase
> logical_decoding_work_mem. We could describe this limitation in the
> doc but it would be hard for users to detect problematic table
> structure.

Thanks for testing this!

I think the root cause of this kind of deadlock problem is the table
structure difference between the publisher and the subscriber (similar to
the unique difference reported earlier [1]). So, I think we'd better
disallow this case. For example, to avoid the reported problem, we could
only support parallel apply if pubviaroot is false on the publisher and
the replicated tables' types (relkind) are the same between the publisher
and the subscriber.

Although it might restrict some use cases, I think it only restricts the
cases where the partitioned table's structure is different between the
publisher and the subscriber. Users can still use parallel apply when the
table structure is the same on both sides, which seems acceptable to me.
And we can also document that the feature is expected to be used when the
tables' structures are the same. Thoughts?

BTW, to achieve this, we could send the publisher's relkind along with the
RELATION message and compare it with the relkind on the subscriber. We
could report an error if the publisher's or the subscriber's table is a
partitioned table.

> BTW, when the leader apply worker waits for a lock on c2 in the above
> example, the parallel apply worker is in a busy-loop, which should be
> fixed.

Yeah, it seems we used async mode when receiving messages, which caused
this. I plan to improve that part soon.

[1] https://www.postgresql.org/message-id/CAD21AoDPHstj%2BjD3ODS-bd1uM%2BZE%3DcpDKf8npeNFZD%2BYdM28fA%40mail.gmail.com

Best regards,
Hou zj



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, October 6, 2022 6:54 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Amit,
> 
> > Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
> > using it for the same case at some other place in the code? We can use
> > the same nap time as we are using in the leader apply worker.
> 
> I'm not sure whether such a short nap time is needed or not.
> Because unlike leader apply worker, parallel apply workers do not have timeout
> like wal_receiver_timeout, so they do not have to check so frequently and send
> feedback to publisher.
> But basically I agree that we can use same logic as leader.

Thanks for the suggestion.

I tried to add a WaitLatch, but it seems to affect the performance
because the latch might not be set when the leader sends a message
to the parallel apply worker, which means the worker will wait until
the timeout expires.

I feel we'd better change it back to sync mode and do the ProcessConfigFile()
after receiving the message and before applying the change, which also
seems to address the problem.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hou,

I put comments for v35-0001.

01. catalog.sgml

```
+       Controls how to handle the streaming of in-progress transactions:
+       <literal>f</literal> = disallow streaming of in-progress transactions,
+       <literal>t</literal> = spill the changes of in-progress transactions to
+       disk and apply at once after the transaction is committed on the
+       publisher,
+       <literal>p</literal> = apply changes directly using a parallel apply
+       worker if available (same as 't' if no worker is available)
```

I'm not sure why 't' means "spill the changes to file". Is it a compatibility issue?

~~~
02. applyworker.c - parallel_apply_stream_abort

The argument abort_data is not modified in the function. Maybe the "const" modifier should be added.
(Other functions should also be checked...)
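
e.g. (assuming the rest of the signature stays the same):

static void
parallel_apply_stream_abort(const LogicalRepStreamAbortData *abort_data)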

~~~
03. applyparallelworker.c - parallel_apply_find_worker

```
+       ParallelApplyWorkerEntry *entry = NULL;
```

This may not have to be initialized here.

~~~
04. applyparallelworker.c - HandleParallelApplyMessages

```
+       static MemoryContext hpm_context = NULL;
```

I think "hpm" means "handle parallel message", so it should be "hpam".

~~~
05. launcher.c - logicalrep_worker_launch()

```
    if (is_subworker)
        snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication parallel worker");
    else
        snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication worker");
```

I'm not sure why there are only two bgw_type values even though there are three types of apply workers. Is it for compatibility?

~~~
06. launcher.c - logicalrep_worker_stop_by_slot

An assertion like Assert(slot_no >=0 && slot_no < max_logical_replication_workers) should be added at the top of this
function.

~~~
07. launcher.c - logicalrep_worker_stop_internal

```
+/*
+ * Workhorse for logicalrep_worker_stop(), logicalrep_worker_detach() and
+ * logicalrep_worker_stop_by_slot(). Stop the worker and wait for it to die.
+ */
+static void
+logicalrep_worker_stop_internal(LogicalRepWorker *worker)
```

I think logicalrep_worker_stop_internal() may not be the "workhorse" for logicalrep_worker_detach(). In that function
the internal function is only called for parallel apply workers, and it does not do the main part of the detach work.
 

~~~
08. worker.c - handle_streamed_transaction()

```
+       TransactionId current_xid = InvalidTransactionId;
```

This initialization is not needed. The variable is not used in non-streaming mode, and otherwise it is assigned before use.

~~~
09. worker.c - handle_streamed_transaction()

```
+               case TRANS_PARALLEL_APPLY:
+                       /* Define a savepoint for a subxact if needed. */
+                       parallel_apply_start_subtrans(current_xid, stream_xid);
+                       return false;
```

Based on the other case-blocks, Assert(am_parallel_apply_worker()) may be added at the top of this part.
The same suggestion applies to the other switch-case statements.

~~~
10. worker.c - apply_handle_stream_start

```
+ *
+ * XXX We can avoid sending pair of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one
+ * transaction-at-a-time. However, it is not clear whether that is worth the
+ * effort because it is sent after logical_decoding_work_mem changes.
```

I can understand that the START message is not needed, but is STOP really removable? If the leader does not send STOP
to its child, does it lose the chance to change the worker state to IDLE_IN_TRANSACTION?
 

~~~
11. worker.c - apply_handle_stream_start

Currently the number of received chunks is not counted, but it could be done if a variable "nchunks" is defined and
incremented in apply_handle_stream_start(). This info may be useful to determine an appropriate
logical_decoding_work_mem for workloads. What do you think?
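
e.g. a rough sketch (the variable name and log level are just examples):

/* hypothetical file-level counter in worker.c */
static uint32 nchunks = 0;

/* in apply_handle_stream_start(), after the message has been parsed */
nchunks++;
elog(DEBUG1, "received stream chunk %u for xid %u", nchunks, stream_xid);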
 

~~~
12. worker.c - get_transaction_apply_action

{} are not needed.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hou,

> Thanks for the suggestion.
> 
> I tried to add a WaitLatch, but it seems affect the performance
> because the Latch might not be set when leader send some
> message to parallel apply worker which means it will wait until
> timeout.

Yes, currently the leader does not notify anything.
To handle that, the leader must set a latch in parallel_apply_send_data().
It can be done if the leader accesses winfo->shared->logicalrep_worker_slot_no
and sets the latch of LogicalRepCtxStruct->worker[slot_no].
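
e.g. a rough, untested sketch (the helper name is made up, and error
handling is omitted); parallel_apply_send_data() could call this after
shm_mq_send():

/* hypothetical helper in launcher.c, since LogicalRepCtx is local to that file */
void
logicalrep_worker_wakeup_by_slot(int slot_no)
{
	/* a share-lock on LogicalRepWorkerLock is probably needed around this */
	LogicalRepWorker *worker = &LogicalRepCtx->workers[slot_no];

	if (worker->proc)
		SetLatch(&worker->proc->procLatch);
}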


Best Regards,
Hayato Kuroda
FUJITSU LIMITED

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, October 6, 2022 9:00 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> > Thanks for the suggestion.
> >
> > I tried to add a WaitLatch, but it seems affect the performance
> > because the Latch might not be set when leader send some message to
> > parallel apply worker which means it will wait until timeout.
> 
> Yes, currently it leader does not notify anything.
> To handle that leader must set a latch in parallel_apply_send_data().
> It can be done if leader accesses to winfo->shared-> logicalrep_worker_slot_no,
> and sets a latch for LogicalRepCtxStruct->worker[slot_no].

Thanks for the suggestion. I think we could do that, but I feel it's not great
to set the latch so frequently. Besides, to access LogicalRepCtxStruct->worker[]
we would need to hold a lock, which might also bring some overhead.

Best regards,
Hou zj

On Thu, Oct 6, 2022 at 10:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 30, 2022 at 1:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for the v35-0001 patch:
> >
> > ======
> >
> > 3. GENERAL
> >
> > (this comment was written after I wrote all the other ones below so
> > there might be some unintended overlaps...)
> >
> > I found the mixed use of the same member names having different
> > meanings to be quite confusing.
> >
> > e.g.1
> > PGOutputData 'streaming' is now a single char internal representation
> > the subscription parameter streaming mode ('f','t','p')
> > - bool streaming;
> > + char streaming;
> >
> > e.g.2
> > WalRcvStreamOptions 'streaming' is a C string version of the
> > subscription streaming mode ("on", "parallel")
> > - bool streaming; /* Streaming of large transactions */
> > + char    *streaming; /* Streaming of large transactions */
> >
> > e.g.3
> > SubOpts 'streaming' is again like the first example - a single char
> > for the mode.
> > - bool streaming;
> > + char streaming;
> >
> >
> > IMO everything would become much simpler if you did:
> >
> > 3a.
> > Rename "char streaming;" -> "char streaming_mode;"
> >
> > 3b.
> > Re-designed the "char *streaming;" code to also use the single char
> > notation, then also call that member 'streaming_mode'. Then everything
> > will be consistent.
> >
>
> Won't this impact the previous version publisher which already uses
> on/off? We may need to maintain multiple values which would be
> confusing.
>

I only meant that the *internal* struct member names mentioned could
change - not anything exposed as user-visible parameter names or
column names etc. Or were you referring to it as causing unnecessary
troubles for back-patching? Anyway, the main point of this review
comment was #3b. Unless I am mistaken, there is no reason why that one
cannot be changed to use 'char' instead of 'char *', for consistency
across all the same named members.

> >
> > 9. - parallel_apply_can_start
> >
> > +/*
> > + * Returns true, if it is allowed to start a parallel apply worker, false,
> > + * otherwise.
> > + */
> > +static bool
> > +parallel_apply_can_start(TransactionId xid)
> >
> > (The commas are strange)
> >
> > SUGGESTION
> > Returns true if it is OK to start a parallel apply worker, false otherwise.
> >
>
> +1 for this.
> >
> > 28. - logicalrep_worker_detach
> >
> > + /* Stop the parallel apply workers. */
> > + if (am_leader_apply_worker())
> > + {
> >
> > Should that comment rather say like below?
> >
> > /* If this is the leader apply worker then stop all of its parallel
> > apply workers. */
> >
>
> I think this would be just saying what is apparent from the code, so
> not sure if it is an improvement.
>
> >
> > 38. - apply_handle_commit_prepared
> >
> > + *
> > + * Note that we don't need to wait here if the transaction was prepared in a
> > + * parallel apply worker. Because we have already waited for the prepare to
> > + * finish in apply_handle_stream_prepare() which will ensure all the operations
> > + * in that transaction have happened in the subscriber and no concurrent
> > + * transaction can create deadlock or transaction dependency issues.
> >   */
> >  static void
> >  apply_handle_commit_prepared(StringInfo s)
> >
> > "worker. Because" -> "worker because"
> >
>
> I think this will make this line too long. Can we think of breaking it
> in some way?

OK, how about below:

Note that we don't need to wait here if the transaction was prepared
in a parallel apply worker. In that case, we have already waited for
the prepare to finish in apply_handle_stream_prepare() which will
ensure all the operations in that transaction have happened in the
subscriber, so no concurrent transaction can cause deadlock or
transaction dependency issues.

>
> >
> > 43.
> >
> >   /*
> > - * Initialize the worker's stream_fileset if we haven't yet. This will be
> > - * used for the entire duration of the worker so create it in a permanent
> > - * context. We create this on the very first streaming message from any
> > - * transaction and then use it for this and other streaming transactions.
> > - * Now, we could create a fileset at the start of the worker as well but
> > - * then we won't be sure that it will ever be used.
> > + * For the first stream start, check if there is any free parallel apply
> > + * worker we can use to process this transaction.
> >   */
> > - if (MyLogicalRepWorker->stream_fileset == NULL)
> > + if (first_segment)
> > + parallel_apply_start_worker(stream_xid);
> >
> > This comment update seems misleading. The
> > parallel_apply_start_worker() isn't just checking if there is a free
> > worker. All that free worker logic stuff is *inside* the
> > parallel_apply_start_worker() function, so maybe no need to mention
> > about it here at the caller.
> >
>
> It will be good to have some comments here instead of completely removing it.
>
> >
> > 39. - apply_handle_stream_abort
> >
> > + /* We receive abort information only when we can apply in parallel. */
> > + if (MyLogicalRepWorker->parallel_apply)
> > + read_abort_info = true;
> >
> > 44a.
> > SUGGESTION
> > We receive abort information only when the publisher can support parallel apply.
> >
>
> The existing comment seems better to me in this case.
>
> >
> > 55. - LogicalRepWorker
> >
> > + /* Indicates whether apply can be performed parallelly. */
> > + bool parallel_apply;
> > +
> >
> > 55a.
> > "parallelly" - ?? is there a better way to phrase this? IMO that is an
> > uncommon word.
> >
>
> How about ".. can be performed in parallel."?
>
> > ~
> >
> > 55b.
> > IMO this member name should be named slightly different to give a
> > better feel for what it really means.
> >
> > Maybe something like one of:
> > "parallel_apply_ok"
> > "parallel_apply_enabled"
> > "use_parallel_apply"
> > etc?
> >
>
> The extra word doesn't seem to be useful here.
>
> > 58.
> >
> > --- fail - streaming must be boolean
> > +-- fail - streaming must be boolean or 'parallel'
> >  CREATE SUBSCRIPTION regress_testsub CONNECTION
> > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> > false, streaming = foo);
> >
> > I think there are tests already for explicitly create/set the
> > subscription parameter streaming = on/off/parallel
> >
> > But what about when there is no value explicitly specified? Shouldn't
> > there also be tests like below to check that *implied* boolean true
> > still works for this enum?
> >
> > CREATE SUBSCRIPTION ... WITH (streaming)
> > ALTER SUBSCRIPTION ... SET (streaming)
> >
>
> I think before adding new tests for this, please check if we have any
> similar tests for other boolean options.

IMO this one is a bit different because it's not really a boolean
option anymore - it's a kind of hybrid boolean/enum. That's why I
thought this ought to be tested regardless of whether there are
existing tests for the (normal) boolean options.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Oct 6, 2022 at 9:04 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Masahiko Sawada <sawada.mshk@gmail.com>
> > Sent: Thursday, October 6, 2022 4:07 PM
> > To: Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>
> > Cc: Amit Kapila <amit.kapila16@gmail.com>; Wang, Wei/王 威
> > <wangw.fnst@fujitsu.com>; Peter Smith <smithpb2250@gmail.com>; Dilip
> > Kumar <dilipbalaut@gmail.com>; Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com>;
> > PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
> > Subject: Re: Perform streaming logical transactions by background workers and
> > parallel apply
> >
> > On Tue, Sep 27, 2022 at 9:26 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Saturday, September 24, 2022 7:40 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> > > > > <wangw.fnst@fujitsu.com> wrote:
> > > > > >
> > > > >
> > > > > Few comments on v33-0001
> > > > > =======================
> > > > >
> > > >
> > > > Some more comments on v33-0001
> > > > =============================
> > > > 1.
> > > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > > + uint16 logicalrep_worker_generation;
> > > > +
> > > > + int logicalrep_worker_slot_no;
> > > > +} ParallelApplyWorkerShared;
> > > >
> > > > Both these variables are read/changed by leader/parallel workers without
> > > > using any lock (mutex). It seems currently there is no problem because of
> > the
> > > > way the patch is using in_parallel_apply_xact but I think it won't be a good
> > idea
> > > > to rely on it. I suggest using mutex to operate on these variables and also
> > check
> > > > if the slot_no is in a valid range after reading it in parallel_apply_free_worker,
> > > > otherwise error out using elog.
> > >
> > > Changed.
> > >
> > > > 2.
> > > >  static void
> > > >  apply_handle_stream_stop(StringInfo s)
> > > >  {
> > > > - if (!in_streamed_transaction)
> > > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > > +
> > > > + if (!am_parallel_apply_worker() &&
> > > > + (!in_streamed_transaction && !stream_apply_worker))
> > > >   ereport(ERROR,
> > > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > > >
> > > > This check won't be able to detect missing stream start messages for parallel
> > > > apply workers apart from the first pair of start/stop. I thought of adding
> > > > in_remote_transaction check along with
> > > > am_parallel_apply_worker() to detect the same but that also won't work
> > > > because the parallel worker doesn't reset it at the stop message.
> > > > Another possibility is to introduce yet another variable for this but that
> > doesn't
> > > > seem worth it. I would like to keep this check simple.
> > > > Can you think of any better way?
> > >
> > > I feel we can reuse the in_streamed_transaction in parallel apply worker to
> > > simplify the check there. I tried to set this flag in parallel apply worker
> > > when stream starts and reset it when stream stop so that we can directly check
> > > this flag for duplicate stream start message and other related things.
> > >
> > > > 3. I think we can skip sending start/stop messages from the leader to the
> > > > parallel worker because unlike apply worker it will process only one
> > > > transaction-at-a-time. However, it is not clear whether that is worth the
> > effort
> > > > because it is sent after logical_decoding_work_mem changes. For now, I have
> > > > added a comment for this in the attached patch but let me if I am missing
> > > > something or if I am wrong.
> > >
> > > I think the suggested comments look good.
> > >
> > > > 4.
> > > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > > pg_stat_activity;
> > > >   pid  | leader_pid | application_name |         backend_type
> > > > -------+------------+------------------+------------------------------
> > > >  27624 |            |                  | logical replication launcher
> > > >  17336 |            | psql             | client backend
> > > >  26312 |            |                  | logical replication worker
> > > >  26376 |            | psql             | client backend
> > > >  14004 |            |                  | logical replication worker
> > > >
> > > > Here, the second worker entry is for the parallel worker. Isn't it better if we
> > > > distinguish this by keeping type as a logical replication parallel worker? I
> > think
> > > > for this you need to change bgw_type in logicalrep_worker_launch().
> > >
> > > Changed.
> > >
> > > > 5. Can we name parallel_apply_subxact_info_add() as
> > > > parallel_apply_start_subtrans()?
> > > >
> > > > Apart from the above, I have added/edited a few comments and made a few
> > > > other cosmetic changes in the attached.
> > >
> >
> > While looking at v35 patch, I realized that there are some cases where
> > the logical replication gets stuck depending on partitioned table
> > structure. For instance, there are following tables, publication, and
> > subscription:
> >
> > * On publisher
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values in (1);
> > create table c2 (c int);
> > create publication test_pub for table p, c1, c2 with
> > (publish_via_partition_root = 'true');
> >
> > * On subscriber
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values In (2);
> > create table c2 partition of p for values In (1);
> > create subscription test_sub connection 'port=5551 dbname=postgres'
> > publication test_pub with (streaming = 'parallel', copy_data =
> > 'false');
> >
> > Note that while both the publisher and the subscriber have tables with the
> > same names, the partition structure is different and rows go to a
> > different table on the subscriber (eg, row c=1 will go to the c2 table on
> > the subscriber). If two concurrent transactions are executed as follows,
> > the apply worker (ie, the leader apply worker) waits for a lock on c2
> > held by its parallel apply worker:
> >
> > * TX-1
> > BEGIN;
> > INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are
> > streamed
> >
> >     * TX-2
> >     BEGIN;
> >     TRUNCATE c2; --- wait for a lock on c2
> >
> > * TX-1
> > INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
> > COMMIT;
> >
> > This might not be a common case in practice but it could mean that
> > there is a restriction on how partitioned tables should be structured
> > on the publisher and the subscriber when using streaming = 'parallel'.
> > When this happens, since the logical replication cannot move forward
> > the users need to disable parallel-apply mode or increase
> > logical_decoding_work_mem. We could describe this limitation in the
> > doc but it would be hard for users to detect problematic table
> > structure.
>
> Thanks for testing this!
>
> I think the root reason for this kind of deadlock problem is the table
> structure difference between the publisher and subscriber (similar to the
> unique key difference reported earlier[1]). So, I think we'd better disallow this
> case. For example, to avoid the reported problem, we could only support parallel
> apply if pubviaroot is false on the publisher and the replicated tables' types
> (relkind) are the same between publisher and subscriber.
>
> Although it might restrict some use cases, I think it only restricts the
> cases where the partitioned table's structure is different between publisher and
> subscriber. Users can still use parallel apply for cases when the table
> structure is the same between publisher and subscriber, which seems acceptable
> to me. And we can also document that the feature is expected to be used for the
> case when the tables' structure is the same. Thoughts ?

I'm concerned that it could be a big restriction for users. Having
different partitioned table structures on the publisher and the
subscriber is a quite common use case.

From the feature perspective, the root cause seems to be the fact that
the apply worker does both receiving and applying changes. Since it
cannot receive subsequent messages while waiting for a lock on a
table, the parallel apply worker also cannot move forward. If we had
a dedicated receiver process, it could off-load the messages to the
worker while another process is waiting for a lock. So I think that
separating the receiver and apply worker could be a building block for
parallel apply.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 6, 2022 at 9:04 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I think the root reason for this kind of deadlock problems is the table
> > structure difference between publisher and subscriber(similar to the unique
> > difference reported earlier[1]). So, I think we'd better disallow this case. For
> > example to avoid the reported problem, we could only support parallel apply if
> > pubviaroot is false on publisher and replicated tables' types(relkind) are the
> > same between publisher and subscriber.
> >
> > Although it might restrict some use cases, but I think it only restrict the
> > cases when the partitioned table's structure is different between publisher and
> > subscriber. User can still use parallel apply for cases when the table
> > structure is the same between publisher and subscriber which seems acceptable
> > to me. And we can also document that the feature is expected to be used for the
> > case when tables' structure are the same. Thoughts ?
>
> I'm concerned that it could be a big restriction for users. Having
> different partitioned table's structures on the publisher and the
> subscriber is quite common use cases.
>
> From the feature perspective, the root cause seems to be the fact that
> the apply worker does both receiving and applying changes. Since it
> cannot receive the subsequent messages while waiting for a lock on a
> table, the parallel apply worker also cannot move forward. If we have
> a dedicated receiver process, it can off-load the messages to the
> worker while another process waiting for a lock. So I think that
> separating receiver and apply worker could be a building block for
> parallel-apply.
>

I think the disadvantage that comes to mind is the overhead of passing
messages between the receiver and applier processes even for non-parallel
cases. Now, I don't think it is advisable to have separate handling
for non-parallel cases. The other thing is that we need to somehow
deal with feedback messages, which help to move synchronous replicas
and update the subscriber's progress, which in turn helps to keep the
restart point updated. These messages also act as heartbeat messages
between the walsender and walapply process.

To deal with this, one idea is that we can have two connections to the
walsender process, one from the walreceiver and the other from the walapply
process, which in my view could lead to a big increase in resource
consumption and would bring another set of complexities into the
system. Now, in this, I think we have two possibilities: (a) The first
one is that we pass all messages to the leader apply worker and then
it decides whether to execute serially or pass them to a parallel
apply worker. However, that can again deadlock in the truncate
scenario we discussed because the main apply worker won't be able to
receive new messages once it is blocked at the truncate command. (b)
The second one is that the walreceiver process itself takes care of passing
streaming transactions to parallel apply workers, but if we do that
then the walreceiver needs to wait at the transaction end to maintain
the commit order, which means it can also lead to a deadlock if the
truncate happens in a streaming xact.

The other alternative is that we allow the walreceiver process to wait for
the apply process to finish the transaction and then send the feedback, but
that again seems to be an overhead if we have to do it even for small
transactions; in particular, it can delay synchronous replication cases. Even
if we don't consider the overhead, it can still lead to a deadlock because
the walreceiver won't be able to move forward in the scenario we are discussing.

Regarding your point about having different partition structures on the
publisher and subscriber, I don't know how common that will be once we
have DDL replication. Also, the default value of
publish_via_partition_root is false, which doesn't seem to indicate
that this is a quite common case.

We have fixed quite a few issues in this area in the last release or
two which were found during development, so I am not sure whether such setups
are used often in the field, or whether it is just a coincidence. Also, it
will only matter if there are large transactions performed on such
tables, and I don't think it will be easy to predict whether those are
common or not.

-- 
With Regards,
Amit Kapila.



On Fri, Oct 7, 2022 at 8:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Oct 6, 2022 at 10:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 30, 2022 at 1:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Here are my review comments for the v35-0001 patch:
> > >
> > > ======
> > >
> > > 3. GENERAL
> > >
> > > (this comment was written after I wrote all the other ones below so
> > > there might be some unintended overlaps...)
> > >
> > > I found the mixed use of the same member names having different
> > > meanings to be quite confusing.
> > >
> > > e.g.1
> > > PGOutputData 'streaming' is now a single char internal representation
> > > the subscription parameter streaming mode ('f','t','p')
> > > - bool streaming;
> > > + char streaming;
> > >
> > > e.g.2
> > > WalRcvStreamOptions 'streaming' is a C string version of the
> > > subscription streaming mode ("on", "parallel")
> > > - bool streaming; /* Streaming of large transactions */
> > > + char    *streaming; /* Streaming of large transactions */
> > >
> > > e.g.3
> > > SubOpts 'streaming' is again like the first example - a single char
> > > for the mode.
> > > - bool streaming;
> > > + char streaming;
> > >
> > >
> > > IMO everything would become much simpler if you did:
> > >
> > > 3a.
> > > Rename "char streaming;" -> "char streaming_mode;"
> > >
> > > 3b.
> > > Re-designed the "char *streaming;" code to also use the single char
> > > notation, then also call that member 'streaming_mode'. Then everything
> > > will be consistent.
> > >
> >
> > Won't this impact the previous version publisher which already uses
> > on/off? We may need to maintain multiple values which would be
> > confusing.
> >
>
> I only meant that the *internal* struct member names mentioned could
> change - not anything exposed as user-visible parameter names or
> column names etc. Or were you referring to it as causing unnecessary
> troubles for back-patching? Anyway, the main point of this review
> comment was #3b.
>

My response was for 3b only.

> Unless I am mistaken, there is no reason why that one
> cannot be changed to use 'char' instead of 'char *', for consistency
> across all the same named members.
>

I feel this will bring more complexity to the code if you have to keep
it working with old-version publishers.

> > >
> > > 9. - parallel_apply_can_start
> > >
> > > +/*
> > > + * Returns true, if it is allowed to start a parallel apply worker, false,
> > > + * otherwise.
> > > + */
> > > +static bool
> > > +parallel_apply_can_start(TransactionId xid)
> > >
> > > (The commas are strange)
> > >
> > > SUGGESTION
> > > Returns true if it is OK to start a parallel apply worker, false otherwise.
> > >
> >
> > +1 for this.
> > >
> > > 28. - logicalrep_worker_detach
> > >
> > > + /* Stop the parallel apply workers. */
> > > + if (am_leader_apply_worker())
> > > + {
> > >
> > > Should that comment rather say like below?
> > >
> > > /* If this is the leader apply worker then stop all of its parallel
> > > apply workers. */
> > >
> >
> > I think this would be just saying what is apparent from the code, so
> > not sure if it is an improvement.
> >
> > >
> > > 38. - apply_handle_commit_prepared
> > >
> > > + *
> > > + * Note that we don't need to wait here if the transaction was prepared in a
> > > + * parallel apply worker. Because we have already waited for the prepare to
> > > + * finish in apply_handle_stream_prepare() which will ensure all the operations
> > > + * in that transaction have happened in the subscriber and no concurrent
> > > + * transaction can create deadlock or transaction dependency issues.
> > >   */
> > >  static void
> > >  apply_handle_commit_prepared(StringInfo s)
> > >
> > > "worker. Because" -> "worker because"
> > >
> >
> > I think this will make this line too long. Can we think of breaking it
> > in some way?
>
> OK, how about below:
>
> Note that we don't need to wait here if the transaction was prepared
> in a parallel apply worker. In that case, we have already waited for
> the prepare to finish in apply_handle_stream_prepare() which will
> ensure all the operations in that transaction have happened in the
> subscriber, so no concurrent transaction can cause deadlock or
> transaction dependency issues.
>

Yeah, this looks better.

> >
> > > 58.
> > >
> > > --- fail - streaming must be boolean
> > > +-- fail - streaming must be boolean or 'parallel'
> > >  CREATE SUBSCRIPTION regress_testsub CONNECTION
> > > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> > > false, streaming = foo);
> > >
> > > I think there are tests already for explicitly create/set the
> > > subscription parameter streaming = on/off/parallel
> > >
> > > But what about when there is no value explicitly specified? Shouldn't
> > > there also be tests like below to check that *implied* boolean true
> > > still works for this enum?
> > >
> > > CREATE SUBSCRIPTION ... WITH (streaming)
> > > ALTER SUBSCRIPTION ... SET (streaming)
> > >
> >
> > I think before adding new tests for this, please check if we have any
> > similar tests for other boolean options.
>
> IMO this one is a bit different because it's not really a boolean
> option anymore - it's a kind of a hybrid boolean/enum. That's why I
> thought this ought to be tested regardless if there are existing tests
> for the (normal) boolean options.
>

I am not really sure if adding such tests is valuable, but if Hou-san
and you feel it is good to have them then I am fine with it.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, October 6, 2022 8:40 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> I put comments for v35-0001.

Thanks for the comments.

> 01. catalog.sgml
> 
> ```
> +       Controls how to handle the streaming of in-progress transactions:
> +       <literal>f</literal> = disallow streaming of in-progress transactions,
> +       <literal>t</literal> = spill the changes of in-progress transactions to
> +       disk and apply at once after the transaction is committed on the
> +       publisher,
> +       <literal>p</literal> = apply changes directly using a parallel apply
> +       worker if available (same as 't' if no worker is available)
> ```
> 
> I'm not sure why 't' means "spill the changes to file". Is it a compatibility issue?

Yes, I think it would be better to be consistent with the previous version.

> ~~~
> 02. applyworker.c - parallel_apply_stream_abort
> 
> The argument abort_data is not modified in the function. Maybe "const"
> modifier should be added.
> (Other functions should be also checked...)

I am not sure whether it is necessary to add const here, as I didn't
find many similar examples in the existing code.

> ~~~
> 03. applyparallelworker.c - parallel_apply_find_worker
> 
> ```
> +       ParallelApplyWorkerEntry *entry = NULL;
> ```
> 
> This may not have to be initialized here.

Fixed.

> ~~~
> 04. applyparallelworker.c - HandleParallelApplyMessages
> 
> ```
> +       static MemoryContext hpm_context = NULL;
> ```
> 
> I think "hpm" means "handle parallel message", so it should be "hpam".

Fixed.

> ~~~
> 05. launcher.c - logicalrep_worker_launch()
> 
> ```
>     if (is_subworker)
>         snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication
> parallel worker");
>     else
>         snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication
> worker"); ```
> 
> I'm not sure why there are only two bgw_type values even though there are three
> types of apply workers. Is it for compatibility?

Yeah, it's for compatibility.

> ~~~
> 06. launcher.c - logicalrep_worker_stop_by_slot
> 
> An assertion like Assert(slot_no >=0 && slot_no <
> max_logical_replication_workers) should be added at the top of this function.
>

Fixed.

> ~~~
> 07. launcher.c - logicalrep_worker_stop_internal
> 
> ```
> +/*
> + * Workhorse for logicalrep_worker_stop(), logicalrep_worker_detach()
> +and
> + * logicalrep_worker_stop_by_slot(). Stop the worker and wait for it to die.
> + */
> +static void
> +logicalrep_worker_stop_internal(LogicalRepWorker *worker)
> ```
> 
> I think logicalrep_worker_stop_internal() may not be the "workhorse" for
> logicalrep_worker_detach(). In that function the internal function is only
> called for parallel apply workers, and it is not the main part of the detach
> processing.
> 
> ~~~
> 08. worker.c - handle_streamed_transaction()
> 
> ```
> +       TransactionId current_xid = InvalidTransactionId;
> ```
> 
> This initialization is not needed. This is not used in non-streaming mode,
> and otherwise it is assigned before being used.

Fixed.

> ~~~
> 09. worker.c - handle_streamed_transaction()
> 
> ```
> +               case TRANS_PARALLEL_APPLY:
> +                       /* Define a savepoint for a subxact if needed. */
> +                       parallel_apply_start_subtrans(current_xid, stream_xid);
> +                       return false;
> ```
> 
> Based on the other case blocks, Assert(am_parallel_apply_worker()) may be added
> at the top of this part.
> The same suggestion applies to other switch-case statements.

I feel that since apply_action is returned by the nearby
get_transaction_apply_action() call, we can only be in a parallel apply worker
here. So, I am not sure if the assert is necessary.

> ~~~
> 10. worker.c - apply_handle_stream_start
> 
> ```
> + *
> + * XXX We can avoid sending pair of the START/STOP messages to the
> + parallel
> + * worker because unlike apply worker it will process only one
> + * transaction-at-a-time. However, it is not clear whether that is
> + worth the
> + * effort because it is sent after logical_decoding_work_mem changes.
> ```
> 
> I can understand that the START message is not needed, but is STOP really
> removable? If the leader does not send STOP to its child, does it lose a chance
> to change the worker state to IDLE_IN_TRANSACTION?

Fixed.

> ~~~
> 11. worker.c - apply_handle_stream_start
> 
> Currently the number of received chunks is not counted, but it could be if a
> variable "nchunks" is defined and incremented in apply_handle_stream_start().
> This info may be useful to determine an appropriate
> logical_decoding_work_mem for workloads. What do you think?

Since we don't have a similar DEBUG message for "streaming=on" mode, I feel
we can leave this for now and add it later as a separate patch if needed.
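
For reference, the suggestion would amount to something like the following
sketch (the counter, message text and placement are made up here, not from the
patch):

/* File-level counter (sketch): stream chunks received for the current xact. */
static uint32 nchunks = 0;

/* In apply_handle_stream_start(): */
	if (first_segment)
		nchunks = 0;
	nchunks++;

/* In apply_handle_stream_commit(): */
	elog(DEBUG1, "received %u chunks for streamed transaction %u",
		 nchunks, xid);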

> ~~~
> 12. worker.c - get_transaction_apply_action
> 
> {} are not needed.

I am fine with either style here, so I didn’t change this.


Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, September 30, 2022 4:27 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for the v35-0001 patch:

Thanks for the comments.


> 3. GENERAL
> I found the mixed use of the same member names having different meanings to be quite confusing.
> 
> e.g.1
> PGOutputData 'streaming' is now a single char internal representation of the
> subscription parameter streaming mode ('f','t','p')
> - bool streaming;
> + char streaming;
> 
> e.g.2
> WalRcvStreamOptions 'streaming' is a C string version of the subscription streaming mode ("on", "parallel")
> - bool streaming; /* Streaming of large transactions */
> + char    *streaming; /* Streaming of large transactions */
> 
> e.g.3
> SubOpts 'streaming' is again like the first example - a single char for the mode.
> - bool streaming;
> + char streaming;
> 
> 
> IMO everything would become much simpler if you did:
> 
> 3a.
> Rename "char streaming;" -> "char streaming_mode;"

The word 'streaming' is the same as the actual option name, so personally I think it's fine.
But if others also agree that the name should be improved, I can change it.

> 
> 3b. Re-designed the "char *streaming;" code to also use the single char
> notation, then also call that member 'streaming_mode'. Then everything will
> be > consistent.

If we used a single byte (char) here, we would need to compare it with the standard
streaming option value in libpqwalreceiver.c, which it was suggested not to do[1].


> 4. - max_parallel_apply_workers_per_subscription
> +       </para>
> +       <para>
> +        The parallel apply workers are taken from the pool defined by
> +        <varname>max_logical_replication_workers</varname>.
> +       </para>
> +       <para>
> +        The default value is 2. This parameter can only be set in the
> +        <filename>postgresql.conf</filename> file or on the server command
> +        line.
> +       </para>
> +      </listitem>
> +     </varlistentry>
> 
> I felt that maybe this should also xref to the
> doc/src/sgml/logical-replication.sgml section where you say about
> "max_logical_replication_workers should be increased according to the
> desired number of parallel apply workers."

Not sure about this, as we don't have a similar thing in the documentation of
max_logical_replication_workers and max_sync_workers_per_subscription.


> ======
> 
> 7. src/backend/access/transam/xact.c - RecordTransactionAbort
> 
> 
> + /*
> + * Are we using the replication origins feature?  Or, in other words, 
> + are
> + * we replaying remote actions?
> + */
> + replorigin = (replorigin_session_origin != InvalidRepOriginId &&
> +   replorigin_session_origin != DoNotReplicateId);
> 
> "Or, in other words," -> "In other words,"

I think it is better to keep this consistent with the comments in the
RecordTransactionCommit function.


> 10b.
> IMO this flag might be better to be called 'parallel_apply_enabled' or something similar.
> (see also review comment #55b.)

Not sure about this.

> 12. - parallel_apply_free_worker
> 
> + SpinLockAcquire(&winfo->shared->mutex);
> + slot_no = winfo->shared->logicalrep_worker_slot_no;
> + generation = winfo->shared->logicalrep_worker_generation;
> + SpinLockRelease(&winfo->shared->mutex);
> 
> I know there are not many places doing this, but do you think it might be
> worth introducing some new set/get function to encapsulate the set/get of the
> generation/slot so it does the mutex spin-locks in common code?

Not sure about this.
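
For illustration, such a helper might look like the sketch below (the function
name is made up here and not part of the patch; the shared fields are the ones
shown in the quoted code):

static void
parallel_apply_get_slot_and_generation(ParallelApplyWorkerInfo *winfo,
									   int *slot_no, uint16 *generation)
{
	SpinLockAcquire(&winfo->shared->mutex);
	*slot_no = winfo->shared->logicalrep_worker_slot_no;
	*generation = winfo->shared->logicalrep_worker_generation;
	SpinLockRelease(&winfo->shared->mutex);
}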

> 13. - LogicalParallelApplyLoop
> 
> + /*
> + * Init the ApplyMessageContext which we clean up after each 
> + replication
> + * protocol message.
> + */
> + ApplyMessageContext = AllocSetContextCreate(ApplyContext,
> + "ApplyMessageContext",
> + ALLOCSET_DEFAULT_SIZES);
> 
> Because this is in the parallel apply worker, should the name (e.g. the 2nd
> param) be changed to "ParallelApplyMessageContext"?

Not sure about this, because ApplyMessageContext is used in both worker.c and
applyparallelworker.c.


> + else if (is_subworker)
> + snprintf(bgw.bgw_name, BGW_MAXLEN,
> + "logical replication parallel apply worker for subscription %u", 
> + subid);
>   else
>   snprintf(bgw.bgw_name, BGW_MAXLEN,
>   "logical replication worker for subscription %u", subid);
> 
> I think that *last* text should now be changed like below:
> 
> BEFORE
> "logical replication worker for subscription %u"
> AFTER
> "logical replication apply worker for subscription %u"

I am not sure if it's a good idea to change the existing process description.


> 36 - should_apply_changes_for_rel
>  should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)  {
>   if (am_tablesync_worker())
>   return MyLogicalRepWorker->relid == rel->localreloid;
> + else if (am_parallel_apply_worker())
> + {
> + if (rel->state != SUBREL_STATE_READY)
> + ereport(ERROR,
> + (errmsg("logical replication apply workers for subscription \"%s\"
> will restart",
> + MySubscription->name),
> + errdetail("Cannot handle streamed replication transaction using parallel "
> +    "apply workers until all tables are synchronized.")));
> +
> + return true;
> + }
>   else
>   return (rel->state == SUBREL_STATE_READY ||
>   (rel->state == SUBREL_STATE_SYNCDONE && @@ -427,43 +519,87 @@ end_replication_step(void)
> 
> This function can be made tidier just by removing all the 'else' ...

I feel the current style looks better.


> 40. - apply_handle_stream_prepare
> 
> + case TRANS_LEADER_SERIALIZE:
> 
> - /* Mark the transaction as prepared. */
> - apply_handle_prepare_internal(&prepare_data);
> + /*
> + * The transaction has been serialized to file, so replay all the
> + * spooled operations.
> + */
> 
> Spurious blank line after the 'case'.

Personally, I think this style is fine.


> 48. - ApplyWorkerMain
> 
> +/* Logical Replication Apply worker entry point */ void 
> +ApplyWorkerMain(Datum main_arg)
> 
> "Apply worker" -> "apply worker"

Since it's the existing comment, I feel we can leave this.


> + /*
> + * We don't currently need any ResourceOwner in a walreceiver process, 
> + but
> + * if we did, we could call CreateAuxProcessResourceOwner here.
> + */
> 
> I think this comment should have "XXX" prefix.

I am not sure as this comment is just a reminder.


> 50.
> 
> + if (server_version >= 160000 &&
> + MySubscription->stream == SUBSTREAM_PARALLEL)
> + {
> + options.proto.logical.streaming = pstrdup("parallel");
> + MyLogicalRepWorker->parallel_apply = true;
> + }
> + else if (server_version >= 140000 &&
> + MySubscription->stream != SUBSTREAM_OFF)
> + options.proto.logical.streaming = pstrdup("on"); else 
> + options.proto.logical.streaming = NULL;
> 
> IMO it might make more sense for these conditions to be checking the
> 'options.proto.logical.proto_version' here instead of checking the hardwired
> server versions. Also, I suggest it may be better (for clarity) to always
> assign the parallel_apply member.

Currently, the proto_version is only checked on the publisher, so I am not sure if
it's a good idea to check it here.

> 52. - get_transaction_apply_action
> 
> + /*
> + * Check if we are processing this transaction using a parallel apply
> + * worker and if so, send the changes to that worker.
> + */
> + else if ((*winfo = parallel_apply_find_worker(xid)))  {  return 
> +TRANS_LEADER_SEND_TO_PARALLEL;  }  else  {  return 
> +TRANS_LEADER_SERIALIZE;  } }
> 
> 52a.
> All these if/else and code blocks seem excessive. It can be simplified as follows:

I feel this style is fine.

> 52b.
> Can a tablesync worker ever get here? It might be better to
> Assert(!am_tablesync_worker()); at top of this function?

Not sure if it's necessary or not.


> 55b.
> IMO this member name should be named slightly different to give a better feel
> for what it really means.
> 
> Maybe something like one of:
> "parallel_apply_ok"
> "parallel_apply_enabled"
> "use_parallel_apply"
> etc?

I feel the current name is fine. But if others also feel it should be changed, I can
try to rename it.

> 57. - am_leader_apply_worker
> 
> +static inline bool
> +am_leader_apply_worker(void)
> +{
> + return (!OidIsValid(MyLogicalRepWorker->relid) &&  
> +!isParallelApplyWorker(MyLogicalRepWorker));
> +}
> 
> I wondered if it would be tidier/easier to define this function like below.
> The others are inline functions anyhow so it should end up as the same
> thing, right?
> 
> static inline bool
> am_leader_apply_worker(void)
> {
> return (!am_tablesync_worker() && !am_parallel_apply_worker()); }

I feel the current style is fine.

>--- fail - streaming must be boolean
>+-- fail - streaming must be boolean or 'parallel'
> CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist'
> PUBLICATION testpub WITH (connect = false, streaming = foo);

>
>I think there are tests already for explicitly create/set the subscription
>parameter streaming = on/off/parallel
>
>But what about when there is no value explicitly specified? Shouldn't there
>also be tests like below to check that *implied* boolean true still works for
>this enum?

I didn't find similar tests for the case where no value is explicitly specified,
so I didn't add this for now.

Attach the new version patch set which addressed most of the comments.

[1] https://www.postgresql.org/message-id/CAA4eK1LMVdS6uM7Tw7ANL0BetAd76TKkmAXNNQa0haTe2tax6g%40mail.gmail.com

Best regards,
Hou zj


Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Oct 6, 2022 at 9:04 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > I think the root reason for this kind of deadlock problems is the table
> > > structure difference between publisher and subscriber(similar to the unique
> > > difference reported earlier[1]). So, I think we'd better disallow this case. For
> > > example to avoid the reported problem, we could only support parallel apply if
> > > pubviaroot is false on publisher and replicated tables' types(relkind) are the
> > > same between publisher and subscriber.
> > >
> > > Although it might restrict some use cases, but I think it only restrict the
> > > cases when the partitioned table's structure is different between publisher and
> > > subscriber. User can still use parallel apply for cases when the table
> > > structure is the same between publisher and subscriber which seems acceptable
> > > to me. And we can also document that the feature is expected to be used for the
> > > case when tables' structure are the same. Thoughts ?
> >
> > I'm concerned that it could be a big restriction for users. Having
> > different partitioned table's structures on the publisher and the
> > subscriber is quite common use cases.
> >
> > From the feature perspective, the root cause seems to be the fact that
> > the apply worker does both receiving and applying changes. Since it
> > cannot receive the subsequent messages while waiting for a lock on a
> > table, the parallel apply worker also cannot move forward. If we have
> > a dedicated receiver process, it can off-load the messages to the
> > worker while another process waiting for a lock. So I think that
> > separating receiver and apply worker could be a building block for
> > parallel-apply.
> >
>
> I think the disadvantage that comes to mind is the overhead of passing
> messages between receiver and applier processes even for non-parallel
> cases. Now, I don't think it is advisable to have separate handling
> for non-parallel cases. The other thing is that we need to someway
> deal with feedback messages which helps to move synchronous replicas
> and update subscriber's progress which in turn helps to keep the
> restart point updated. These messages also act as heartbeat messages
> between walsender and walapply process.
>
> To deal with this, one idea is that we can have two connections to
> walsender process, one with walreceiver and the other with walapply
> process which according to me could lead to a big increase in resource
> consumption and it will bring another set of complexities in the
> system. Now, in this, I think we have two possibilities, (a) The first
> one is that we pass all messages to the leader apply worker and then
> it decides whether to execute serially or pass it to the parallel
> apply worker. However, that can again deadlock in the truncate
> scenario we discussed because the main apply worker won't be able to
> receive new messages once it is blocked at the truncate command. (b)
> The second one is walreceiver process itself takes care of passing
> streaming transactions to parallel apply workers but if we do that
> then walreceiver needs to wait at the transaction end to maintain
> commit order which means it can also lead to deadlock in case the
> truncate happens in a streaming xact.

I imagined (b) but I had missed the point of preserving the commit
order. Separating the receiver and apply worker cannot resolve this
problem.

>
> The other alternative is that we allow walreceiver process to wait for
> apply process to finish transaction and send the feedback but that
> seems to be again an overhead if we have to do it even for small
> transactions, especially it can delay sync replication cases. Even, if
> we don't consider overhead, it can still lead to a deadlock because
> walreceiver won't be able to move in the scenario we are discussing.
>
> About your point that having different partition structures for
> publisher and subscriber, I don't know how common it will be once we
> have DDL replication. Also, the default value of
> publish_via_partition_root is false which doesn't seem to indicate
> that this is a quite common case.

So how should we deal with these concurrency issues that can happen only
when streaming = 'parallel'? Can we restrict some use cases to avoid
the problem, or can we have a safeguard against these conflicts? We
could find a new problematic scenario in the future, and if that happens
and logical replication gets stuck, it cannot be resolved by the apply
workers alone.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Oct 7, 2022 at 14:18 PM Hou, Zhijie/侯 志杰 <houzj.fnst@cn.fujitsu.com> wrote:
> Attach the new version patch set which addressed most of the comments.

Rebased the patch set because of a new change in HEAD (776e1c8).

Attach the new patch set.

Regards,
Wang wei

On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > About your point that having different partition structures for
> > publisher and subscriber, I don't know how common it will be once we
> > have DDL replication. Also, the default value of
> > publish_via_partition_root is false which doesn't seem to indicate
> > that this is a quite common case.
>
> So how can we consider these concurrent issues that could happen only
> when streaming = 'parallel'? Can we restrict some use cases to avoid
> the problem or can we have a safeguard against these conflicts?
>

Yeah, right now the strategy is to disallow parallel apply for such
cases as you can see in *0003* patch.

> We
> could find a new problematic scenario in the future and if it happens,
> logical replication gets stuck, it cannot be resolved only by apply
> workers themselves.
>

I think users can change streaming option to on/off and internally the
parallel apply worker can detect and restart to allow replication to
proceed. Having said that, I think that would be a bug in the code and
we should try to fix it. We may need to disable parallel apply in the
problematic case.

The other ideas that occurred to me in this regard are (a) provide a
reloption (say parallel_apply) at table level and we can use that to
bypass various checks like different Unique Key between
publisher/subscriber, constraints/expressions having mutable
functions, Foreign Key (when enabled on subscriber), operations on
Partitioned Table. We can't detect whether those are safe or not
(primarily because of a different structure in publisher and
subscriber) so we prohibit parallel apply but if users use this
option, we can allow it even in those cases. (b) While enabling the
parallel option in the subscription, we can try to match all the
table(s) information of the publisher/subscriber. It will be tricky to
make this work because say even if match some trigger function name,
we won't be able to match the function body. The other thing is when
at a later point the table definition is changed on the subscriber, we
need to again validate the information between publisher and
subscriber which I think would be difficult as we would be already in
between processing some message and getting information from the
publisher at that stage won't be possible.

Thoughts?

-- 
With Regards,
Amit Kapila.



Here are some review comments for v36-0001.

======

1. GENERAL

Houzj wrote ([1] #3a):
The word 'streaming' is the same as the actual option name, so
personally I think it's fine. But if others also agreed that the name
can be improved, I can change it.

~

Sure, I was not really complaining that the name is "wrong". I just
did not think it was a good idea to have multiple struct members
called 'streaming' when they don't have the same meaning, e.g. one is
the internal character mode equivalent of the parameter, and one is
the parameter value as a string. That's why I thought they should have
different names, e.g. make the 2nd one 'streaming_valstr' or
something.

======

2. doc/src/sgml/config.sgml

Previously I suggested there should be xrefs to the "Configuration
Settings" page but Houzj wrote ([1] #4):
Not sure about this as we don't have similar thing in the document of
max_logical_replication_workers and max_sync_workers_per_subscription.

~

Fair enough, but IMO perhaps all those others should also xref to the
"Configuration Settings" chapter. So if such a change does not belong
in this patch, then how about I start another independent thread to
post this suggestion?

======

.../replication/logical/applyparallelworker.c


3. parallel_apply_find_worker

+parallel_apply_find_worker(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry = NULL;
+
+ if (!TransactionIdIsValid(xid))
+ return NULL;
+
+ if (ParallelApplyWorkersHash == NULL)
+ return NULL;
+
+ /* Return the cached parallel apply worker if valid. */
+ if (stream_apply_worker != NULL)
+ return stream_apply_worker;
+
+ /*
+ * Find entry for requested transaction.
+ */
+ entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_FIND, &found);

In function parallel_apply_start_worker() you removed the entry
assignment to NULL because it is never needed. We can do the same here
too.

~~~

4. parallel_apply_free_worker

+/*
+ * Remove the parallel apply worker entry from the hash table. And stop the
+ * worker if there are enough workers in the pool. For more information about
+ * the worker pool, see comments atop worker.c.
+ */
+void
+parallel_apply_free_worker(ParallelApplyWorkerInfo *winfo, TransactionId xid)

"And stop" -> "Stop"

~~~

5. parallel_apply_free_worker

+ * Although some error messages may be lost in rare scenarios, but
+ * since the parallel apply worker has finished processing the
+ * transaction, and error messages may be lost even if we detach the
+ * error queue after terminating the process. So it should be ok.
+ */

SUGGESTION (minor rewording)
Some error messages may be lost in rare scenarios, but it should be OK
because the parallel apply worker has finished processing the
transaction, and error messages may be lost even if we detached the
error queue after terminating the process.

~~~

6. LogicalParallelApplyLoop

+ for (;;)
+ {
+ void    *data;
+ Size len;
+ int c;
+ StringInfoData s;
+ MemoryContext oldctx;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Ensure we are reading the data into our memory context. */
+ oldctx = MemoryContextSwitchTo(ApplyMessageContext);
+
...
+
+ MemoryContextSwitchTo(oldctx);
+ MemoryContextReset(ApplyMessageContext);
+ }

Do those memory context switches need to happen inside the for(;;)
loop like that? I thought perhaps those can be done *outside* of the
loop instead of always switching and switching back on the next
iteration.

~~~

7. LogicalParallelApplyLoop

Previously I suggested maybe the name (e.g. the 2nd param) should be
changed to "ParallelApplyMessageContext"? Houzj wrote ([1] #13): Not
sure about this, because ApplyMessageContext is used in both worker.c
and applyparallelworker.c.

~

But I thought those are completely independent ApplyMessageContext's
in different processes that happen to have the same name. Shouldn't
they have a name appropriate to who owns them?

~~~

8. ParallelApplyWorkerMain

+ /*
+ * Allocate the origin name in a long-lived context for error context
+ * message.
+ */
+ snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);

Now that ReplicationOriginNameForLogicalRep patch is pushed [2] please
make use of this common function.
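
For example (a sketch, assuming the helper from the referenced commit takes the
subscription OID, a relation OID - InvalidOid for the subscription's own origin -
and the output buffer with its size), the above could become:

	ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
									   originname, sizeof(originname));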

~~~

9. HandleParallelApplyMessage

+ case 'X': /* Terminate, indicating clean exit */
+ {
+ shm_mq_detach(winfo->error_mq_handle);
+ winfo->error_mq_handle = NULL;
+ break;
+ }
+
+ /*
+ * Don't need to do anything about NoticeResponse and
+ * NotifyResponse as the logical replication worker doesn't need
+ * to send messages to the client.
+ */
+ case 'N':
+ case 'A':
+ break;
+ default:
+ {
+ elog(ERROR, "unrecognized message type received from parallel apply
worker: %c (message length %d bytes)",
+ msgtype, msg->len);
+ }

9a. case 'X':
There are no variable declarations here so the statement block {} is not needed
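
For example, 9a could simply become (a sketch of the same quoted code with the
braces dropped, no behaviour change intended):

	case 'X':			/* Terminate, indicating clean exit */
		shm_mq_detach(winfo->error_mq_handle);
		winfo->error_mq_handle = NULL;
		break;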

~

9b. default:
There are no variable declarations here so the statement block {} is not needed

~~~

10. parallel_apply_stream_abort

+ int i;
+ bool found = false;
+ char spname[MAXPGPATH];
+
+ parallel_apply_savepoint_name(MySubscription->oid, subxid, spname,
+   sizeof(spname));

I posted about using NAMEDATALEN in a previous review ([3] #21) but I
think only one place was fixed and this one was missed.

~~~

11. parallel_apply_replorigin_setup

+ snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);
+ originid = replorigin_by_name(originname, false);
+ replorigin_session_setup(originid);
+ replorigin_session_origin = originid;

Same as #8. Please call the new ReplicationOriginNameForLogicalRep function.

======

src/backend/replication/logical/launcher.c

12. logicalrep_worker_launch

Previously I suggested may the apply process name should change

FROM
"logical replication worker for subscription %u"
TO
"logical replication apply worker for subscription %u"

and Houz wrote ([1] #13)
I am not sure if it's a good idea to change existing process description.

~

But that seems inconsistent to me because elsewhere this patch is
already exposing the name to the user (like when it says "logical
replication apply worker for subscription \"%s\" has started").
Shouldn't the process name match these logs?

======

src/backend/replication/logical/worker.c

13. apply_handle_stream_start

+ *
+ * XXX We can avoid sending pairs of the START messages to the parallel worker
+ * because unlike apply worker it will process only one transaction-at-a-time.
+ * However, it is not clear whether that is worth the effort because it is sent
+ * after logical_decoding_work_mem changes.
  */
 static void
 apply_handle_stream_start(StringInfo s)

13a.
"transaction-at-a-time." -> "transaction at a time."

~

13b.
I was not sure what that last sentence means. Does it mean something like:
"However, it is not clear whether doing this is worth the effort
because pairs of START messages occur only after
logical_decoding_work_mem changes."

~~~

14. apply_handle_stream_start

+ ParallelApplyWorkerInfo *winfo = NULL;

The declaration *winfo assignment to NULL is not needed because
get_transaction_apply_action will always do this anyway.

~~~

15. apply_handle_stream_start

+
+ case TRANS_PARALLEL_APPLY:
+ break;

I had previously suggested this include a comment explaining why there
is nothing to do ([3] #44), but I think there was no reply.

~~~

16. apply_handle_stream_stop

 apply_handle_stream_stop(StringInfo s)
 {
+ ParallelApplyWorkerInfo *winfo = NULL;
+ TransApplyAction apply_action

The declaration *winfo assignment to NULL is not needed because
get_transaction_apply_action will always do this anyway.

~~~

17. serialize_stream_abort

+ ParallelApplyWorkerInfo *winfo = NULL;
+ TransApplyAction apply_action;

The declaration *winfo assignment to NULL is not needed because
get_transaction_apply_action will always do this anyway.

~~~

18. apply_handle_stream_commit

  LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo = NULL;
+ TransApplyAction apply_action;

The declaration *winfo assignment to NULL is not needed because
get_transaction_apply_action will always do this anyway.

~~~

19. ApplyWorkerMain

+
+/* Logical Replication Apply worker entry point */
+void
+ApplyWorkerMain(Datum main_arg)

Previously I suggested changing "Apply worker" to "apply worker", and
Houzj ([1] #48) replied:
Since it's the existing comment, I feel we can leave this.

~

Normally I agree we shouldn't change original code unrelated to the
patch, but in practice, I think no patch would be accepted that just
changes "A" to "a", so if you don't change it here in this patch
to be consistent then it will never happen. That's why I think it should
be part of this patch.

~~~

20. ApplyWorkerMain

+ /*
+ * We don't currently need any ResourceOwner in a walreceiver process, but
+ * if we did, we could call CreateAuxProcessResourceOwner here.
+ */

Previously I suggested prefixing this as "XXX" and Houzj replied ([1] #48):
I am not sure as this comment is just a reminder.

~

OK, then since it is a reminder, maybe it should be changed:
"We don't currently..." -> "Note: We don't currently..."

~~~

21. ApplyWorkerMain

+ if (server_version >= 160000 &&
+ MySubscription->stream == SUBSTREAM_PARALLEL)
+ {
+ options.proto.logical.streaming = pstrdup("parallel");
+ MyLogicalRepWorker->parallel_apply = true;
+ }
+ else if (server_version >= 140000 &&
+ MySubscription->stream != SUBSTREAM_OFF)
+ {
+ options.proto.logical.streaming = pstrdup("on");
+ MyLogicalRepWorker->parallel_apply = false;
+ }
+ else
+ {
+ options.proto.logical.streaming = NULL;
+ MyLogicalRepWorker->parallel_apply = false;
+ }

I think the block of if/else is only for assigning the
streaming/parallel members, so it should have a comment saying that:

SUGGESTION
Assign the appropriate streaming flag according to the 'streaming'
mode and the publisher's ability to support that mode.

~~~

22. get_transaction_apply_action

+static TransApplyAction
+get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo)
+{
+ *winfo = NULL;
+
+ if (am_parallel_apply_worker())
+ {
+ return TRANS_PARALLEL_APPLY;
+ }
+ else if (in_remote_transaction)
+ {
+ return TRANS_LEADER_APPLY;
+ }
+
+ /*
+ * Check if we are processing this transaction using a parallel apply
+ * worker and if so, send the changes to that worker.
+ */
+ else if ((*winfo = parallel_apply_find_worker(xid)))
+ {
+ return TRANS_LEADER_SEND_TO_PARALLEL;
+ }
+ else
+ {
+ return TRANS_LEADER_SERIALIZE;
+ }
+}

22a.

Previously I suggested the statement blocks are overkill and all the
{} should be removed, and Houzj ([1] #52a) wrote:
I feel this style is fine.

~

Sure, it is fine, but FWIW I thought it is not the normal PG coding
convention to use unnecessary {} unless it would seem strange to omit
them.
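
For example, just as a sketch of the same function quoted above with the
unnecessary braces removed (no behaviour change intended):

static TransApplyAction
get_transaction_apply_action(TransactionId xid, ParallelApplyWorkerInfo **winfo)
{
	*winfo = NULL;

	if (am_parallel_apply_worker())
		return TRANS_PARALLEL_APPLY;
	else if (in_remote_transaction)
		return TRANS_LEADER_APPLY;

	/*
	 * Check if we are processing this transaction using a parallel apply
	 * worker and if so, send the changes to that worker.
	 */
	else if ((*winfo = parallel_apply_find_worker(xid)))
		return TRANS_LEADER_SEND_TO_PARALLEL;
	else
		return TRANS_LEADER_SERIALIZE;
}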

~~

22b.
Also previously I had suggested

> Can a tablesync worker ever get here? It might be better to
> Assert(!am_tablesync_worker()); at top of this function?

and Houzj ([1] #52b) replied:
Not sure if it's necessary or not.

~

OTOH you could say no Assert is ever really necessary, but IMO adding
one here would at least be a sanity check and help to document the
function better.

======

23. src/test/regress/sql/subscription.sql

Previously I mentioned testing the 'streaming' option with no value.
Houzj replied ([1]):
I didn't find similar tests for no value explicitly specified cases,
so I didn't add this for now.

But as I also responded ([4] #58) already to Amit:
IMO this one is a bit different because it's not really a boolean
option anymore - it's a kind of a hybrid boolean/enum. That's why I
thought this ought to be tested regardless if there are existing tests
for the (normal) boolean options.

Anyway, you can decide what you want.

------
[1] Houzj replies to my v35 review
https://www.postgresql.org/message-id/OS0PR01MB5716B400CD81565E868616DB945F9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] ReplicationOriginNameForLogicalRep
https://github.com/postgres/postgres/commit/776e1c8a5d1494e345e5e1b16a5eba5e98aaddca
[3] My review v35
https://www.postgresql.org/message-id/CAHut%2BPvFENKb5fcMko5HHtNEAaZyNwGhu3PASrcBt%2BHFoFL%3DFw%40mail.gmail.com
[4] Explaining some v35 review comments
https://www.postgresql.org/message-id/CAHut%2BPscac%2BipFSFx89ACmacjPe4Dn%3DqVq8T0V%3DnQkv38QgnBw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Oct 6, 2022 at 6:09 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> ~~~
> 10. worker.c - apply_handle_stream_start
>
> ```
> + *
> + * XXX We can avoid sending pair of the START/STOP messages to the parallel
> + * worker because unlike apply worker it will process only one
> + * transaction-at-a-time. However, it is not clear whether that is worth the
> + * effort because it is sent after logical_decoding_work_mem changes.
> ```
>
> I can understand that START message is not needed, but is STOP really
> removable? If leader does not send STOP to its child, does it lose a chance
> to change the worker-state to IDLE_IN_TRANSACTION?
>

I think, if we want, we can set that state before we go to sleep in the
parallel apply worker. So, I guess ideally we don't need both of these
messages, but for now it is fine as mentioned in the comments.

-- 
With Regards,
Amit Kapila.



On Wed, Oct 12, 2022 at 3:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for v36-0001.
>
>
> 6. LogicalParallelApplyLoop
>
> + for (;;)
> + {
> + void    *data;
> + Size len;
> + int c;
> + StringInfoData s;
> + MemoryContext oldctx;
> +
> + CHECK_FOR_INTERRUPTS();
> +
> + /* Ensure we are reading the data into our memory context. */
> + oldctx = MemoryContextSwitchTo(ApplyMessageContext);
> +
> ...
> +
> + MemoryContextSwitchTo(oldctx);
> + MemoryContextReset(ApplyMessageContext);
> + }
>
> Do those memory context switches need to happen inside the for(;;)
> loop like that? I thought perhaps those can be done *outside* of the
> loop instead of always switching and switching back on the next
> iteration.
>

I think we need to reset the ApplyMessageContext each time after
processing a message, and we also don't want to process the config file in
the ApplyMessageContext.

> ======
>
> src/backend/replication/logical/launcher.c
>
> 12. logicalrep_worker_launch
>
> Previously I suggested may the apply process name should change
>
> FROM
> "logical replication worker for subscription %u"
> TO
> "logical replication apply worker for subscription %u"
>
> and Houz wrote ([1] #13)
> I am not sure if it's a good idea to change existing process description.
>
> ~
>
> But that seems inconsistent to me because elsewhere this patch is
> already exposing the name to the user (like when it says "logical
> replication apply worker for subscription \"%s\" has started".
> Shouldn’t the process name match these logs?
>

I think it is okay to change the name here, for the sake of consistency.
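
For reference, it would presumably just be the one format string in
logicalrep_worker_launch(), along these lines (a sketch only; the surrounding
variable names bgw/subid are assumed from the existing function):

    snprintf(bgw.bgw_name, BGW_MAXLEN,
             "logical replication apply worker for subscription %u", subid);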

>
> 19. ApplyWorkerMain
>
> +
> +/* Logical Replication Apply worker entry point */
> +void
> +ApplyWorkerMain(Datum main_arg)
>
> Previously I suggested changing "Apply worker" to "apply worker", and
> Houzj ([1] #48) replied:
> Since it's the existing comment, I feel we can leave this.
>
> ~
>
> Normally I agree don't change the original code unrelated to the
> patch, but in practice, I think no patch would be accepted that just
> changes just "A" to "a", so if you don't change it here in this patch
> to be consistent then it will never happen. That's why I think should
> be part of this patch.
>

Hmm, I think one might then extend this to a lot of other similar cosmetic
stuff in nearby areas. Unrelated changes can distract the reviewer, so it's
better to avoid them.

>
> 22. get_transaction_apply_action
>
> +static TransApplyAction
> +get_transaction_apply_action(TransactionId xid,
> ParallelApplyWorkerInfo **winfo)
> +{
> + *winfo = NULL;
> +
> + if (am_parallel_apply_worker())
> + {
> + return TRANS_PARALLEL_APPLY;
> + }
> + else if (in_remote_transaction)
> + {
> + return TRANS_LEADER_APPLY;
> + }
> +
> + /*
> + * Check if we are processing this transaction using a parallel apply
> + * worker and if so, send the changes to that worker.
> + */
> + else if ((*winfo = parallel_apply_find_worker(xid)))
> + {
> + return TRANS_LEADER_SEND_TO_PARALLEL;
> + }
> + else
> + {
> + return TRANS_LEADER_SERIALIZE;
> + }
> +}
>
> 22a.
>
> Previously I suggested the statement blocks are overkill and all the
> {} should be removed, and Houzj ([1] #52a) wrote:
> I feel this style is fine.
>
> ~
>
> Sure, it is fine, but FWIW I thought it is not the normal PG coding
> convention to use unnecessary {} unless it would seem strange to omit
> them.
>

Yeah, but here we have comments in between the else-if branches, which makes
using {} look better. I agree that this is mostly a question of personal
preference and we could go either way, but my preference would be to keep the
style the patch currently uses.

>
> 23. src/test/regress/sql/subscription.sql
>
> Previously I mentioned testing the 'streaming' option with no value.
> Houzj replied ([1]
> I didn't find similar tests for no value explicitly specified cases,
> so I didn't add this for now.
>
> But as I also responded ([4] #58) already to Amit:
> IMO this one is a bit different because it's not really a boolean
> option anymore - it's a kind of a hybrid boolean/enum. That's why I
> thought this ought to be tested regardless if there are existing tests
> for the (normal) boolean options.
>

I still feel this is not required. I think we have to be cautious
about not adding too many tests in this area that are of little or no
value.

--
With Regards,
Amit Kapila.



On Wed, Oct 12, 2022 at 7:41 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Oct 7, 2022 at 14:18 PM Hou, Zhijie/侯 志杰 <houzj.fnst@cn.fujitsu.com> wrote:
> > Attach the new version patch set which addressed most of the comments.
>
> Rebased the patch set because the new change in HEAD (776e1c8).
>
> Attach the new patch set.
>

+static void
+HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo msg)
{
...
+ case 'X': /* Terminate, indicating clean exit */
+ {
+ shm_mq_detach(winfo->error_mq_handle);
+ winfo->error_mq_handle = NULL;
+ break;
+ }
...
}

I don't see the use of this message in the patch. If this is not
required by the latest version then we can remove it and its
corresponding handling in parallel_apply_start_worker(). I am
referring to the below code in parallel_apply_start_worker():

+ if (tmp_winfo->error_mq_handle == NULL)
+ {
+ /*
+ * Release the worker information and try next one if the parallel
+ * apply worker exited cleanly.
+ */
+ ParallelApplyWorkersList =
+ foreach_delete_current(ParallelApplyWorkersList, lc);
+ shm_mq_detach(tmp_winfo->mq_handle);
+ dsm_detach(tmp_winfo->dsm_seg);
+ pfree(tmp_winfo);
+ }

--
With Regards,
Amit Kapila.



From "houzj.fnst@fujitsu.com":

On Friday, October 14, 2022 12:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Oct 12, 2022 at 7:41 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Fri, Oct 7, 2022 at 14:18 PM Hou, Zhijie/侯 志杰
> <houzj.fnst@cn.fujitsu.com> wrote:
> > > Attach the new version patch set which addressed most of the comments.
> >
> > Rebased the patch set because the new change in HEAD (776e1c8).
> >
> > Attach the new patch set.
> >
> 
> +static void
> +HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo
> +msg)
> {
> ...
> + case 'X': /* Terminate, indicating clean exit */ {
> + shm_mq_detach(winfo->error_mq_handle);
> + winfo->error_mq_handle = NULL;
> + break;
> + }
> ...
> }
> 
> I don't see the use of this message in the patch. If this is not required by the
> latest version then we can remove it and its corresponding handling in
> parallel_apply_start_worker(). I am referring to the below code in
> parallel_apply_start_worker():

Thanks for the comments, I removed these codes in the new version patch set.

I also did the following changes in the new version patch:

[0001] 
* Teach the parallel apply worker to catch subscription parameter changes in
the main loop, so that the user can change the streaming option to "on" to stop
the parallel apply workers in case the leader apply worker gets stuck because of
the deadlock problems discussed in [1] (see the sketch after this list).

* Some cosmetic changes.

* Address comments from Peter[2].

[0004]
* Disallow replicating from or to a partitioned table in parallel streaming
mode. This is to avoid the deadlock cases when the partitioned table's
inheritance structure is different between publisher and subscriber, as
discussed in [1].
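
As a rough sketch of the first point above, the idea is that the
SHM_MQ_WOULD_BLOCK branch of the parallel apply worker's main loop re-reads
the subscription (the surrounding switch and variables are as in the loop
code quoted later in this thread):

    case SHM_MQ_WOULD_BLOCK:
        if (!in_streamed_transaction)
        {
            /*
             * No message for a while: consume pending invalidation messages
             * and re-read the subscription, so that a 'streaming' change
             * from 'parallel' to 'on' is noticed and the worker can stop.
             */
            AcceptInvalidationMessages();
            maybe_reread_subscription();
        }
        break;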


[1] https://www.postgresql.org/message-id/CAA4eK1JYFXEoFhJAvg1qU%3DnZrZLw_87X%3D2YWQGFBbcBGirAUwA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAHut%2BPvxL8tJ2ZUpEjkbRFe6qKSH%2Br54BQ7wM8p%3D335tUbuXbg%40mail.gmail.com

Best regards,
Hou zj

From "houzj.fnst@fujitsu.com":

On Wed, Oct 12, 2022 at 18:11 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for v36-0001.

Thanks for your comments.

> ======
> 
> 1. GENERAL
> 
> Houzj wrote ([1] #3a):
> The word 'streaming' is the same as the actual option name, so 
> personally I think it's fine. But if others also agreed that the name 
> can be improved, I can change it.
> 
> ~
> 
> Sure, I was not really complaining that the name is "wrong". Only I 
> did not think it was a good idea to have multiple struct members 
> called 'streaming' when they don't have the same meaning. e.g. one is 
> the internal character mode equivalent of the parameter, and one is 
> the parameter value as a string. That's why I thought they should be 
> different names. e.g. Make the 2nd one 'streaming_valstr' or 
> something.

Changed.

> ======
> 
> 2. doc/src/sgml/config.sgml
> 
> Previously I suggested there should be xrefs to the "Configuration 
> Settings" page but Houzj wrote ([1] #4):
> Not sure about this as we don't have similar thing in the document of 
> max_logical_replication_workers and max_sync_workers_per_subscription.
> 
> ~
> 
> Fair enough, but IMO perhaps all those others should also xref to the 
> "Configuration Settings" chapter. So if such a change does not belong 
> in this patch, then how about if I make another independent thread to 
> post this suggestion?

Sure, I feel it would be better to do it in a separate thread.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 
> 3. parallel_apply_find_worker
> 
> +parallel_apply_find_worker(TransactionId xid)
> +{
> + bool found;
> + ParallelApplyWorkerEntry *entry = NULL;
> +
> + if (!TransactionIdIsValid(xid))
> + return NULL;
> +
> + if (ParallelApplyWorkersHash == NULL)
> + return NULL;
> +
> + /* Return the cached parallel apply worker if valid. */
> + if (stream_apply_worker != NULL)
> + return stream_apply_worker;
> +
> + /*
> + * Find entry for requested transaction.
> + */
> + entry = hash_search(ParallelApplyWorkersHash, &xid, HASH_FIND, &found);
> 
> In function parallel_apply_start_worker() you removed the entry 
> assignment to NULL because it is never needed. Can do the same here 
> too.

Changed.

> 4. parallel_apply_free_worker
> 
> +/*
> + * Remove the parallel apply worker entry from the hash table. And 
> +stop the
> + * worker if there are enough workers in the pool. For more 
> +information about
> + * the worker pool, see comments atop worker.c.
> + */
> +void
> +parallel_apply_free_worker(ParallelApplyWorkerInfo *winfo, 
> +TransactionId
> xid)
> 
> "And stop" -> "Stop"

Changed.

> 5. parallel_apply_free_worker
> 
> + * Although some error messages may be lost in rare scenarios, but
> + * since the parallel apply worker has finished processing the
> + * transaction, and error messages may be lost even if we detach the
> + * error queue after terminating the process. So it should be ok.
> + */
> 
> SUGGESTION (minor rewording)
> Some error messages may be lost in rare scenarios, but it should be OK 
> because the parallel apply worker has finished processing the 
> transaction, and error messages may be lost even if we detached the 
> error queue after terminating the process.

Changed.


> ~~~
> 
> 7. LogicalParallelApplyLoop
> 
> Previously I suggested maybe the name (e.g. the 2nd param) should be 
> changed to "ParallelApplyMessageContext"? Houzj wrote ([1] #13): Not 
> sure about this, because ApplyMessageContext is used in both worker.c 
> and applyparallelworker.c.
> 
> ~
> 
> But I thought those are completely independent ApplyMessageContext's 
> in different processes that happen to have the same name. Shouldn't 
> they have a name appropriate to who owns them?

ApplyMessageContext is used by the begin_replication_step() function, which will
be invoked in both the leader and the parallel apply worker. So, we need to keep
the memory context name as ApplyMessageContext; otherwise we would need to
modify the logic of begin_replication_step() to use another memory context when
running in a parallel apply worker.


> ~~~
> 
> 8. ParallelApplyWorkerMain
> 
> + /*
> + * Allocate the origin name in a long-lived context for error context
> + * message.
> + */
> + snprintf(originname, sizeof(originname), "pg_%u", 
> + MySubscription->oid);
> 
> Now that ReplicationOriginNameForLogicalRep patch is pushed [2] please 
> make use of this common function.

Changed.
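
(Roughly, the snprintf() is replaced with something like the following; a
sketch only, assuming the helper's signature from commit 776e1c8 and passing
InvalidOid for the relation since this is the apply-worker origin:)

    ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
                                       originname, sizeof(originname));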

> ~~~
> 
> 9. HandleParallelApplyMessage
> 
> + case 'X': /* Terminate, indicating clean exit */
> + {
> + shm_mq_detach(winfo->error_mq_handle);
> + winfo->error_mq_handle = NULL;
> + break;
> + }
> +
> + /*
> + * Don't need to do anything about NoticeResponse and
> + * NotifyResponse as the logical replication worker doesn't need
> + * to send messages to the client.
> + */
> + case 'N':
> + case 'A':
> + break;
> + default:
> + {
> + elog(ERROR, "unrecognized message type received from parallel apply worker: %c (message length %d bytes)",
> + msgtype, msg->len);
> + }
> 
> 9a. case 'X':
> There are no variable declarations here so the statement block {} is 
> not needed
> 
> ~
> 
> 9b. default:
> There are no variable declarations here so the statement block {} is 
> not needed

Changed.

> ~~~
> 
> 10. parallel_apply_stream_abort
> 
> + int i;
> + bool found = false;
> + char spname[MAXPGPATH];
> +
> + parallel_apply_savepoint_name(MySubscription->oid, subxid, spname,
> +   sizeof(spname));
> 
> I posted about using NAMEDATALEN in a previous review ([3] #21) but I 
> think only one place was fixed and this one was missed.

Changed.
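
(Roughly, that just means the declaration becomes, as sketched:

    char        spname[NAMEDATALEN];
)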

> ======
> 
> src/backend/replication/logical/launcher.c
> 
> 12. logicalrep_worker_launch
> 
> Previously I suggested may the apply process name should change
> 
> FROM
> "logical replication worker for subscription %u"
> TO
> "logical replication apply worker for subscription %u"
> 
> and Houz wrote ([1] #13)
> I am not sure if it's a good idea to change existing process description.
> 
> ~
> 
> But that seems inconsistent to me because elsewhere this patch is 
> already exposing the name to the user (like when it says "logical 
> replication apply worker for subscription \"%s\" has started".
> Shouldn’t the process name match these logs?

Changed.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 13. apply_handle_stream_start
> 
> + *
> + * XXX We can avoid sending pairs of the START messages to the parallel worker
> + * because unlike apply worker it will process only one transaction-at-a-time.
> + * However, it is not clear whether that is worth the effort because it is sent
> + * after logical_decoding_work_mem changes.
>   */
>  static void
>  apply_handle_stream_start(StringInfo s)
> 
> 13a.
> "transaction-at-a-time." -> "transaction at a time."
> 
> ~
> 
> 13b.
> I was not sure what does that last sentence mean? Does it mean something like:
> "However, it is not clear whether doing this is worth the effort 
> because pairs of START messages occur only after 
> logical_decoding_work_mem changes."

=>13a.
Changed.

> ~~~
> 
> 14. apply_handle_stream_start
> 
> + ParallelApplyWorkerInfo *winfo = NULL;
> 
> The declaration *winfo assignment to NULL is not needed because 
> get_transaction_apply_action will always do this anyway.

Changed.

> ~~~
> 
> 15. apply_handle_stream_start
> 
> +
> + case TRANS_PARALLEL_APPLY:
> + break;
> 
> I had previously suggested this include a comment explaining why there 
> is nothing to do ([3] #44), but I think there was no reply.

The parallel apply worker doesn't need any special handling for STREAM START;
it only needs to run the common code path that is shared with the leader.
I added a small comment about this.
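
(Roughly like this sketch:

    case TRANS_PARALLEL_APPLY:

        /*
         * Nothing else to do here; the parallel apply worker just runs the
         * common code path shared with the leader.
         */
        break;
)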

> ~~~
> 
> 20. ApplyWorkerMain
> 
> + /*
> + * We don't currently need any ResourceOwner in a walreceiver process, but
> + * if we did, we could call CreateAuxProcessResourceOwner here.
> + */
> 
> Previously I suggested prefixing this as "XXX" and Houzj replied ([1] #48):
> I am not sure as this comment is just a reminder.
> 
> ~
> 
> OK, then maybe since it is a reminder "Note" then it should be changed:
> "We don't currently..." -> "Note: We don't currently..."

I feel it's fine to leave the comment as-is, since that's the existing comment
in ApplyWorkerMain().

> ~~~
> 
> 21. ApplyWorkerMain
> 
> + if (server_version >= 160000 &&
> + MySubscription->stream == SUBSTREAM_PARALLEL)
> + {
> + options.proto.logical.streaming = pstrdup("parallel");
> + MyLogicalRepWorker->parallel_apply = true;
> + }
> + else if (server_version >= 140000 &&
> + MySubscription->stream != SUBSTREAM_OFF)
> + {
> + options.proto.logical.streaming = pstrdup("on");
> + MyLogicalRepWorker->parallel_apply = false;
> + }
> + else
> + {
> + options.proto.logical.streaming = NULL;
> + MyLogicalRepWorker->parallel_apply = false;
> + }
> 
> I think the block of if/else is only for assigning the 
> streaming/parallel members so should have some comment to say that:
> 
> SUGGESTION
> Assign the appropriate streaming flag according to the 'streaming'
> mode and the publisher's ability to support that mode.

Added the comments as suggested.

> ~~~
> 
> 22. get_transaction_apply_action
> 
> +static TransApplyAction
> +get_transaction_apply_action(TransactionId xid,
> ParallelApplyWorkerInfo **winfo)
> +{
> + *winfo = NULL;
> +
> + if (am_parallel_apply_worker())
> + {
> + return TRANS_PARALLEL_APPLY;
> + }
> + else if (in_remote_transaction)
> + {
> + return TRANS_LEADER_APPLY;
> + }
> +
> + /*
> + * Check if we are processing this transaction using a parallel apply
> + * worker and if so, send the changes to that worker.
> + */
> + else if ((*winfo = parallel_apply_find_worker(xid)))
> + {
> + return TRANS_LEADER_SEND_TO_PARALLEL;
> + }
> + else
> + {
> + return TRANS_LEADER_SERIALIZE;
> + }
> +}
> 
> 22b.
> Also previously I had suggested
> 
> > Can a tablesync worker ever get here? It might be better to 
> > Assert(!am_tablesync_worker()); at top of this function?
> 
> and Houzj ([1] #52b) replied:
> Not sure if it's necessary or not.
> 
> ~
> 
> OTOH you could say no Assert is ever really necessary, but IMO adding 
> one here would at least be a sanity check and help to document the 
> function better.

get_transaction_apply_action might also be invoked in a table sync worker in some
rare cases, when a streaming transaction arrives while doing the table sync.
And the function works fine in that case, so I don't think we should add the
Assert() here.

Best regards,
Hou zj


Hi, here are my review comments for patch v38-0001.

======

.../replication/logical/applyparallelworker.c

1. parallel_apply_start_worker

+ /* Try to get a free parallel apply worker. */
+ foreach(lc, ParallelApplyWorkersList)
+ {
+ ParallelApplyWorkerInfo *tmp_winfo;
+
+ tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
+
+ if (!tmp_winfo->in_use)
+ {
+ /* Found a worker that has not been assigned a transaction. */
+ winfo = tmp_winfo;
+ break;
+ }
+ }

The "Found a worker..." comment seems redundant because it's already
clear from the prior comment and the 'in_use' member what this code is
doing.

~~~

2. LogicalParallelApplyLoop

+ void    *data;
+ Size len;
+ int c;
+ int rc;
+ StringInfoData s;
+ MemoryContext oldctx;

Several of these vars (like 'c', 'rc', 's') can be declared deeper -
e.g. only in the scope where they are actually used.

~~~

3.

+ /* Ensure we are reading the data into our memory context. */
+ oldctx = MemoryContextSwitchTo(ApplyMessageContext);

Doesn't something need to switch back to this 'oldctx' prior to
breaking out of the for(;;) loop?

~~~

4.

+ apply_dispatch(&s);
+
+ MemoryContextReset(ApplyMessageContext);

Isn't this broken now? Since you've removed the
MemoryContextSwitchTo(oldctx), so next iteration will switch to
ApplyMessageContext again which will overwrite and lose knowledge of
the original 'oldctx' (??)

~~

5.

Maybe this is a silly idea, I'm not sure. Because this is an infinite
loop, then instead of the multiple calls to
MemoryContextReset(ApplyMessageContext) maybe there can be just a
single call to it immediately before you switch to that context in the
first place. The effect will be the same, won't it?

e.g.
+ /* Ensure we are reading the data into our memory context. */
+ MemoryContextReset(ApplyMessageContext); <=== THIS
+ oldctx = MemoryContextSwitchTo(ApplyMessageContext);

~~~

6.

The code logic keeps flip-flopping for several versions. I think if
you are going to check all the return types of shm_mq_receive then
using a switch(shmq_res) might be a better way than having multiple
if/else with some Asserts.
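
e.g. a skeleton something like this (sketch only; shm_mq_receive can return
SHM_MQ_SUCCESS, SHM_MQ_WOULD_BLOCK, or SHM_MQ_DETACHED):

    switch (shmq_res)
    {
        case SHM_MQ_SUCCESS:
            /* dispatch the received message */
            break;

        case SHM_MQ_WOULD_BLOCK:
            /* nothing available yet, so wait on the latch */
            break;

        case SHM_MQ_DETACHED:
            /* the leader apply worker has gone away */
            break;
    }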

======

src/backend/replication/logical/launcher.c

7. logicalrep_worker_launch

Previously I'd suggested ([1] #12) that the process name should change
for consistency, and AFAIK Amit also said [2] that would be OK, but
this change is still not done in the current patch.

======

src/backend/replication/logical/worker.c

8. should_apply_changes_for_rel

 * Should this worker apply changes for given relation.
 *
 * This is mainly needed for initial relation data sync as that runs in
 * separate worker process running in parallel and we need some way to skip
 * changes coming to the main apply worker during the sync of a table.

This existing comment refers to the "main apply worker". IMO it should
say "leader apply worker" to keep all the terminology consistent.

~~~

9. apply_handle_stream_start

+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether that is worth the effort because it
+ * is sent after logical_decoding_work_mem changes.
  */
 static void
 apply_handle_stream_start(StringInfo s)

As previously mentioned ([1] #13b) it's not obvious to me what that
last sentence means. e.g. "because it is sent"  - what is "it"?

~~~

10. ApplyWorkerMain

else
{
/* This is main apply worker */
RepOriginId originid;
TimeLineID startpointTLI;
char    *err;

Same as #8. IMO it should now say "leader apply worker" to keep all
the terminology consistent.

~~~

11.

+ /*
+ * Assign the appropriate streaming flag according to the 'streaming' mode
+ * and the publisher's ability to support that mode.
+ */

Maybe "streaming flag" ->  "streaming string/flag". (sorry, it was my
bad suggestion last time)

~~~

12. get_transaction_apply_action

I still felt like there should be some tablesync checks/comments in
this function, just for sanity, even if it works as-is now.

For example, are you saying ([3] #22b) that there might be rare cases
where a Tablesync would call to parallel_apply_find_worker? That seems
strange, given that "for streaming transactions that are being applied
in the parallel ... we disallow applying changes on a table that is
not in the READY state".

------
[1] My v36 review -
https://www.postgresql.org/message-id/CAHut%2BPvxL8tJ2ZUpEjkbRFe6qKSH%2Br54BQ7wM8p%3D335tUbuXbg%40mail.gmail.com
[2] Amit's feedback for my v36 review -
https://www.postgresql.org/message-id/CAA4eK1%2BOyQ8-psruZZ0sYff5KactTHZneR-cfsHd%2Bn%2BN7khEKQ%40mail.gmail.com
[3] Hou's feedback for my v36 review -
https://www.postgresql.org/message-id/OS0PR01MB57162232BF51A09F4BD13C7594249%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



From "houzj.fnst@fujitsu.com":

On Tuesday, October 18, 2022 10:36 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Hi, here are my review comments for patch v38-0001.

Thanks for the comments.

> ~~~
> 
> 12. get_transaction_apply_action
> 
> I still felt like there should be some tablesync checks/comments in
> this function, just for sanity, even if it works as-is now.
> 
> For example, are you saying ([3] #22b) that there might be rare cases
> where a Tablesync would call to parallel_apply_find_worker? That seems
> strange, given that "for streaming transactions that are being applied
> in the parallel ... we disallow applying changes on a table that is
> not in the READY state".
> 
> ------

I think because we won't try to start a parallel apply worker in a table sync
worker (see the check in parallel_apply_can_start()), we won't find any
worker in parallel_apply_find_worker(), which means get_transaction_apply_action
will return TRANS_LEADER_SERIALIZE. And get_transaction_apply_action is a
function which can be invoked for all kinds of workers (the same is true for all
apply_handle_xxx functions), so I am not sure if a table sync check/comment is
necessary.

Best regards,
Hou zj

On Tue, Oct 18, 2022 at 8:06 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi, here are my review comments for patch v38-0001.
>
> 3.
>
> + /* Ensure we are reading the data into our memory context. */
> + oldctx = MemoryContextSwitchTo(ApplyMessageContext);
>
> Doesn't something need to switch back to this 'oldctx' prior to
> breaking out of the for(;;) loop?
>
> ~~~
>
> 4.
>
> + apply_dispatch(&s);
> +
> + MemoryContextReset(ApplyMessageContext);
>
> Isn't this broken now? Since you've removed the
> MemoryContextSwitchTo(oldctx), so next iteration will switch to
> ApplyMessageContext again which will overwrite and lose knowledge of
> the original 'oldctx' (??)
>
> ~~
>
> 5.
>
> Maybe this is a silly idea, I'm not sure. Because this is an infinite
> loop, then instead of the multiple calls to
> MemoryContextReset(ApplyMessageContext) maybe there can be just a
> single call to it immediately before you switch to that context in the
> first place. The effect will be the same, won't it?
>

I think so, but it will look a bit odd, especially for the
first iteration. If the purpose is to just do it once, won't it be better
to do it at the end of the for loop?

>
> 9. apply_handle_stream_start
>
> + *
> + * XXX We can avoid sending pairs of the START/STOP messages to the parallel
> + * worker because unlike apply worker it will process only one transaction at a
> + * time. However, it is not clear whether that is worth the effort because it
> + * is sent after logical_decoding_work_mem changes.
>   */
>  static void
>  apply_handle_stream_start(StringInfo s)
>
> As previously mentioned ([1] #13b) it's not obvious to me what that
> last sentence means. e.g. "because it is sent"  - what is "it"?
>

Here, it refers to START/STOP messages, so I think we should say "...
because these messages are sent .." instead of "... because it is sent
...". Does that makes sense to you?

-- 
With Regards,
Amit Kapila.



On Thu, Oct 6, 2022 at 1:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>

> While looking at v35 patch, I realized that there are some cases where
> the logical replication gets stuck depending on partitioned table
> structure. For instance, there are following tables, publication, and
> subscription:
>
> * On publisher
> create table p (c int) partition by list (c);
> create table c1 partition of p for values in (1);
> create table c2 (c int);
> create publication test_pub for table p, c1, c2 with
> (publish_via_partition_root = 'true');
>
> * On subscriber
> create table p (c int) partition by list (c);
> create table c1 partition of p for values In (2);
> create table c2 partition of p for values In (1);
> create subscription test_sub connection 'port=5551 dbname=postgres'
> publication test_pub with (streaming = 'parallel', copy_data =
> 'false');
>
> Note that while both the publisher and the subscriber have the same
> name tables the partition structure is different and rows go to a
> different table on the subscriber (eg, row c=1 will go to c2 table on
> the subscriber). If two concurrent transactions are executed as follows,
> the apply worker (i.e., the leader apply worker) waits for a lock on c2
> held by its parallel apply worker:
>
> * TX-1
> BEGIN;
> INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are streamed
>
>     * TX-2
>     BEGIN;
>     TRUNCATE c2; --- wait for a lock on c2
>
> * TX-1
> INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
> COMMIT;
>
> This might not be a common case in practice but it could mean that
> there is a restriction on how partitioned tables should be structured
> on the publisher and the subscriber when using streaming = 'parallel'.
> When this happens, since the logical replication cannot move forward
> the users need to disable parallel-apply mode or increase
> logical_decoding_work_mem. We could describe this limitation in the
> doc but it would be hard for users to detect problematic table
> structure.

Interesting case.  So I think the root of the problem is the same as
what we have when a column is marked unique on the subscriber but not
on the publisher.  In short, two transactions which are independent of
each other on the publisher are dependent on each other on the
subscriber side because the table definition is different on the
subscriber.  So can't we handle this case in the same way, by marking
this table unsafe for parallel-apply?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Oct 18, 2022 at 5:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 6, 2022 at 1:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
>
> > While looking at v35 patch, I realized that there are some cases where
> > the logical replication gets stuck depending on partitioned table
> > structure. For instance, there are following tables, publication, and
> > subscription:
> >
> > * On publisher
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values in (1);
> > create table c2 (c int);
> > create publication test_pub for table p, c1, c2 with
> > (publish_via_partition_root = 'true');
> >
> > * On subscriber
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values In (2);
> > create table c2 partition of p for values In (1);
> > create subscription test_sub connection 'port=5551 dbname=postgres'
> > publication test_pub with (streaming = 'parallel', copy_data =
> > 'false');
> >
> > Note that while both the publisher and the subscriber have the same
> > name tables the partition structure is different and rows go to a
> > different table on the subscriber (eg, row c=1 will go to c2 table on
> > the subscriber). If two concurrent transactions are executed as follows,
> > the apply worker (i.e., the leader apply worker) waits for a lock on c2
> > held by its parallel apply worker:
> >
> > * TX-1
> > BEGIN;
> > INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are streamed
> >
> >     * TX-2
> >     BEGIN;
> >     TRUNCATE c2; --- wait for a lock on c2
> >
> > * TX-1
> > INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
> > COMMIT;
> >
> > This might not be a common case in practice but it could mean that
> > there is a restriction on how partitioned tables should be structured
> > on the publisher and the subscriber when using streaming = 'parallel'.
> > When this happens, since the logical replication cannot move forward
> > the users need to disable parallel-apply mode or increase
> > logical_decoding_work_mem. We could describe this limitation in the
> > doc but it would be hard for users to detect problematic table
> > structure.
>
> Interesting case.  So I think the root of the problem is the same as
> what we have for a column is marked unique to the subscriber but not
> to the publisher.  In short, two transactions which are independent of
> each other on the publisher are dependent on each other on the
> subscriber side because table definition is different on the
> subscriber.  So can't we handle this case in the same way by marking
> this table unsafe for parallel-apply?
>

Yes, we can do that. I think Hou-San has already dealt with it that way in his
latest patch [1]. See his response in the email [1]: "Disallow
replicating from or to a partitioned table in parallel streaming
mode".

[1] -
https://www.postgresql.org/message-id/OS0PR01MB57160760B34E1655718F4D1994249%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



From "houzj.fnst@fujitsu.com":

On Tuesday, October 18, 2022 10:36 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Hi, here are my review comments for patch v38-0001.

Thanks for your comments.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 1. parallel_apply_start_worker
> 
> + /* Try to get a free parallel apply worker. */
> + foreach(lc, ParallelApplyWorkersList)
> + {
> + ParallelApplyWorkerInfo *tmp_winfo;
> +
> + tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
> +
> + if (!tmp_winfo->in_use)
> + {
> + /* Found a worker that has not been assigned a transaction. */
> + winfo = tmp_winfo;
> + break;
> + }
> + }
> 
> The "Found a worker..." comment seems redundant because it's already 
> clear from the prior comment and the 'in_use' member what this code is 
> doing.

Removed.

> ~~~
> 
> 2. LogicalParallelApplyLoop
> 
> + void    *data;
> + Size len;
> + int c;
> + int rc;
> + StringInfoData s;
> + MemoryContext oldctx;
> 
> Several of these vars (like 'c', 'rc', 's') can be declared deeper - 
> e.g. only in the scope where they are actually used.

Changed.

> ~~~
> 
> 3.
> 
> + /* Ensure we are reading the data into our memory context. */
> + oldctx = MemoryContextSwitchTo(ApplyMessageContext);
> 
> Doesn't something need to switch back to this 'oldctx' prior to 
> breaking out of the for(;;) loop?
> 
> ~~~
> 
> 4.
> 
> + apply_dispatch(&s);
> +
> + MemoryContextReset(ApplyMessageContext);
> 
> Isn't this broken now? Since you've removed the 
> MemoryContextSwitchTo(oldctx), so next iteration will switch to 
> ApplyMessageContext again which will overwrite and lose knowledge of 
> the original 'oldctx' (??)

Sorry for the miss, fixed.

> ~~
> 
> 5.
> 
> Maybe this is a silly idea, I'm not sure. Because this is an infinite 
> loop, then instead of the multiple calls to
> MemoryContextReset(ApplyMessageContext) maybe there can be just a 
> single call to it immediately before you switch to that context in the 
> first place. The effect will be the same, won't it?
> 
> e.g.
> + /* Ensure we are reading the data into our memory context. */
> + MemoryContextReset(ApplyMessageContext); <=== THIS
> + oldctx = MemoryContextSwitchTo(ApplyMessageContext);

In the SHM_MQ_WOULD_BLOCK branch we invoke WaitLatch, so I feel we'd better
reset the memory context before waiting, to avoid keeping no-longer-useful
memory around for longer (although it doesn't matter too much in practice).
So, I didn't change this for now.

> ~~~
> 
> 6.
> 
> The code logic keeps flip-flopping for several versions. I think if 
> you are going to check all the return types of shm_mq_receive then 
> using a switch(shmq_res) might be a better way than having multiple 
> if/else with some Asserts.

Changed.

> ======
> 
> src/backend/replication/logical/launcher.c
> 
> 7. logicalrep_worker_launch
> 
> Previously I'd suggested ([1] #12) that the process name should change 
> for consistency, and AFAIK Amit also said [2] that would be OK, but 
> this change is still not done in the current patch.

Changed.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 8. should_apply_changes_for_rel
> 
>  * Should this worker apply changes for given relation.
>  *
>  * This is mainly needed for initial relation data sync as that runs 
> in
>  * separate worker process running in parallel and we need some way to 
> skip
>  * changes coming to the main apply worker during the sync of a table.
> 
> This existing comment refers to the "main apply worker". IMO it should 
> say "leader apply worker" to keep all the terminology consistent.

Changed.

> ~~~
> 
> 9. apply_handle_stream_start
> 
> + *
> + * XXX We can avoid sending pairs of the START/STOP messages to the parallel
> + * worker because unlike apply worker it will process only one transaction at a
> + * time. However, it is not clear whether that is worth the effort because it
> + * is sent after logical_decoding_work_mem changes.
>   */
>  static void
>  apply_handle_stream_start(StringInfo s)
> 
> As previously mentioned ([1] #13b) it's not obvious to me what that 
> last sentence means. e.g. "because it is sent"  - what is "it"?

Changed as Amit's suggestion in [1].

> ~~~
> 
> 11.
> 
> + /*
> + * Assign the appropriate streaming flag according to the 'streaming' mode
> + * and the publisher's ability to support that mode.
> + */
> 
> Maybe "streaming flag" ->  "streaming string/flag". (sorry, it was my 
> bad suggestion last time)

Improved.

Attach the version patch set.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BqwbD419%3DKgRTLRVj5zQhbM%3Dbfi-cvWG3HkORktb4-YA%40mail.gmail.com

Best Regards
Hou Zhijie

From "kuroda.hayato@fujitsu.com":

Dear Hou,

Thanks for updating the patch! The following are my comments.

===
01. applyparallelworker.c - SIZE_STATS_MESSAGE

```
/*
 * There are three fields in each message received by the parallel apply
 * worker: start_lsn, end_lsn and send_time. Because we have updated these
 * statistics in the leader apply worker, we can ignore these fields in the
 * parallel apply worker (see function LogicalRepApplyLoop).
 */
#define SIZE_STATS_MESSAGE (2 * sizeof(XLogRecPtr) + sizeof(TimestampTz))
```

According to other comment styles, it seems that the first sentence of the comment should
represent the datatype and usage, not the detailed reason.
For example, about ParallelApplyWorkersList, you said "A list ...". How about adding something like the following message:
The message size that can be skipped by the parallel apply worker


~~~
02. applyparallelworker.c - parallel_apply_start_subtrans

```
    if (current_xid != top_xid &&
        !list_member_xid(subxactlist, current_xid))
```

A macro TransactionIdEquals is defined in access/transam.h. Should we use it, or is it too trivial?
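
I.e. the check would become something like this (sketch):

```
    if (!TransactionIdEquals(current_xid, top_xid) &&
        !list_member_xid(subxactlist, current_xid))
```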


~~~
03. applyparallelworker.c - LogicalParallelApplyLoop

```
            case SHM_MQ_WOULD_BLOCK:
                {
                    int            rc;

                    if (!in_streamed_transaction)
                    {
                        /*
                         * If we didn't get any transactions for a while there might be
                         * unconsumed invalidation messages in the queue, consume them
                         * now.
                         */
                        AcceptInvalidationMessages();
                        maybe_reread_subscription();
                    }

                    MemoryContextReset(ApplyMessageContext);
```

Is MemoryContextReset() needed? IIUC no one uses ApplyMessageContext if we reach here.


~~~
04. applyparallelworker.c - HandleParallelApplyMessages

```
        else if (res == SHM_MQ_SUCCESS)
        {
            StringInfoData msg;

            initStringInfo(&msg);
            appendBinaryStringInfo(&msg, data, nbytes);
            HandleParallelApplyMessage(winfo, &msg);
            pfree(msg.data);
        }
```

In LogicalParallelApplyLoop(), appendBinaryStringInfo() is not used;
the StringInfoData is initialized directly instead. Why is there a difference?
The function will do repalloc() and memcpy(), so it may be inefficient.
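
For reference, the direct-initialization style avoids the copy by pointing
the StringInfoData at the received buffer, roughly like this (a sketch; the
field values are as I recall them from the existing apply loop code):

```
    StringInfoData s;

    s.cursor = 0;
    s.maxlen = -1;
    s.data = (char *) data;
    s.len = len;
```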


~~~
05. applyparallelworker.c - parallel_apply_send_data

```
    if (result != SHM_MQ_SUCCESS)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("could not send data to shared-memory queue")));

```

I checked the enumeration of shm_mq_result, and I felt that shm_mq_send(nowait = false) fails
only when the opposite process has exited.
How about adding a hint or detail message like "lost connection to parallel apply worker"?


===
06. worker.c - nchanges

```
/*
 * The number of changes sent to parallel apply workers during one streaming
 * block.
 */
static uint32 nchanges = 0;
```

I found that the name "nchanges" has been already used in apply_spooled_messages().
It works well because the local variable is always used
when name collision between local and global variables is occurred, but I think it may be confused.


~~~
07. worker.c - apply_handle_commit_internal

I think we can add an assertion like Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr &&
replorigin_session_origin != InvalidRepOriginId)
to avoid missing replorigin_session_setup. Previously it was set at the entry point and never reset.


~~~
08. worker.c - apply_handle_prepare_internal

Same as above.


~~~
09. worker.c - maybe_reread_subscription

```
    /*
     * Exit if any parameter that affects the remote connection was changed.
     * The launcher will start a new worker.
     */
    if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
        strcmp(newsub->name, MySubscription->name) != 0 ||
        strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
        newsub->binary != MySubscription->binary ||
        newsub->stream != MySubscription->stream ||
        strcmp(newsub->origin, MySubscription->origin) != 0 ||
        newsub->owner != MySubscription->owner ||
        !equal(newsub->publications, MySubscription->publications))
    {
        ereport(LOG,
                (errmsg("logical replication apply worker for subscription \"%s\" will restart because of a parameter
change",
                        MySubscription->name)));

        proc_exit(0);
    }
```

When the parallel apply worker has been launched and then the subscription option has been modified,
the same message will appear twice.
But if the option "streaming" is changed from "parallel" to "on", one of them will not restart again.
Should we modify the message?


===
10. general

IIUC parallel apply workers cannot detect a deadlock automatically, right?
I thought we might be able to use a heartbeat protocol between the leader worker and the parallel workers.

You have already implemented a mechanism to send and receive messages between workers.
My idea is that each parallel apply worker records the timestamp at which it last got a message from the leader,
and if a certain time (30s?) has passed, it sends a heartbeat message like 'H'.
The leader consumes 'H' and sends a reply like LOGICAL_REP_MSG_KEEPALIVE in HandleParallelApplyMessage().
If the parallel apply worker does not receive any message for more than one minute,
it regards that as a deadlock, and can set the retry flag and exit.

The above assumes that the leader cannot reply to the message while waiting for the lock.
Moreover, it may have notable overhead and we must use a new logical replication message type.

What do you think? Have you already considered this?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Tue, Oct 18, 2022 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Interesting case.  So I think the root of the problem is the same as
> > what we have for a column is marked unique to the subscriber but not
> > to the publisher.  In short, two transactions which are independent of
> > each other on the publisher are dependent on each other on the
> > subscriber side because table definition is different on the
> > subscriber.  So can't we handle this case in the same way by marking
> > this table unsafe for parallel-apply?
> >
>
> Yes, we can do that. I think Hou-San has already dealt that way in his
> latest patch [1]. See his response in the email [1]: "Disallow
> replicating from or to a partitioned table in parallel streaming
> mode".
>
> [1] -
https://www.postgresql.org/message-id/OS0PR01MB57160760B34E1655718F4D1994249%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Okay, somehow I missed the latest email.  I will look into it soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Hi, here are my review comments for the patch v39-0001

======

src/backend/libpq/pqmq.c

1. mq_putmessage

+ if (IsParallelWorker())
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ else
+ {
+ Assert(IsLogicalParallelApplyWorker());
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_APPLY_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ }

The generically named macro (IsParallelWorker) makes it seem like a
parallel apply worker is NOT a kind of parallel worker (e.g. it is in
the 'else'), which seems odd. But I am not sure what you can do to
improve this... e.g. reversing the if/else might look logically saner,
but might also be less efficient for the IsParallelWorker case (??)

======

.../replication/logical/applyparallelworker.c

2. LogicalParallelApplyLoop

+ /* Ensure we are reading the data into our memory context. */
+ (void) MemoryContextSwitchTo(ApplyMessageContext);

Why did you use the (void) cast for this MemoryContextSwitchTo but not
for the next one later in the same function?

~~~

3.

+ if (len == 0)
+ break;

As mentioned in my previous review ([1] #3), we are still in the
ApplyMessageContext here. Shouldn't the code be switching to the
previous context before escaping from the loop?

~~~

4.

+ switch (shmq_res)
+ {
+ case SHM_MQ_SUCCESS:
+ {
+ StringInfoData s;
+ int c;
+
+ if (len == 0)
+ break;

I think this introduces a subtle bug.

IIUC the intent of the "break" when len == 0 is to escape from the
loop. But now, this will only break from the switch case. So, it looks
like you need some kind of loop "done" flag, or maybe have to revert
back to using if/else to fix this.

~~~

5.

+ /*
+ * The first byte of message for additional communication between
+ * leader apply worker and parallel apply workers can only be 'w'.
+ */
+ c = pq_getmsgbyte(&s);

Why does it refer to "additional communication"? Isn’t it enough just
to say something like below:

SUGGESTION
The first byte of messages sent from leader apply worker to parallel
apply workers can only be 'w'.

~~~

src/backend/replication/logical/worker.c

6. apply_handle_stream_start

+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether that is worth the effort because
+ * these messages are sent after logical_decoding_work_mem changes.
  */
 static void
 apply_handle_stream_start(StringInfo s)


I don't know what the "changes" part means. IIUC, the meaning of the
last sentence is like below:

SUGGESTION
However, it is not clear whether any optimization is worthwhile
because these messages are sent only when the
logical_decoding_work_mem threshold is exceeded.

~~~

7. get_transaction_apply_action

> 12. get_transaction_apply_action
>
> I still felt like there should be some tablesync checks/comments in
> this function, just for sanity, even if it works as-is now.
>
> For example, are you saying ([3] #22b) that there might be rare cases
> where a Tablesync would call to parallel_apply_find_worker? That seems
> strange, given that "for streaming transactions that are being applied
> in the parallel ... we disallow applying changes on a table that is
> not in the READY state".
>
> ------

Houz wrote [2] -

I think because we won't try to start parallel apply worker in table sync
worker(see the check in parallel_apply_can_start()), so we won't find any
worker in parallel_apply_find_worker() which means get_transaction_apply_action
will return TRANS_LEADER_SERIALIZE. And get_transaction_apply_action is a
function which can be invoked for all kinds of workers(same is true for all
apply_handle_xxx functions), so not sure if table sync check/comment is
necessary.

~

Sure, and I believe you when you say it all works OK - but IMO there
is something still not quite right with this current code. For
example,

e.g.1 the function will return TRANS_LEADER_SERIALIZE for a Tablesync
worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
that we are in the leader apply worker" (except we are not)

e.g.2 we know for a fact that Tablesync workers cannot start their own
parallel apply workers, so then why do we even let the Tablesync
worker make a call to parallel_apply_find_worker() looking for
something we know will not be found?

------
[1] My review of v38-0001 -
https://www.postgresql.org/message-id/CAHut%2BPsY0aevdVqeCUJOrRQMrwpg5Wz3-Mo%2BbU%3DmCxW2%2B9EBTg%40mail.gmail.com
[2] Houz reply for my review v38 -
https://www.postgresql.org/message-id/OS0PR01MB5716D738A8F27968806957B194289%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



From "houzj.fnst@fujitsu.com":

On Wednesday, October 19, 2022 8:50 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:

Thanks for the comments.

> 03. applyparallelwprker.c - LogicalParallelApplyLoop
> 
> ```
>             case SHM_MQ_WOULD_BLOCK:
>                 {
>                     int            rc;
> 
>                     if (!in_streamed_transaction)
>                     {
>                         /*
>                          * If we didn't get any transactions for a while there might be
>                          * unconsumed invalidation messages in the queue, consume them
>                          * now.
>                          */
>                         AcceptInvalidationMessages();
>                         maybe_reread_subscription();
>                     }
> 
>                     MemoryContextReset(ApplyMessageContext);
> ```
> 
> Is MemoryContextReset() needed? IIUC no one uses ApplyMessageContext if we reach here.

I was concerned that some code at a deeper level might allocate some memory, as
there are lots of functions that could be invoked in the loop (for example,
the functions in ProcessInterrupts()). Although it might not matter in
practice, I think it is better to reset here to make it robust. Besides,
this keeps the code consistent with the logic in LogicalRepApplyLoop.

> 04. applyparallelwprker.c - HandleParallelApplyMessages
> 
> ```
>         else if (res == SHM_MQ_SUCCESS)
>         {
>             StringInfoData msg;
> 
>             initStringInfo(&msg);
>             appendBinaryStringInfo(&msg, data, nbytes);
>             HandleParallelApplyMessage(winfo, &msg);
>             pfree(msg.data);
>         }
> ```
> 
> In LogicalParallelApplyLoop(), appendBinaryStringInfo() is not used but
> initialized StringInfoData directly initialized. Why there is a difrerence? The
> function will do repalloc() and memcpy(), so it may be inefficient.

I think both styles are fine; the code in HandleParallelApplyMessages keeps
consistent with the similar function HandleParallelMessages(), which is not a
performance-sensitive function.


> 05. applyparallelwprker.c - parallel_apply_send_data
> 
> ```
>     if (result != SHM_MQ_SUCCESS)
>         ereport(ERROR,
>                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                  errmsg("could not send data to shared-memory queue")));
> 
> ```
> 
> I checked the enumeration of shm_mq_result, and I felt that shm_mq_send(nowait
> = false) failed only when the opposite process has been exited. How about add a
> hint or detailed message like "lost connection to parallel apply worker"?

Thanks for analyzing, but I am not sure if "lost connection to xx worker" is an
appropriate errhint or detail. The current error message looks clear to me.


> 07. worker.c - apply_handle_commit_internal
> 
> I think we can add an assertion like Assert(replorigin_session_origin_lsn !=
> InvalidXLogRecPtr && replorigin_session_origin = InvalidRepOriginId), to
> avoid missing replorigin_session_setup. Previously it was set at the entry
> point at never reset.

I feel adding the assert for replorigin_session_origin is fine here. For
replorigin_session_origin_lsn, I am not sure if it looks better to check it here,
as we would need to distinguish between the streaming=on and streaming=parallel
cases if we want to check that.


> 10. general
> 
> IIUC parallel apply workers could not detect the deadlock automatically,
> right? I thought we might be able to use the heartbeat protocol between a
> leader worker and parallel workers.
>  
> You have already implemented a mechanism to send and receive messages between
> workers. My idea is that each parallel apply worker records a timestamp that
> gets a message from the leader and if a certain time (30s?) has passed it
> sends a heartbeat message like 'H'. The leader consumes 'H' and sends a reply
> like LOGICAL_REP_MSG_KEEPALIVE in HandleParallelApplyMessage(). If the
> parallel apply worker does not receive any message for more than one minute,
> it regards that the deadlock has occurred and can change the retry flag to on
> and exit.
> 
> The above assumes that the leader cannot reply to the message while waiting
> for the lock. Moreover, it may have notable overhead and we must use a new
> logical replication message type.
> 
> How do you think? Have you already considered about it?

Thanks for the suggestion. But we are trying to detect this kind of problem before
the problematic case happens, and to disallow parallelism in these cases by
checking the unique/constr/partitioned... cases in the 0003/0004 patches.

About the keepalive design: we could do that, but the leader could also be
blocked by some other user backend, so this design might cause the worker to
error out in some unexpected cases, which seems not great.

Best regards,
Hou zj

On Thu, Oct 20, 2022 at 2:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 7. get_transaction_apply_action
>
> > 12. get_transaction_apply_action
> >
> > I still felt like there should be some tablesync checks/comments in
> > this function, just for sanity, even if it works as-is now.
> >
> > For example, are you saying ([3] #22b) that there might be rare cases
> > where a Tablesync would call to parallel_apply_find_worker? That seems
> > strange, given that "for streaming transactions that are being applied
> > in the parallel ... we disallow applying changes on a table that is
> > not in the READY state".
> >
> > ------
>
> Houz wrote [2] -
>
> I think because we won't try to start parallel apply worker in table sync
> worker(see the check in parallel_apply_can_start()), so we won't find any
> worker in parallel_apply_find_worker() which means get_transaction_apply_action
> will return TRANS_LEADER_SERIALIZE. And get_transaction_apply_action is a
> function which can be invoked for all kinds of workers(same is true for all
> apply_handle_xxx functions), so not sure if table sync check/comment is
> necessary.
>
> ~
>
> Sure, and I believe you when you say it all works OK - but IMO there
> is something still not quite right with this current code. For
> example,
>
> e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
> worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
> that we are in the leader apply worker" (except we are not)
>
> e.g.2 we know for a fact that Tablesync workers cannot start their own
> parallel apply workers, so then why do we even let the Tablesync
> worker make a call to parallel_apply_find_worker() looking for
> something we know will not be found?
>

I don't see much benefit in adding an additional check for tablesync
workers here. It will unnecessarily make this part of the code look a
bit ugly.

--
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, October 19, 2022 8:50 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> ===
> 01. applyparallelworker.c - SIZE_STATS_MESSAGE
> 
> ```
> /*
>  * There are three fields in each message received by the parallel apply
>  * worker: start_lsn, end_lsn and send_time. Because we have updated these
>  * statistics in the leader apply worker, we can ignore these fields in the
>  * parallel apply worker (see function LogicalRepApplyLoop).
>  */
> #define SIZE_STATS_MESSAGE (2 * sizeof(XLogRecPtr) + sizeof(TimestampTz))
> ```
> 
> According to other comment styles, it seems that the first sentence of the
> comment should represent the datatype and usage, not the detailed reason.
> For example, about ParallelApplyWorkersList, you said "A list ...". How about
> adding like following message:
> The message size that can be skipped by parallel apply worker

Thanks for the comments, but the current description seems enough to me.

> ~~~
> 02. applyparallelworker.c - parallel_apply_start_subtrans
> 
> ```
>     if (current_xid != top_xid &&
>         !list_member_xid(subxactlist, current_xid)) ```
> 
> A macro TransactionIdEquals is defined in access/transam.h. Should we use it,
> or is it too trivial?

I checked the existing code, and it seems both styles are being used.
Maybe we can post a separate patch to replace them later.
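
(Just for reference, the macro-based form of the quoted check would look like
the following; this is only an illustration, not part of the current patch set.)

```
    if (!TransactionIdEquals(current_xid, top_xid) &&
        !list_member_xid(subxactlist, current_xid))
```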

> ~~~
> 08. worker.c - apply_handle_prepare_internal
> 
> Same as above.
> 
> 
> ~~~
> 09. worker.c - maybe_reread_subscription
> 
> ```
>     /*
>      * Exit if any parameter that affects the remote connection was
> changed.
>      * The launcher will start a new worker.
>      */
>     if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
>         strcmp(newsub->name, MySubscription->name) != 0 ||
>         strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
>         newsub->binary != MySubscription->binary ||
>         newsub->stream != MySubscription->stream ||
>         strcmp(newsub->origin, MySubscription->origin) != 0 ||
>         newsub->owner != MySubscription->owner ||
>         !equal(newsub->publications, MySubscription->publications))
>     {
>         ereport(LOG,
>                 (errmsg("logical replication apply worker for
> subscription \"%s\" will restart because of a parameter change",
>                         MySubscription->name)));
> 
>         proc_exit(0);
>     }
> ```
> 
> When the parallel apply worker has been launched and then the subscription
> option has been modified, the same message will appear twice.
> But if the option "streaming" is changed from "parallel" to "on", one of them
> will not restart again.
> Should we modify message?

Thanks, it seems to be a timing issue: if the leader catches the change first and
stops the parallel workers, the message will only appear once. But I agree we'd
better make the message clear. I changed the message in the parallel apply worker.
While on it, I also adjusted some other messages to include "parallel apply
worker" when they are emitted by a parallel apply worker.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, October 20, 2022 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 20, 2022 at 2:08 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > 7. get_transaction_apply_action
> >
> > > 12. get_transaction_apply_action
> > >
> > > I still felt like there should be some tablesync checks/comments in
> > > this function, just for sanity, even if it works as-is now.
> > >
> > > For example, are you saying ([3] #22b) that there might be rare
> > > cases where a Tablesync would call to parallel_apply_find_worker?
> > > That seems strange, given that "for streaming transactions that are
> > > being applied in the parallel ... we disallow applying changes on a
> > > table that is not in the READY state".
> > >
> > > ------
> >
> > Houz wrote [2] -
> >
> > I think because we won't try to start parallel apply worker in table
> > sync worker(see the check in parallel_apply_can_start()), so we won't
> > find any worker in parallel_apply_find_worker() which means
> > get_transaction_apply_action will return TRANS_LEADER_SERIALIZE. And
> > get_transaction_apply_action is a function which can be invoked for
> > all kinds of workers(same is true for all apply_handle_xxx functions),
> > so not sure if table sync check/comment is necessary.
> >
> > ~
> >
> > Sure, and I believe you when you say it all works OK - but IMO there
> > is something still not quite right with this current code. For
> > example,
> >
> > e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
> > worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
> > that we are in the leader apply worker" (except we are not)
> >
> > e.g.2 we know for a fact that Tablesync workers cannot start their own
> > parallel apply workers, so then why do we even let the Tablesync
> > worker make a call to parallel_apply_find_worker() looking for
> > something we know will not be found?
> >
> 
> I don't see much benefit in adding an additional check for tablesync workers
> here. It will unnecessarily make this part of the code look bit ugly.

Thanks for the review. Here is the new version of the patch set, which addresses
Peter's[1] and Kuroda-san's[2] comments.

[1] https://www.postgresql.org/message-id/CAHut%2BPs0HXawMD%3DzQ5YUncc9kjGy%2Bmd_39Y4Fdf%3DsKjt-LE92g%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB586674C1EE91C06DBACE7728F52B9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best regards,
Hou zj


Attachment
Here are my review comments for v40-0001.

======

src/backend/replication/logical/worker.c


1. should_apply_changes_for_rel

+ else if (am_parallel_apply_worker())
+ {
+ if (rel->state != SUBREL_STATE_READY)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical replication parallel apply worker for subscription
\"%s\" will stop",
+ MySubscription->name),
+ errdetail("Cannot handle streamed replication transaction using parallel "
+    "apply workers until all tables are synchronized.")));

1a.
"transaction" -> "transactions"

1b.
"are synchronized" -> "have been synchronized."

e.g. "Cannot handle streamed replication transactions using parallel
apply workers until all tables have been synchronized."

~~~

2. maybe_reread_subscription

+ if (am_parallel_apply_worker())
+ ereport(LOG,
+ (errmsg("logical replication parallel apply worker for subscription
\"%s\" will "
+ "stop because the subscription was removed",
+ MySubscription->name)));
+ else
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "stop because the subscription was removed",
+ MySubscription->name)));

Maybe there is an easier way to code this instead of if/else and
cut/paste message text:

SUGGESTION

ereport(LOG,
(errmsg("logical replication %s for subscription \"%s\" will stop
because the subscription was removed",
am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
MySubscription->name)));
~~~

3.

+ if (am_parallel_apply_worker())
+ ereport(LOG,
+ (errmsg("logical replication parallel apply worker for subscription
\"%s\" will "
+ "stop because the subscription was disabled",
+ MySubscription->name)));
+ else
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "stop because the subscription was disabled",
+ MySubscription->name)));

These can be combined like comment #2 above

SUGGESTION

ereport(LOG,
(errmsg("logical replication %s for subscription \"%s\" will stop
because the subscription was disabled",
am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
MySubscription->name)));

~~~

4.

+ if (am_parallel_apply_worker())
+ ereport(LOG,
+ (errmsg("logical replication parallel apply worker for subscription
\"%s\" will stop because of a parameter change",
+ MySubscription->name)));
+ else
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\"
will restart because of a parameter change",
+ MySubscription->name)));

These can be combined like comment #2 above

SUGGESTION

ereport(LOG,
(errmsg("logical replication %s for subscription \"%s\" will restart
because of a parameter change",
am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
MySubscription->name)));

~~~~

4. InitializeApplyWorker

+ if (am_parallel_apply_worker())
+ ereport(LOG,
+ (errmsg("logical replication parallel apply worker for subscription
%u will not "
+ "start because the subscription was removed during startup",
+ MyLogicalRepWorker->subid)));
+ else
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription %u will not "
+ "start because the subscription was removed during startup",
+ MyLogicalRepWorker->subid)));

These can be combined like comment #2 above

SUGGESTION

ereport(LOG,
(errmsg("logical replication %s for subscription %u will not start
because the subscription was removed during startup",
am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
MyLogicalRepWorker->subid)));

~~~

5.

+ else if (am_parallel_apply_worker())
+ ereport(LOG,
+ (errmsg("logical replication parallel apply worker for subscription
\"%s\" has started",
+ MySubscription->name)));
  else
  ereport(LOG,
  (errmsg("logical replication apply worker for subscription \"%s\"
has started",
  MySubscription->name)));


The last if/else can be combined same as comment #2 above

SUGGESTION

  else
  ereport(LOG,
  (errmsg("logical replication %s for subscription \"%s\" has started",
am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
MySubscription->name)));

~~~

6. IsLogicalParallelApplyWorker

+bool
+IsLogicalParallelApplyWorker(void)
+{
+ return IsLogicalWorker() && am_parallel_apply_worker();
+}

Patch v40 added the IsLogicalWorker() to the condition, but why is
that extra check necessary?

======

7. src/include/replication/worker_internal.h

+typedef struct ParallelApplyWorkerInfo
+{
+ shm_mq_handle *mq_handle;
+
+ /*
+ * The queue used to transfer messages from the parallel apply worker to
+ * the leader apply worker.
+ */
+ shm_mq_handle *error_mq_handle;

In patch v40 the comment about the NULL error_mq_handle was removed,
but since the code still explicitly sets/checks NULL in different
places, isn't it still better to have some comment here describing
what NULL means?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Oct 21, 2022 at 3:02 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Few comments on the 0001 and 0003 patches:

v40-0001*
==========
1.
+ /*
+ * The queue used to transfer messages from the parallel apply worker to
+ * the leader apply worker.
+ */
+ shm_mq_handle *error_mq_handle;

Shall we say error messages instead of messages?

2.
+/*
+ * Is there a message pending in parallel apply worker which we need to
+ * receive?
+ */
+volatile sig_atomic_t ParallelApplyMessagePending = false;

Can we slightly change above comment to: "Is there a message sent by
parallel apply worker which we need to receive?"

3.
+
+ ThrowErrorData(&edata);
+
+ /* Should not reach here after rethrowing an error. */
+ error_context_stack = save_error_context_stack;

Should we instead do Assert(false) after ThrowErrorData?

4.
+ * apply worker (c) necessary information to be shared among parallel apply
+ * workers and leader apply worker (i.e. in_parallel_apply_xact flag and the
+ * corresponding LogicalRepWorker slot information).

I don't think the comment here needs to say exactly which variables
are shared. Something like "necessary information to synchronize
between parallel apply workers and the leader apply worker" would be enough.

5.
+ * The dynamic shared memory segment will contain (a) a shm_mq that can be
+ * used to send changes in the transaction from leader apply worker to parallel
+ * apply worker (b) another shm_mq that can be used to send errors

In both (a) and (b), instead of "can be", we can use "is".

6.
Note that we cannot skip the streaming transactions when using
+ * parallel apply workers because we cannot get the finish LSN before
+ * applying the changes.

This comment is unclear about what the parallel apply worker does when
the finish LSN is set. We can add something like: "So, we don't start
parallel apply worker when finish LSN is set by the user."

v40-0003
==========
7. The function RelationGetUniqueKeyBitmap() should be defined in
relcache.c next to RelationGetIdentityKeyBitmap().

8.
+RelationGetUniqueKeyBitmap(Relation rel)
{
...
+ if (!rel->rd_rel->relhasindex)
+ return NULL;

It would be better to use "if
(!RelationGetForm(relation)->relhasindex)" so as to be consistent with
similar usage elsewhere in relcache.c.

9. In RelationGetUniqueKeyBitmap(), we must assert that the historic
snapshot is set, as we are not taking a lock on the index rels. The
same is already ensured in RelationGetIdentityKeyBitmap(); is there a
reason to be different here?
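
(i.e., something along the lines of the check that
RelationGetIdentityKeyBitmap() already has; shown here only as an illustration.)

```
	/* Historic snapshot should be set, since we don't lock the index rels. */
	Assert(HistoricSnapshotActive());
```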

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > About your point that having different partition structures for
> > > publisher and subscriber, I don't know how common it will be once we
> > > have DDL replication. Also, the default value of
> > > publish_via_partition_root is false which doesn't seem to indicate
> > > that this is a quite common case.
> >
> > So how can we consider these concurrent issues that could happen only
> > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > the problem or can we have a safeguard against these conflicts?
> >
>
> Yeah, right now the strategy is to disallow parallel apply for such
> cases as you can see in *0003* patch.

Tightening the restrictions could work in some cases, but there might
still be corner cases, and it could reduce the usability. I'm not really
sure that we can ensure such a deadlock won't happen with the current
restrictions. I think we need some safeguard just in case. For
example, if the leader apply worker is waiting for a lock acquired by
its parallel worker, it cancels the parallel worker's transaction,
commits its own transaction, and restarts logical replication. Or the
leader can log the deadlock to let the user know.

>
> > We
> > could find a new problematic scenario in the future and if it happens,
> > logical replication gets stuck, it cannot be resolved only by apply
> > workers themselves.
> >
>
> I think users can change streaming option to on/off and internally the
> parallel apply worker can detect and restart to allow replication to
> proceed. Having said that, I think that would be a bug in the code and
> we should try to fix it. We may need to disable parallel apply in the
> problematic case.
>
> The other ideas that occurred to me in this regard are (a) provide a
> reloption (say parallel_apply) at table level and we can use that to
> bypass various checks like different Unique Key between
> publisher/subscriber, constraints/expressions having mutable
> functions, Foreign Key (when enabled on subscriber), operations on
> Partitioned Table. We can't detect whether those are safe or not
> (primarily because of a different structure in publisher and
> subscriber) so we prohibit parallel apply but if users use this
> option, we can allow it even in those cases.

The parallel apply worker is assigned per transaction, right? If so,
how can we know which tables are modified in the transaction in
advance? And what if two tables, one whose reloption is true and one
whose reloption is false, are modified in the same transaction?

> (b) While enabling the
> parallel option in the subscription, we can try to match all the
> table(s) information of the publisher/subscriber. It will be tricky to
> make this work because say even if match some trigger function name,
> we won't be able to match the function body. The other thing is when
> at a later point the table definition is changed on the subscriber, we
> need to again validate the information between publisher and
> subscriber which I think would be difficult as we would be already in
> between processing some message and getting information from the
> publisher at that stage won't be possible.

Indeed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Oct 24, 2022 at 11:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for v40-0001.
>
> ======
>
> src/backend/replication/logical/worker.c
>
>
> 1. should_apply_changes_for_rel
>
> + else if (am_parallel_apply_worker())
> + {
> + if (rel->state != SUBREL_STATE_READY)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical replication parallel apply worker for subscription
> \"%s\" will stop",
> + MySubscription->name),
> + errdetail("Cannot handle streamed replication transaction using parallel "
> +    "apply workers until all tables are synchronized.")));
>
> 1a.
> "transaction" -> "transactions"
>
> 1b.
> "are synchronized" -> "have been synchronized."
>
> e.g. "Cannot handle streamed replication transactions using parallel
> apply workers until all tables have been synchronized."
>
> ~~~
>
> 2. maybe_reread_subscription
>
> + if (am_parallel_apply_worker())
> + ereport(LOG,
> + (errmsg("logical replication parallel apply worker for subscription
> \"%s\" will "
> + "stop because the subscription was removed",
> + MySubscription->name)));
> + else
> + ereport(LOG,
> + (errmsg("logical replication apply worker for subscription \"%s\" will "
> + "stop because the subscription was removed",
> + MySubscription->name)));
>
> Maybe there is an easier way to code this instead of if/else and
> cut/paste message text:
>
> SUGGESTION
>
> ereport(LOG,
> (errmsg("logical replication %s for subscription \"%s\" will stop
> because the subscription was removed",
> am_parallel_apply_worker() ? "parallel apply worker" : "apply worker",
> MySubscription->name)));
> ~~~
>

If we want to go this way then it may be better to record the
appropriate string beforehand and use that here.
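
Something like the following, perhaps (the variable name is only illustrative):

```
	const char *worker_type = am_parallel_apply_worker()
		? "parallel apply worker"
		: "apply worker";

	ereport(LOG,
			(errmsg("logical replication %s for subscription \"%s\" will stop because the subscription was removed",
					worker_type, MySubscription->name)));
```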

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Oct 21, 2022 at 6:32 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, October 20, 2022 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Oct 20, 2022 at 2:08 PM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > >
> > > 7. get_transaction_apply_action
> > >
> > > > 12. get_transaction_apply_action
> > > >
> > > > I still felt like there should be some tablesync checks/comments in
> > > > this function, just for sanity, even if it works as-is now.
> > > >
> > > > For example, are you saying ([3] #22b) that there might be rare
> > > > cases where a Tablesync would call to parallel_apply_find_worker?
> > > > That seems strange, given that "for streaming transactions that are
> > > > being applied in the parallel ... we disallow applying changes on a
> > > > table that is not in the READY state".
> > > >
> > > > ------
> > >
> > > Houz wrote [2] -
> > >
> > > I think because we won't try to start parallel apply worker in table
> > > sync worker(see the check in parallel_apply_can_start()), so we won't
> > > find any worker in parallel_apply_find_worker() which means
> > > get_transaction_apply_action will return TRANS_LEADER_SERIALIZE. And
> > > get_transaction_apply_action is a function which can be invoked for
> > > all kinds of workers(same is true for all apply_handle_xxx functions),
> > > so not sure if table sync check/comment is necessary.
> > >
> > > ~
> > >
> > > Sure, and I believe you when you say it all works OK - but IMO there
> > > is something still not quite right with this current code. For
> > > example,
> > >
> > > e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
> > > worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
> > > that we are in the leader apply worker" (except we are not)
> > >
> > > e.g.2 we know for a fact that Tablesync workers cannot start their own
> > > parallel apply workers, so then why do we even let the Tablesync
> > > worker make a call to parallel_apply_find_worker() looking for
> > > something we know will not be found?
> > >
> >
> > I don't see much benefit in adding an additional check for tablesync workers
> > here. It will unnecessarily make this part of the code look bit ugly.
>
> Thanks for the review, here is the new version patch set which addressed Peter[1]
> and Kuroda-san[2]'s comments.

I've started to review this patch. I tested v40-0001 patch and have
one question:

IIUC even when most of the changes in the transaction are filtered out
in pgoutput (e.g., by a relation filter or row filter), the walsender
sends STREAM_START. This means that the subscriber could end up
launching parallel apply workers also for almost empty (and streamed)
transactions. For example, I created three subscriptions each of which
subscribes to a different table. When I loaded a large amount of data
into one table, all three (leader) apply workers received START_STREAM
and launched their parallel apply workers. However, two of them
finished without applying any data. I think this behaviour looks
problematic since it wastes workers and rather decreases the apply
performance if the changes are not large. Is it worth considering a
way to delay launching a parallel apply worker until we find out the
amount of changes is actually large? For example, the leader worker
writes the streamed changes to files as usual and launches a parallel
worker if the amount of changes exceeds a threshold or the leader
receives the second segment. After that, the leader worker switches to
sending the streamed changes to parallel workers via shm_mq instead of
files.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



FYI - After a recent push, the v40-0001 patch no longer applies on the
latest HEAD.

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v40-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/backend/replication/logical/launcher.c:54
error: src/backend/replication/logical/launcher.c: patch does not apply

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Oct 25, 2022 at 14:28 PM Peter Smith <smithpb2250@gmail.com> wrote:
> FYI - After a recent push, the v40-0001 patch no longer applies on the
> latest HEAD.
> 
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v40-0001-Perform-streaming-logical-transactions-by-
> parall.patch
> error: patch failed: src/backend/replication/logical/launcher.c:54
> error: src/backend/replication/logical/launcher.c: patch does not apply

Thanks for your reminder.

I just rebased the patch set for review.
The new patch set will be shared later when the comments in this thread are
addressed.

Regards,
Wang wei

Attachment
On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Oct 21, 2022 at 6:32 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
>
> I've started to review this patch. I tested v40-0001 patch and have
> one question:
>
> IIUC even when most of the changes in the transaction are filtered out
> in pgoutput (eg., by relation filter or row filter), the walsender
> sends STREAM_START. This means that the subscriber could end up
> launching parallel apply workers also for almost empty (and streamed)
> transactions. For example, I created three subscriptions each of which
> subscribes to a different table. When I loaded a large amount of data
> into one table, all three (leader) apply workers received START_STREAM
> and launched their parallel apply workers.
>

The apply workers will be launched just the first time; after that we
maintain a pool so that we don't need to restart them.

> However, two of them
> finished without applying any data. I think this behaviour looks
> problematic since it wastes workers and rather decreases the apply
> performance if the changes are not large. Is it worth considering a
> way to delay launching a parallel apply worker until we find out the
> amount of changes is actually large?
>

I think even if the changes are small there may not be much difference,
because we have observed that the performance improvement comes from
not writing to a file.

> For example, the leader worker
> writes the streamed changes to files as usual and launches a parallel
> worker if the amount of changes exceeds a threshold or the leader
> receives the second segment. After that, the leader worker switches to
> send the streamed changes to parallel workers via shm_mq instead of
> files.
>

I think writing to a file won't be a good idea, as that can hamper the
performance benefit in some cases, and I am not sure it is worth it.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Wed, Oct 26, 2022 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 6:32 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > I've started to review this patch. I tested v40-0001 patch and have
> > one question:
> >
> > IIUC even when most of the changes in the transaction are filtered out
> > in pgoutput (eg., by relation filter or row filter), the walsender
> > sends STREAM_START. This means that the subscriber could end up
> > launching parallel apply workers also for almost empty (and streamed)
> > transactions. For example, I created three subscriptions each of which
> > subscribes to a different table. When I loaded a large amount of data
> > into one table, all three (leader) apply workers received START_STREAM
> > and launched their parallel apply workers.
> >
> 
> The apply workers will be launched just the first time then we
> maintain a pool so that we don't need to restart them.
> 
> > However, two of them
> > finished without applying any data. I think this behaviour looks
> > problematic since it wastes workers and rather decreases the apply
> > performance if the changes are not large. Is it worth considering a
> > way to delay launching a parallel apply worker until we find out the
> > amount of changes is actually large?
> >
> 
> I think even if changes are less there may not be much difference
> because we have observed that the performance improvement comes from
> not writing to file.
> 
> > For example, the leader worker
> > writes the streamed changes to files as usual and launches a parallel
> > worker if the amount of changes exceeds a threshold or the leader
> > receives the second segment. After that, the leader worker switches to
> > send the streamed changes to parallel workers via shm_mq instead of
> > files.
> >
> 
> I think writing to file won't be a good idea as that can hamper the
> performance benefit in some cases and not sure if it is worth.
> 

I tried to test some cases where only a small part of the transaction, or an empty
transaction, is sent to the subscriber, to see if using streaming=parallel brings
any performance degradation.

The test was performed ten times, and the average was taken.
The results are as follows. The details and the script of the test are attached.

10% of rows are sent
----------------------------------
HEAD            24.4595
patched         18.4545

5% of rows are sent
----------------------------------
HEAD            21.244
patched         17.9655

0% of rows are sent
----------------------------------
HEAD            18.0605
patched         17.893


It shows that when only 5% or 10% of the rows are sent to the subscriber, using parallel
apply takes less time than HEAD, and even if all rows are filtered out there's no
performance degradation.


Regards
Shi yu

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Oct 27, 2022 at 11:34 AM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:
>
> On Wed, Oct 26, 2022 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Oct 21, 2022 at 6:32 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > I've started to review this patch. I tested v40-0001 patch and have
> > > one question:
> > >
> > > IIUC even when most of the changes in the transaction are filtered out
> > > in pgoutput (eg., by relation filter or row filter), the walsender
> > > sends STREAM_START. This means that the subscriber could end up
> > > launching parallel apply workers also for almost empty (and streamed)
> > > transactions. For example, I created three subscriptions each of which
> > > subscribes to a different table. When I loaded a large amount of data
> > > into one table, all three (leader) apply workers received START_STREAM
> > > and launched their parallel apply workers.
> > >
> >
> > The apply workers will be launched just the first time then we
> > maintain a pool so that we don't need to restart them.
> >
> > > However, two of them
> > > finished without applying any data. I think this behaviour looks
> > > problematic since it wastes workers and rather decreases the apply
> > > performance if the changes are not large. Is it worth considering a
> > > way to delay launching a parallel apply worker until we find out the
> > > amount of changes is actually large?
> > >
> >
> > I think even if changes are less there may not be much difference
> > because we have observed that the performance improvement comes from
> > not writing to file.
> >
> > > For example, the leader worker
> > > writes the streamed changes to files as usual and launches a parallel
> > > worker if the amount of changes exceeds a threshold or the leader
> > > receives the second segment. After that, the leader worker switches to
> > > send the streamed changes to parallel workers via shm_mq instead of
> > > files.
> > >
> >
> > I think writing to file won't be a good idea as that can hamper the
> > performance benefit in some cases and not sure if it is worth.
> >
>
> I tried to test some cases that only a small part of the transaction or an empty
> transaction is sent to subscriber, to see if using streaming parallel will bring
> performance degradation.
>
> The test was performed ten times, and the average was taken.
> The results are as follows. The details and the script of the test is attached.
>
> 10% of rows are sent
> ----------------------------------
> HEAD            24.4595
> patched         18.4545
>
> 5% of rows are sent
> ----------------------------------
> HEAD            21.244
> patched         17.9655
>
> 0% of rows are sent
> ----------------------------------
> HEAD            18.0605
> patched         17.893
>
>
> It shows that when only 5% or 10% of rows are sent to subscriber, using parallel
> apply takes less time than HEAD, and even if all rows are filtered there's no
> performance degradation.

Thank you for the testing!

I think this performance improvement comes both from applying changes
in parallel with receiving them and from avoiding writing to a file. I'm
happy to know there is also a benefit for small streaming
transactions. I've also measured the overhead when processing
streamed empty transactions and confirmed the overhead is negligible.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
On Tue, Oct 25, 2022 2:56 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> 
> On Tues, Oct 25, 2022 at 14:28 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > FYI - After a recent push, the v40-0001 patch no longer applies on the
> > latest HEAD.
> >
> > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> > ../patches_misc/v40-0001-Perform-streaming-logical-transactions-by-
> > parall.patch
> > error: patch failed: src/backend/replication/logical/launcher.c:54
> > error: src/backend/replication/logical/launcher.c: patch does not apply
> 
> Thanks for your reminder.
> 
> I just rebased the patch set for review.
> The new patch set will be shared later when the comments in this thread are
> addressed.
> 

I tried to write a draft patch to force streaming of every change instead of
waiting until logical_decoding_work_mem is exceeded, which could help to test
streaming=parallel. The patch is attached. It is based on the v41-0001 patch.

With this patch, I saw a problem that the subscription option "origin" doesn't
work when using streaming=parallel. That's because when the parallel apply
worker writes the WAL for the changes, replorigin_session_origin is
InvalidRepOriginId. In the current patch, an origin can be active in only one
process at a time.

To fix it, maybe we need to remove this restriction, as we did in an old
version of the patch.
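
(Roughly speaking, the parallel apply worker would then need to set up the
origin itself before applying changes, e.g. something like the sketch below;
this assumes the one-origin-per-process restriction is relaxed and is only
meant to illustrate the idea.)

```
	char		originname[NAMEDATALEN];
	RepOriginId	originid;

	snprintf(originname, sizeof(originname), "pg_%u", MySubscription->oid);
	originid = replorigin_by_name(originname, false);

	/* Currently errors out if another process already owns this origin. */
	replorigin_session_setup(originid);
	replorigin_session_origin = originid;
```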

Regards
Shi yu

Attachment
On Fri, Oct 28, 2022 at 3:04 PM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:
>
> On Tue, Oct 25, 2022 2:56 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
>
> I tried to write a draft patch to force streaming every change instead of
> waiting until logical_decoding_work_mem is exceeded, which could help to test
> streaming parallel. Attach the patch. This is based on v41-0001 patch.
>

Thanks, I think this is quite useful for testing.

> With this patch, I saw a problem that the subscription option "origin" doesn't
> work when using streaming parallel. That's because when the parallel apply
> worker writing the WAL for the changes, replorigin_session_origin is
> InvalidRepOriginId. In current patch, origin can be active only in one process
> at-a-time.
>
> To fix it, maybe we need to remove this restriction, like what we did in the old
> version of patch.
>

Agreed, we need to allow using origins for writing all the changes by
the parallel worker.


--
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > About your point that having different partition structures for
> > > > publisher and subscriber, I don't know how common it will be once we
> > > > have DDL replication. Also, the default value of
> > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > that this is a quite common case.
> > >
> > > So how can we consider these concurrent issues that could happen only
> > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > the problem or can we have a safeguard against these conflicts?
> > >
> >
> > Yeah, right now the strategy is to disallow parallel apply for such
> > cases as you can see in *0003* patch.
>
> Tightening the restrictions could work in some cases but there might
> still be coner cases and it could reduce the usability. I'm not really
> sure that we can ensure such a deadlock won't happen with the current
> restrictions. I think we need something safeguard just in case. For
> example, if the leader apply worker is waiting for a lock acquired by
> its parallel worker, it cancels the parallel worker's transaction,
> commits its transaction, and restarts logical replication. Or the
> leader can log the deadlock to let the user know.
>

As another direction, we could make the parallel apply feature robust
if we can detect deadlocks that happen among the leader worker and
parallel workers. I'd like to summarize the idea discussed off-list
(with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
that when the leader worker or parallel worker needs to wait for
something (e.g., transaction completion, messages), we use lmgr
functionality so that we can create wait-for edges and detect
deadlocks in lmgr.

For example, a scenario where a deadlock occurs is the following:

[Publisher]
create table tab1(a int);
create publication pub for table tab1;

[Subscriber]
create table tab1(a int primary key);
create subscription sub connection 'port=10000 dbname=postgres'
publication pub with (streaming = parallel);

TX1:
BEGIN;
INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
    Tx2:
    BEGIN;
    INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
    COMMIT;
COMMIT;

Suppose a parallel apply worker (PA-1) is executing TX-1 and the
leader apply worker (LA) is executing TX-2 concurrently on the
subscriber. Now, LA is waiting for PA-1 because of the unique key of
tab1 while PA-1 is waiting for LA to send further messages. There is a
deadlock between PA-1 and LA but lmgr cannot detect it.

One idea to resolve this issue is that we have LA acquire a session
lock on a shared object (by LockSharedObjectForSession()) and have
PA-1 wait on the lock before trying to receive messages. IOW,  LA
acquires the lock before sending STREAM_STOP and releases it if
already acquired before sending STREAM_START, STREAM_PREPARE and
STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
processing STREAM_STOP and then release immediately after acquiring
it. That way, when PA-1 is waiting for LA, we can have a wait-edge
from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:

LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
object) -> LA

We would need the shared objects per parallel apply worker.
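
To make the protocol a bit more concrete, a rough sketch using the existing
lmgr API could look like the following (stream_lock_id stands for the
per-worker shared object id and is only illustrative):

```
/* LA: before sending STREAM_STOP */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
						   stream_lock_id, AccessExclusiveLock);

/* LA: before sending STREAM_START, STREAM_PREPARE or STREAM_COMMIT */
UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
							 stream_lock_id, AccessExclusiveLock);

/*
 * PA-1: after processing STREAM_STOP.  This blocks while LA still holds
 * the lock, which gives lmgr a PA-1 -> LA wait-for edge; the lock is
 * released immediately after it is acquired.
 */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
						   stream_lock_id, AccessShareLock);
UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
							 stream_lock_id, AccessShareLock);
```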

After detecting a deadlock, we can restart logical replication with
temporarily disabling the parallel apply, which is done by 0005 patch.

Another scenario is similar to the previous case but TX-1 and TX-2 are
executed by two parallel apply workers (PA-1 and PA-2 respectively).
In this scenario, PA-2 is waiting for PA-1 to complete its transaction
while PA-1 is waiting for subsequent input from LA. Also, LA is
waiting for PA-2 to complete its transaction in order to preserve the
commit order. There is a deadlock among three processes but it cannot
be detected in lmgr because the fact that LA is waiting for PA-2 to
complete its transaction doesn't appear in lmgr (see
parallel_apply_wait_for_xact_finish()). To fix it, we can use
XactLockTableWait() instead.

However, since XactLockTableWait() considers PREPARED TRANSACTION as
still in progress, we probably need a similar trick as above in the case
where a transaction is prepared. For example, suppose that TX-2 was
prepared instead of committed in the above scenario, PA-2 acquires
another shared lock at START_STREAM and releases it at
STREAM_COMMIT/PREPARE. LA can wait on the lock.

Yet another scenario where LA has to wait is the case where the shm_mq
buffer is full. In the above scenario (i.e., PA-1 and PA-2 are executing
transactions concurrently), if the shm_mq buffer between LA and PA-2
is full, LA has to wait to send messages, and this wait doesn't appear
in lmgr. To fix it, we probably have to use a non-blocking write and
wait with a timeout. If the timeout is exceeded, LA will write to a file
and indicate to PA-2 that it needs to read the file for the remaining
messages. Then LA will start waiting for the commit, which will detect
a deadlock if there is one.
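
In pseudo-code, the send path in LA could look roughly like this (the timeout
handling and the spill-to-file helper are only illustrative):

```
	TimestampTz start = GetCurrentTimestamp();

	for (;;)
	{
		shm_mq_result res;

		/* Non-blocking write so that a full queue cannot block LA forever. */
		res = shm_mq_send(winfo->mq_handle, nbytes, data, true, true);

		if (res == SHM_MQ_SUCCESS)
			break;
		if (res == SHM_MQ_DETACHED)
			ereport(ERROR,
					(errmsg("parallel apply worker exited unexpectedly")));

		/* SHM_MQ_WOULD_BLOCK: give up after a timeout and spill to a file. */
		if (TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
									   SHM_SEND_TIMEOUT_MS))
		{
			serialize_remaining_changes(winfo);	/* hypothetical helper */
			break;
		}

		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 10L,
						 WAIT_EVENT_MQ_SEND);
		ResetLatch(MyLatch);
	}
```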

If we can detect deadlocks via such functionality or some other way,
then we don't need to tighten the restrictions on the subscribed
tables' schemas, etc.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > > >
> > > > > About your point that having different partition structures for
> > > > > publisher and subscriber, I don't know how common it will be once we
> > > > > have DDL replication. Also, the default value of
> > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > that this is a quite common case.
> > > >
> > > > So how can we consider these concurrent issues that could happen only
> > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > the problem or can we have a safeguard against these conflicts?
> > > >
> > >
> > > Yeah, right now the strategy is to disallow parallel apply for such
> > > cases as you can see in *0003* patch.
> >
> > Tightening the restrictions could work in some cases but there might
> > still be coner cases and it could reduce the usability. I'm not really
> > sure that we can ensure such a deadlock won't happen with the current
> > restrictions. I think we need something safeguard just in case. For
> > example, if the leader apply worker is waiting for a lock acquired by
> > its parallel worker, it cancels the parallel worker's transaction,
> > commits its transaction, and restarts logical replication. Or the
> > leader can log the deadlock to let the user know.
> >
> 
> As another direction, we could make the parallel apply feature robust
> if we can detect deadlocks that happen among the leader worker and
> parallel workers. I'd like to summarize the idea discussed off-list
> (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> that when the leader worker or parallel worker needs to wait for
> something (eg. transaction completion, messages) we use lmgr
> functionality so that we can create wait-for edges and detect
> deadlocks in lmgr.
> 
> For example, a scenario where a deadlock occurs is the following:
> 
> [Publisher]
> create table tab1(a int);
> create publication pub for table tab1;
> 
> [Subcriber]
> creat table tab1(a int primary key);
> create subscription sub connection 'port=10000 dbname=postgres'
> publication pub with (streaming = parallel);
> 
> TX1:
> BEGIN;
> INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
>     Tx2:
>     BEGIN;
>     INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
>     COMMIT;
> COMMIT;
> 
> Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> leader apply worker (LA) is executing TX-2 concurrently on the
> subscriber. Now, LA is waiting for PA-1 because of the unique key of
> tab1 while PA-1 is waiting for LA to send further messages. There is a
> deadlock between PA-1 and LA but lmgr cannot detect it.
> 
> One idea to resolve this issue is that we have LA acquire a session
> lock on a shared object (by LockSharedObjectForSession()) and have
> PA-1 wait on the lock before trying to receive messages. IOW,  LA
> acquires the lock before sending STREAM_STOP and releases it if
> already acquired before sending STREAM_START, STREAM_PREPARE and
> STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> processing STREAM_STOP and then release immediately after acquiring
> it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> 
> LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> object) -> LA
> 
> We would need the shared objects per parallel apply worker.
> 
> After detecting a deadlock, we can restart logical replication with
> temporarily disabling the parallel apply, which is done by 0005 patch.
> 
> Another scenario is similar to the previous case but TX-1 and TX-2 are
> executed by two parallel apply workers (PA-1 and PA-2 respectively).
> In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> while PA-1 is waiting for subsequent input from LA. Also, LA is
> waiting for PA-2 to complete its transaction in order to preserve the
> commit order. There is a deadlock among three processes but it cannot
> be detected in lmgr because the fact that LA is waiting for PA-2 to
> complete its transaction doesn't appear in lmgr (see
> parallel_apply_wait_for_xact_finish()). To fix it, we can use
> XactLockTableWait() instead.
> 
> However, since XactLockTableWait() considers PREPARED TRANSACTION as
> still in progress, probably we need a similar trick as above in case
> where a transaction is prepared. For example, suppose that TX-2 was
> prepared instead of committed in the above scenario, PA-2 acquires
> another shared lock at START_STREAM and releases it at
> STREAM_COMMIT/PREPARE. LA can wait on the lock.
> 
> Yet another scenario where LA has to wait is the case where the shm_mq
> buffer is full. In the above scenario (ie. PA-1 and PA-2 are executing
> transactions concurrently), if  the shm_mq buffer between LA and PA-2
> is full, LA has to wait to send messages, and this wait doesn't appear
> in lmgr. To fix it, probably we have to use non-blocking write and
> wait with a timeout. If timeout is exceeded, the LA will write to file
> and indicate PA-2 that it needs to read file for remaining messages.
> Then LA will start waiting for commit which will detect deadlock if
> any.
> 
> If we can detect deadlocks by having such a functionality or some
> other way then we don't need to tighten the restrictions of subscribed
> tables' schemas etc.

Thanks for the analysis and summary!

I tried to implement the above idea and here is the patch set. I have done some
basic tests for the new code and it works fine, but I am going to test some
corner cases to make sure all the code works fine. I removed the old 0003 patch
which was used to check parallel apply safety, because now we can detect the
deadlock problem.

Besides, there are a few tasks left which I will handle soon and update the patch set:

* Address the previous comments from Amit[1], Shi-san[2] and Peter[3] (already done but not yet merged).
* Rebase the original 0005 patch which is "retry to apply streaming xact only in leader apply worker".
* Adjust some comments and documentation related to the new code.

[1] https://www.postgresql.org/message-id/CAA4eK1Lsn%3D_gz1-3LqZ-wEDQDmChUsOX8LvHS8WV39wC1iRR%3DQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/OSZPR01MB631042582805A8E8615BC413FD329%40OSZPR01MB6310.jpnprd01.prod.outlook.com
[3] https://www.postgresql.org/message-id/CAHut%2BPsJWHRoRzXtMrJ1RaxmkS2LkiMR_4S2pSionxXmYsyOww%40mail.gmail.com

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thank you for updating the patch!
While testing it, I found that the leader apply worker crashes in the following case.
I will dig into the failure more, but I am reporting it here for the record.


1. Change the macros to force writing to a temporary file.

```
-#define CHANGES_THRESHOLD      1000
-#define SHM_SEND_TIMEOUT_MS    10000
+#define CHANGES_THRESHOLD      10
+#define SHM_SEND_TIMEOUT_MS    100
```

2. Set logical_decoding_work_mem to 64kB on publisher

3. Insert huge data on publisher

```
publisher=# \d tbl 
                Table "public.tbl"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 c      | integer |           |          | 
Publications:
    "pub"


publisher=# BEGIN;
BEGIN
publisher=*# INSERT INTO tbl SELECT i FROM generate_series(1, 5000000) s(i);
INSERT 0 5000000
publisher=*# COMMIT;
```

-> LA crashes on the subscriber! The following is the backtrace.


```
(gdb) bt
#0  0x00007f2663ae4387 in raise () from /lib64/libc.so.6
#1  0x00007f2663ae5a78 in abort () from /lib64/libc.so.6
#2  0x0000000000ad0a95 in ExceptionalCondition (conditionName=0xcabdd0 "mqh->mqh_partial_bytes <= nbytes", 
    fileName=0xcabc30 "../src/backend/storage/ipc/shm_mq.c", lineNumber=420) at ../src/backend/utils/error/assert.c:66
#3  0x00000000008eaeb7 in shm_mq_sendv (mqh=0x271ebd8, iov=0x7ffc664a2690, iovcnt=1, nowait=false, force_flush=true)
    at ../src/backend/storage/ipc/shm_mq.c:420
#4  0x00000000008eac5a in shm_mq_send (mqh=0x271ebd8, nbytes=1, data=0x271f3c0, nowait=false, force_flush=true)
    at ../src/backend/storage/ipc/shm_mq.c:338
#5  0x0000000000880e18 in parallel_apply_free_worker (winfo=0x271f270, xid=735, stop_worker=true)
    at ../src/backend/replication/logical/applyparallelworker.c:368
#6  0x00000000008a3638 in apply_handle_stream_commit (s=0x7ffc664a2790) at
../src/backend/replication/logical/worker.c:2081
#7  0x00000000008a54da in apply_dispatch (s=0x7ffc664a2790) at ../src/backend/replication/logical/worker.c:3195
#8  0x00000000008a5a76 in LogicalRepApplyLoop (last_received=378674872) at
../src/backend/replication/logical/worker.c:3431
#9  0x00000000008a72ac in start_apply (origin_startpos=0) at ../src/backend/replication/logical/worker.c:4245
#10 0x00000000008a7d77 in ApplyWorkerMain (main_arg=0) at ../src/backend/replication/logical/worker.c:4555
#11 0x000000000084983c in StartBackgroundWorker () at ../src/backend/postmaster/bgworker.c:861
#12 0x0000000000854192 in do_start_bgworker (rw=0x26c0d20) at ../src/backend/postmaster/postmaster.c:5801
#13 0x000000000085457c in maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:6025
#14 0x000000000085350b in sigusr1_handler (postgres_signal_arg=10) at ../src/backend/postmaster/postmaster.c:5182
#15 <signal handler called>
#16 0x00007f2663ba3b23 in __select_nocancel () from /lib64/libc.so.6
#17 0x000000000084edbc in ServerLoop () at ../src/backend/postmaster/postmaster.c:1768
#18 0x000000000084e737 in PostmasterMain (argc=3, argv=0x2690f60) at ../src/backend/postmaster/postmaster.c:1476
#19 0x000000000074adfb in main (argc=3, argv=0x2690f60) at ../src/backend/main/main.c:197
``` 

PSA the script that can reproduce the failure on my environment. 

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Thanks for the analysis and summary !
>
> I tried to implement the above idea and here is the patch set.
>

Few comments on v42-0001
===========================
1.
+ /*
+ * Set the xact_state flag in the leader instead of the
+ * parallel apply worker to avoid the race condition where the leader has
+ * already started waiting for the parallel apply worker to finish
+ * processing the transaction while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * xact_state to true.
+ */
+ SpinLockAcquire(&winfo->shared->mutex);
+ winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;

The comments and code for xact_state don't seem to match.

2.
+ * progress. This could happend as we don't wait for transaction rollback
+ * to finish.
+ */

/happend/happen

3.
+/* Helper function to release a lock with lockid */
+void
+parallel_apply_lock(uint16 lockid)
...
...
+/* Helper function to take a lock with lockid */
+void
+parallel_apply_unlock(uint16 lockid)

Here, the comments seem to be reversed.

4.
+parallel_apply_lock(uint16 lockid)
+{
+ MemoryContext oldcontext;
+
+ if (list_member_int(ParallelApplyLockids, lockid))
+ return;
+
+ LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
+    lockid, am_leader_apply_worker() ?
+    AccessExclusiveLock:
+    AccessShareLock);

This appears odd to me because it forecloses the option of the parallel
apply worker ever acquiring this lock in exclusive mode. I think it
would be better to have lock_mode as one of the parameters of this
API.

5.
+ * Inintialize fileset if not yet and open the file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)

Typo. /Inintialize/Initialize

6.
parallel_apply_setup_dsm()
{
...
+ shared->xact_state = false;

xact_state should be set with one of the values of ParallelTransState.

7.
/*
+ * Don't use SharedFileSet here because the fileset is shared by the leader
+ * worker and the fileset in leader need to survive after releasing the
+ * shared memory

This comment seems a bit unclear to me. Should there be an 'and' between
'leader' and 'worker'? If so, then the following 'and' won't make sense.

8.
+apply_handle_stream_stop(StringInfo s)
{
...
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If there is no message left, wait for the leader to release the
+ * lock and send more messages.
+ */
+ if (pg_atomic_sub_fetch_u32(&(MyParallelShared->left_message), 1) == 0)
+ parallel_apply_lock(MyParallelShared->stream_lock_id);

As per Sawada-San's email [1], this lock should be released
immediately after we acquire it. If we do so, then we don't need to
unlock separately in apply_handle_stream_start() in the below code and
at similar places in stream_prepare, stream_commit, and stream_abort.
Is there a reason for doing it differently?

apply_handle_stream_start(StringInfo s)
{
...
+ case TRANS_PARALLEL_APPLY:
...
+ /*
+ * Unlock the shared object lock so that the leader apply worker
+ * can continue to send changes.
+ */
+ parallel_apply_unlock(MyParallelShared->stream_lock_id);


9.
+parallel_apply_spooled_messages(void)
{
...
+ if (fileset_valid)
+ {
+ in_streamed_transaction = false;
+
+ parallel_apply_lock(MyParallelShared->transaction_lock_id);

Is there a reason to acquire this lock here if the parallel apply
worker will acquire it at stream_start?

10.
+ winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
+ winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();

Why can't we use xid (remote_xid) for one of these and local_xid (one
generated by parallel apply) for the other? I was a bit worried about
the local_xid because it will be generated only after applying the
first message but the patch already seems to be waiting for it in
parallel_apply_wait_for_xact_finish as seen in the below code.

+void
+parallel_apply_wait_for_xact_finish(ParallelApplyWorkerShared *wshared)
+{
+ /*
+ * Wait until the parallel apply worker handles the first message and
+ * set the flag to true.
+ */
+ parallel_apply_wait_for_in_xact(wshared, PARALLEL_TRANS_STARTED);
+
+ /* Wait for the transaction lock to be released. */
+ parallel_apply_lock(wshared->transaction_lock_id);

[1] - https://www.postgresql.org/message-id/CAD21AoCWovvhGBD2uKcQqbk6px6apswuBrs6dR9%2BWhP1j2LdsQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
> While testing yours, I found that the leader apply worker has been crashed in the
> following case.
> I will dig the failure more, but I reported here for records.

I found the reason why the leader apply worker crashes.
In parallel_apply_free_worker() the leader sends the pending message to the parallel apply worker:

```
+               /*
+                * Resend the pending message to parallel apply worker to cleanup the
+                * queue. Note that parallel apply worker will just ignore this message
+                * as it has already handled this message while applying spooled
+                * messages.
+                */
+               result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
+                                                        winfo->pending_msg, false, true);
```

...but the message length should not be calculated by strlen() because the logicalrep message can contain '\0'.
PSA the patch to fix it. It can be applied on top of the v42 patch set.
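
(Conceptually, the fix is to remember the original message length instead of
recomputing it with strlen(), e.g. something like the following; the field name
is only illustrative, please see the attached patch for the actual change.)

```
-               result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
-                                                        winfo->pending_msg, false, true);
+               result = shm_mq_send(winfo->mq_handle, winfo->pending_msg_len,
+                                                        winfo->pending_msg, false, true);
```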


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Fri, Nov 4, 2022 at 1:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set.
> >
>
> Few comments on v42-0001
> ===========================
>

Few more comments on v42-0001
===============================
1. In parallel_apply_send_data(), it seems winfo->serialize_changes
and switching_to_serialize are set to indicate that we have changed
parallel to serialize mode. Isn't using just the
switching_to_serialize sufficient? Also, it would be better to name
switching_to_serialize as parallel_to_serialize or something like
that.

2. In parallel_apply_send_data(), the patch has already initialized
the fileset, and then again in apply_handle_stream_start(), it will do
the same if we fail while sending stream_start message to the parallel
worker. It seems we don't need to initialize fileset again for
TRANS_LEADER_PARTIAL_SERIALIZE state in apply_handle_stream_start()
unless I am missing something.

3.
apply_handle_stream_start(StringInfo s)
{
...
+ if (!first_segment)
+ {
+ /*
+ * Unlock the shared object lock so that parallel apply worker
+ * can continue to receive and apply changes.
+ */
+ parallel_apply_unlock(winfo->shared->stream_lock_id);
...
}

Can we have an assert before this unlock call that the lock must be
held? Similarly, if there are other places then we can have assert
there as well.

4. It is not very clear to me how maintaining ParallelApplyLockids
list is helpful.

5.
/*
+ * Handle STREAM START message when the transaction was spilled to disk.
+ *
+ * Inintialize fileset if not yet and open the file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)
+{
+ /*
+ * Start a transaction on stream start,

This function's name and comments seem to indicate that it is to
handle stream_start message. Is that really the case? It is being
called from parallel_apply_send_data() which made me think it can be
used from other places as well.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, November 4, 2022 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set.
> >
> 
> Few comments on v42-0001
> ===========================

Thanks for the comments.

> 
> 10.
> + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> 
> Why can't we use xid (remote_xid) for one of these and local_xid (one generated
> by parallel apply) for the other? I was a bit worried about the local_xid because it
> will be generated only after applying the first message but the patch already
> seems to be waiting for it in parallel_apply_wait_for_xact_finish as seen in the
> below code.
> 
> +void
> +parallel_apply_wait_for_xact_finish(ParallelApplyWorkerShared *wshared)
> +{
> + /*
> + * Wait until the parallel apply worker handles the first message and
> + * set the flag to true.
> + */
> + parallel_apply_wait_for_in_xact(wshared, PARALLEL_TRANS_STARTED);
> +
> + /* Wait for the transaction lock to be released. */
> + parallel_apply_lock(wshared->transaction_lock_id);

I also considered using xid for these locks, but it seems the objsubid for the
shared object lock is 16bit while xid is 32 bit. So, I tried to generate a unique 16bit id
here. I will think more on this and maybe I need to add some comments to
explain this.

Best regards,
Hou zj

On Fri, Nov 4, 2022 at 7:35 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, November 4, 2022 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Thanks for the analysis and summary !
> > >
> > > I tried to implement the above idea and here is the patch set.
> > >
> >
> > Few comments on v42-0001
> > ===========================
>
> Thanks for the comments.
>
> >
> > 10.
> > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> >
> > Why can't we use xid (remote_xid) for one of these and local_xid (one generated
> > by parallel apply) for the other?
...
...
>
> I also considered using xid for these locks, but it seems the objsubid for the
> shared object lock is 16bit while xid is 32 bit. So, I tried to generate a unique 16bit id
> here.
>

Okay, I see your point. Can we think of having a new lock tag for this
with classid, objid, objsubid for the first three fields of locktag
field? We can use a new macro SET_LOCKTAG_APPLY_TRANSACTION and a
common function to set the tag and acquire the lock. One more point
related to this is that I am suggesting classid by referring to
SET_LOCKTAG_OBJECT as that is used in the current patch, but do you
think we need it for our purpose? Won't subscription id and xid
uniquely identify the tag?

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Saturday, November 5, 2022 1:43 PM Amit Kapila <amit.kapila16@gmail.com>
> 
> On Fri, Nov 4, 2022 at 7:35 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, November 4, 2022 4:07 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thanks for the analysis and summary !
> > > >
> > > > I tried to implement the above idea and here is the patch set.
> > > >
> > >
> > > Few comments on v42-0001
> > > ===========================
> >
> > Thanks for the comments.
> >
> > >
> > > 10.
> > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > + winfo->shared->transaction_lock_id =
> > > + winfo->shared->parallel_apply_get_unique_id();
> > >
> > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > (one generated by parallel apply) for the other?
> ...
> ...
> >
> > I also considered using xid for these locks, but it seems the objsubid
> > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > to generate a unique 16bit id here.
> >
> 
> Okay, I see your point. Can we think of having a new lock tag for this with classid,
> objid, objsubid for the first three fields of locktag field? We can use a new
> macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> tag and acquire the lock. One more point related to this is that I am suggesting
> classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> patch but do you think we need it for our purpose, won't subscription id and
> xid can uniquely identify the tag?

I agree that it could be better to have a new lock tag. Another point is that
the remote xid and Local xid could be the same in some rare cases, so I think
we might need to add another identifier to make it unique.

Maybe :
locktag_field1 : subscription oid
locktag_field2 : xid(remote or local)
locktag_field3 : 0(lock for stream block)/1(lock for transaction)

Best regards,
Hou zj

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Sun, Nov 6, 2022 at 3:40 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, November 5, 2022 1:43 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > On Fri, Nov 4, 2022 at 7:35 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > Thanks for the analysis and summary !
> > > > >
> > > > > I tried to implement the above idea and here is the patch set.
> > > > >
> > > >
> > > > Few comments on v42-0001
> > > > ===========================
> > >
> > > Thanks for the comments.
> > >
> > > >
> > > > 10.
> > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > + winfo->shared->transaction_lock_id =
> > > > + winfo->shared->parallel_apply_get_unique_id();
> > > >
> > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > (one generated by parallel apply) for the other?
> > ...
> > ...
> > >
> > > I also considered using xid for these locks, but it seems the objsubid
> > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > to generate a unique 16bit id here.
> > >
> >
> > Okay, I see your point. Can we think of having a new lock tag for this with classid,
> > objid, objsubid for the first three fields of locktag field? We can use a new
> > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > tag and acquire the lock. One more point related to this is that I am suggesting
> > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > patch but do you think we need it for our purpose, won't subscription id and
> > xid can uniquely identify the tag?
>
> I agree that it could be better to have a new lock tag. Another point is that
> the remote xid and Local xid could be the same in some rare cases, so I think
> we might need to add another identifier to make it unique.
>
> Maybe :
> locktag_field1 : subscription oid
> locktag_field2 : xid(remote or local)
> locktag_field3 : 0(lock for stream block)/1(lock for transaction)

Or I think we can use locktag_field2 for remote xid and locktag_field3
for local xid.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Nov 6, 2022 at 3:40 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Saturday, November 5, 2022 1:43 PM Amit Kapila <amit.kapila16@gmail.com>
> > >
> > > On Fri, Nov 4, 2022 at 7:35 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > > <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > > > > <houzj.fnst@fujitsu.com> wrote:
> > > > > >
> > > > > > Thanks for the analysis and summary !
> > > > > >
> > > > > > I tried to implement the above idea and here is the patch set.
> > > > > >
> > > > >
> > > > > Few comments on v42-0001
> > > > > ===========================
> > > >
> > > > Thanks for the comments.
> > > >
> > > > >
> > > > > 10.
> > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > + winfo->shared->transaction_lock_id =
> > > > > + winfo->shared->parallel_apply_get_unique_id();
> > > > >
> > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > (one generated by parallel apply) for the other?
> > > ...
> > > ...
> > > >
> > > > I also considered using xid for these locks, but it seems the objsubid
> > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > to generate a unique 16bit id here.
> > > >
> > >
> > > Okay, I see your point. Can we think of having a new lock tag for this with classid,
> > > objid, objsubid for the first three fields of locktag field? We can use a new
> > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > tag and acquire the lock. One more point related to this is that I am suggesting
> > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > > patch but do you think we need it for our purpose, won't subscription id and
> > > xid can uniquely identify the tag?
> >
> > I agree that it could be better to have a new lock tag. Another point is that
> > the remote xid and Local xid could be the same in some rare cases, so I think
> > we might need to add another identifier to make it unique.
> >
> > Maybe :
> > locktag_field1 : subscription oid
> > locktag_field2 : xid(remote or local)
> > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
>
> Or I think we can use locktag_field2 for remote xid and locktag_field3
> for local xid.
>

We can do that way as well but OTOH, I think for the local
transactions we don't need subscription oid, so field1 could be
InvalidOid and field2 will be xid of local xact. Won't that be better?

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Nov 6, 2022 at 3:40 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Saturday, November 5, 2022 1:43 PM Amit Kapila <amit.kapila16@gmail.com>
> > > >
> > > > On Fri, Nov 4, 2022 at 7:35 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > > > <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > > > > > <houzj.fnst@fujitsu.com> wrote:
> > > > > > >
> > > > > > > Thanks for the analysis and summary !
> > > > > > >
> > > > > > > I tried to implement the above idea and here is the patch set.
> > > > > > >
> > > > > >
> > > > > > Few comments on v42-0001
> > > > > > ===========================
> > > > >
> > > > > Thanks for the comments.
> > > > >
> > > > > >
> > > > > > 10.
> > > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > > + winfo->shared->transaction_lock_id =
> > > > > > + winfo->shared->parallel_apply_get_unique_id();
> > > > > >
> > > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > > (one generated by parallel apply) for the other?
> > > > ...
> > > > ...
> > > > >
> > > > > I also considered using xid for these locks, but it seems the objsubid
> > > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > > to generate a unique 16bit id here.
> > > > >
> > > >
> > > > Okay, I see your point. Can we think of having a new lock tag for this with classid,
> > > > objid, objsubid for the first three fields of locktag field? We can use a new
> > > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > > tag and acquire the lock. One more point related to this is that I am suggesting
> > > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > > > patch but do you think we need it for our purpose, won't subscription id and
> > > > xid can uniquely identify the tag?
> > >
> > > I agree that it could be better to have a new lock tag. Another point is that
> > > the remote xid and Local xid could be the same in some rare cases, so I think
> > > we might need to add another identifier to make it unique.
> > >
> > > Maybe :
> > > locktag_field1 : subscription oid
> > > locktag_field2 : xid(remote or local)
> > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> >
> > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > for local xid.
> >
>
> We can do that way as well but OTOH, I think for the local
> transactions we don't need subscription oid, so field1 could be
> InvalidOid and field2 will be xid of local xact. Won't that be better?

This would work. But I'm a bit concerned that we cannot identify which
subscriptions the lock belongs to when checking pg_locks view.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Nov 7, 2022 at 10:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > I agree that it could be better to have a new lock tag. Another point is that
> > > > the remote xid and Local xid could be the same in some rare cases, so I think
> > > > we might need to add another identifier to make it unique.
> > > >
> > > > Maybe :
> > > > locktag_field1 : subscription oid
> > > > locktag_field2 : xid(remote or local)
> > > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> > >
> > > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > > for local xid.
> > >
> >
> > We can do that way as well but OTOH, I think for the local
> > transactions we don't need subscription oid, so field1 could be
> > InvalidOid and field2 will be xid of local xact. Won't that be better?
>
> This would work. But I'm a bit concerned that we cannot identify which
> subscriptions the lock belongs to when checking pg_locks view.
>

Fair point. I think if the user wants, she can join with
pg_stat_subscription based on PID and find the corresponding
subscription. However, if we want to identify everything via pg_locks
then I think we should also mention classid or database id as field1.
So, it would look like: field1: (pg_subscription's oid or current db
id); field2: OID of subscription in pg_subscription; field3: local or
remote xid; field4: 0/1 to differentiate between remote and local xid.
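
Just to illustrate (a sketch only, not actual code), such a tag could be
set along the lines of the existing SET_LOCKTAG_OBJECT macro, with
LOCKTAG_APPLY_TRANSACTION being the hypothetical new LockTagType and the
fields mapped as described above:

```
/* Hypothetical sketch of a lock tag for apply transactions. */
#define SET_LOCKTAG_APPLY_TRANSACTION(locktag,dboid,suboid,xid,islocal) \
	((locktag).locktag_field1 = (dboid), \
	 (locktag).locktag_field2 = (suboid), \
	 (locktag).locktag_field3 = (xid), \
	 (locktag).locktag_field4 = (islocal), \
	 (locktag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
```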

-- 
With Regards,
Amit Kapila.



Here are my review comments for v42-0001

======

1. General.

Please take the time to process all new code comments using a
grammar/spelling checker (e.g. simply cut/paste them into MSWord or
Grammarly or any other tool of your choice as a quick double-check)
*before* posting the patches; too many of my review comments are about
code comments, and it's taking a long time to keep cycling through
reporting/fixing/confirming comments for every patch version, whereas
it would probably take hardly any time to make the same
spelling/grammar corrections up-front.


======

.../replication/logical/applyparallelworker.c

2. ParallelApplyLockids

This seems like a bogus name. The code is using it in a way that means
the subset of lockED ids, not the list of all the lock ids.

OTOH, having another list of ALL lock-ids might be useful (for
detecting unique ids) if you are able to maintain such a list safely.

~~~

3. parallel_apply_can_start

+
+ if (switching_to_serialize)
+ return false;

This should have an explanatory comment.

~~~

4. parallel_apply_start_worker

+ /* Check if the transaction in that worker has been finished. */
+ xact_state = parallel_apply_get_xact_state(tmp_winfo->shared);
+ if (xact_state == PARALLEL_TRANS_FINISHED)

"has been finished." -> "has finished."

~~~

5.

+ /*
+ * Set the xact_state flag in the leader instead of the
+ * parallel apply worker to avoid the race condition where the leader has
+ * already started waiting for the parallel apply worker to finish
+ * processing the transaction while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * xact_state to true.
+ */
+ SpinLockAcquire(&winfo->shared->mutex);
+ winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
+ winfo->shared->xid = xid;
+ winfo->shared->fileset_valid = false;
+ winfo->shared->partial_sent_message = false;
+ SpinLockRelease(&winfo->shared->mutex);

This code comment is stale, because xact_state is no longer a "flag",
nor does "set the xact_state to true." make sense anymore.

~~~

6. parallel_apply_free_worker

+ /*
+ * Don't free the worker if the transaction in the worker is still in
+ * progress. This could happend as we don't wait for transaction rollback
+ * to finish.
+ */
+ if (parallel_apply_get_xact_state(winfo->shared) < PARALLEL_TRANS_FINISHED)
+ return;

6a.
typo "happend"

~

6b.
Saying "< PARALLEL_TRANS_FINISHED" seems kind of risky because not it
is assuming a specific ordering of those enums which has never been
mentioned before. I think it will be safer to say "!=
PARALLEL_TRANS_FINISHED" instead. Alternatively, if the enum order is
important then it must be documented with the typedef so that nobody
changes it.
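
For example, something like the following sketch (illustration only,
using the state names from the patch) could document the ordering with
the typedef:

```
/*
 * Transaction state of a parallel apply worker.
 *
 * Note: the order of these values matters; the code compares them with
 * < and >= to check how far the transaction has progressed, so new
 * values must not be inserted out of order.
 */
typedef enum ParallelTransState
{
	PARALLEL_TRANS_UNKNOWN,
	PARALLEL_TRANS_STARTED,
	PARALLEL_TRANS_FINISHED
} ParallelTransState;
```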

~~~

7.

+ ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,
+    winfo);

Unnecessary wrapping

~~~

8.

+ /*
+ * Resend the pending message to parallel apply worker to cleanup the
+ * queue. Note that parallel apply worker will just ignore this message
+ * as it has already handled this message while applying spooled
+ * messages.
+ */
+ result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
+ winfo->pending_msg, false, true);

If I understand this logic it seems a bit hacky. From the comment, it
seems you are resending a message that you know/expect to be ignored
simply to make it disappear. (??). Isn't there some other way to clear
the pending message without requiring a bogus send?

~~~

9. parallel_apply_spooled_messages

+
+static void
+parallel_apply_spooled_messages(void)

Missing function comment

~~~

10.

+parallel_apply_spooled_messages(void)
+{
+ bool fileset_valid = false;
+
+ /*
+ * Check if changes has been serialized to disk. if so, read and
+ * apply them.
+ */
+ SpinLockAcquire(&MyParallelShared->mutex);
+ fileset_valid = MyParallelShared->fileset_valid;
+ SpinLockRelease(&MyParallelShared->mutex);

The variable assignment in the declaration seems unnecessary.

~~~

11.

+ /*
+ * Check if changes has been serialized to disk. if so, read and
+ * apply them.
+ */
+ SpinLockAcquire(&MyParallelShared->mutex);
+ fileset_valid = MyParallelShared->fileset_valid;
+ SpinLockRelease(&MyParallelShared->mutex);

"has been" -> "have been"

~~~

12.

+ apply_spooled_messages(&MyParallelShared->fileset,
+    MyParallelShared->xid,
+    InvalidXLogRecPtr);
+ parallel_apply_set_fileset(MyParallelShared, false);

parallel_apply_set_fileset() is a confusing function name. IMO this
logic would be better split into 2 smaller functions:
- parallel_apply_set_fileset_valid()
- parallel_apply_set_fileset_invalid()
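
e.g. a rough sketch of that split (illustration only, showing just the
flag handling):

```
/* Hypothetical sketch of the suggested split. */
void
parallel_apply_set_fileset_valid(ParallelApplyWorkerShared *wshared)
{
	SpinLockAcquire(&wshared->mutex);
	wshared->fileset_valid = true;
	SpinLockRelease(&wshared->mutex);
}

void
parallel_apply_set_fileset_invalid(ParallelApplyWorkerShared *wshared)
{
	SpinLockAcquire(&wshared->mutex);
	wshared->fileset_valid = false;
	SpinLockRelease(&wshared->mutex);
}
```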

~~~

13. parallel_apply_get_unique_id

+/*
+ * Returns the unique id among all parallel apply workers in the subscriber.
+ */
+static uint16
+parallel_apply_get_unique_id()

The meaning of that comment and the purpose of this function are not
entirely clear... e.g. I had to read the code to figure out what the
comment is describing.

~~~

14.

The function seems to be written in some way that scans all known ids
looking for one that does not match. I wonder if it might be easier to
just assign some auto-incrementing static instead of having to scan
for uniqueness always. Since the pool of apply workers is limited, is
that kind of ID ever going to come close to running out?

Alternatively, see also comment #2 for a different way to know what
lockids are present.

~~~

15.

winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();

It somehow feels clunky to be calling
parallel_apply_get_unique_id() like this to scan all the same things 2
times. If you are going to keep this scanning logic then at least the
function should be changed to return a PAIR of lock-ids so you only
need to do 1x scan instead of 2x scan.

~~~

16. parallel_apply_send_data

+/*
+ * Send the data to the specified parallel apply worker via
shared-memory queue.
+ */
+void
+parallel_apply_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
+ const void *data)

The function comment needs more detail to explain the purpose of the
thresholds and how they work.

~~~

17. parallel_apply_wait_for_xact_finish

+/*
+ * Wait until the parallel apply worker's transaction finishes.
+ */
+void
+parallel_apply_wait_for_xact_finish(ParallelApplyWorkerShared *wshared)

I think this comment needs lots more details because the
implementation seems to be doing a lot more than just waiting for the
start to become "finished" - e.g. it seems to be waiting for it to
transition through the other stages as well...

~~~

18.

The boolean flag was changed to enum states so all these comments
mentioning "flag" are stale and need to be reworded/rewritten.

18a.
+ /*
+ * Wait until the parallel apply worker handles the first message and
+ * set the flag to true.
+ */

Update this comment

~

18b.
+ /*
+ * Wait until the flag becomes false in case the lock was released because
+ * of failure while applying.
+ */

Update this comment

~~~

19. parallel_apply_wait_for_in_xact

+/*
+ * Wait until the parallel apply worker's xact_state flag becomes
+ * the same as in_xact.
+ */
+static void
+parallel_apply_wait_for_in_xact(ParallelApplyWorkerShared *wshared,
+ ParallelTransState xact_state)

SUGGESTION
Wait until the parallel apply worker's transaction state becomes the
same as in_xact.

~~~

20.

+ /* Stop if the flag becomes the same as in_xact. */
+ if (parallel_apply_get_xact_state(wshared) >= xact_state)
+ break;

20a.
"flag" -> "transaction state",

~

20b.
This code uses >= comparison which means a strict order of the enum
values is assumed. So this order MUST be documented in the enum
typedef.

~~~

21. parallel_apply_set_xact_state

+/*
+ * Set the xact_state flag for the given parallel apply worker.
+ */
+void
+parallel_apply_set_xact_state(ParallelApplyWorkerShared *wshared,
+   ParallelTransState xact_state)

SUGGESTION
Set an enum indicating the transaction state for the given parallel
apply worker.

~~~

22. parallel_apply_get_xact_state

/*
 * Get the xact_state flag for the given parallel apply worker.
 */
static ParallelTransState
parallel_apply_get_xact_state(ParallelApplyWorkerShared *wshared)

SUGGESTION
Get an enum indicating the transaction state for the given parallel
apply worker.

~~~

23. parallel_apply_set_fileset


+/*
+ * Set the fileset_valid flag and fileset for the given parallel apply worker.
+ */
+void
+parallel_apply_set_fileset(ParallelApplyWorkerShared *wshared, bool
fileset_valid)

As mentioned elsewhere (#12 above) I think it would be better to split
this into 2 functions.

~~~

24. parallel_apply_lock/unlock

24a.
+/* Helper function to release a lock with lockid */
SUGGESTION
Helper function to release a lock identified by lockid.

~

24b.
+/* Helper function to take a lock with lockid */
SUGGESTION
Helper function to acquire a lock identified by lockid.

~

24c.
+/* Helper function to release a lock with lockid */
+void
+parallel_apply_lock(uint16 lockid)
...
+/* Helper function to take a lock with lockid */
+void
+parallel_apply_unlock(uint16 lockid)

Aren't those function comments around the wrong way?


======

src/backend/replication/logical/worker.c

25. File header comment

+ * The dynamic shared memory segment will contain (a) a shm_mq that can be used
+ * to send changes in the transaction from leader apply worker to parallel
+ * apply worker (b) another shm_mq that can be used to send errors (and other
+ * messages reported via elog/ereport) from the parallel apply worker to leader
+ * apply worker (c) necessary information to be shared among parallel apply
+ * workers and leader apply worker (i.e. the member in
+ * ParallelApplyWorkerShared).

"the member in ParallelApplyWorkerShared" -> "the members of
ParallelApplyWorkerShared"

~~~

26.

Shouldn't that comment have something to say about the
deadlock-detection design?

~~~

27. TransApplyAction

+typedef enum
 {
- LogicalRepMsgType command; /* 0 if invalid */
- LogicalRepRelMapEntry *rel;
-
- /* Remote node information */
- int remote_attnum; /* -1 if invalid */
- TransactionId remote_xid;
- XLogRecPtr finish_lsn;
- char    *origin_name;
-} ApplyErrorCallbackArg;
-
-static ApplyErrorCallbackArg apply_error_callback_arg =
+ /* The action for non-streaming transactions. */
+ TRANS_LEADER_APPLY,
+
+ /* Actions for streaming transactions. */
+ TRANS_LEADER_SERIALIZE,
+ TRANS_LEADER_PARTIAL_SERIALIZE,
+ TRANS_LEADER_SEND_TO_PARALLEL,
+ TRANS_PARALLEL_APPLY
+} TransApplyAction;

27a.
A new enum TRANS_LEADER_PARTIAL_SERIALIZE was added, but the
explanatory comment for it is missing

~

27b.
In fact, this new TRANS_LEADER_PARTIAL_SERIALIZE is used in many
places with no comments to explain what it is for.

~~~

28. handle_streamed_transaction

 static bool
 handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
 {
- TransactionId xid;
+ TransactionId current_xid;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+ StringInfoData origin_msg;
+
+ apply_action = get_transaction_apply_action(stream_xid, &winfo);

  /* not in streaming mode */
- if (!in_streamed_transaction)
+ if (apply_action == TRANS_LEADER_APPLY)
  return false;

- Assert(stream_fd != NULL);
  Assert(TransactionIdIsValid(stream_xid));

+ origin_msg = *s;

28a.
There are no comments explaining what this
TRANS_LEADER_PARTIAL_SERIALIZE is doing. So I cannot tell if
'origin_msg' is a meaningful name, or does that mean to say
'original_msg'?

~

28b.
Why not assign it at the declaration, the same as
apply_handle_stream_prepare does?

~~~

29. apply_handle_stream_prepare

+ case TRANS_LEADER_PARTIAL_SERIALIZE:

Seems like there is a missing explanation of what this partial
serialize logic is doing.

~~~

30.

+ case TRANS_PARALLEL_APPLY:
+ parallel_apply_replorigin_setup();
+
+ /* Unlock all the shared object lock at transaction end. */
+ parallel_apply_unlock(MyParallelShared->stream_lock_id);
+
+ if (stream_fd)
+ BufFileClose(stream_fd);

There should be some explanatory comment on what's going on here with
stream_fd. E.g. how does it get to be non-NULL, and why do you not set
it back to NULL after the BufFileClose?

~~~

31.

 /*
+ * Handle STREAM START message when the transaction was spilled to disk.
+ *
+ * Inintialize fileset if not yet and open the file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)

Typo "Inintialize" -> "Initialize"

Looks like missing words in the comment.

SUGGESTION
Initialize fileset (if not already done), and open the file.

~~~


32. apply_handle_stream_start

- if (in_streamed_transaction)
+ if (!switching_to_serialize && in_streamed_transaction)
  ereport(ERROR,
  (errcode(ERRCODE_PROTOCOL_VIOLATION),
  errmsg_internal("duplicate STREAM START message")));

Somehow, I think this condition seems more natural if written the
other way around:

SUGGESTION
if (in_streamed_transaction && !switching_to_serialize)

~~~

33.

+ /*
+ * Increment the number of message waiting to be processed by
+ * parallel apply worker.
+ */
+ pg_atomic_add_fetch_u32(&(winfo->shared->left_message), 1);

33a.
"of message" -> "of messages".

~

33b.
The extra &() parens are not useful.

This same syntax is repeated in all the calls to that atomic function
so please search/fix all the others too...

~

33c.
The member name 'left_message' seems not a very good name. How about
'pending_message_count' or 'n_unprocessed_messages' or
'n_messages_remaining' or anything else more informative?

~~~

34. apply_handle_stream_abort

+static void
+apply_handle_stream_abort(StringInfo s)
+{
+ TransactionId xid;
+ TransactionId subxid;
+ LogicalRepStreamAbortData abort_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+ StringInfoData origin_msg = *s;

I'm unsure about that 'origin_msg' variable. Should that be called
'original_msg'?

~~~

35.

+ if (subxid == xid)

There are multiple parts of this logic that are doing (subxid == xid),
so it might be better to assign that to a meaningful variable name
instead of the repeated comparisons.

36.

+ * The file was deleted if aborted the whole transaction, so
+ * create it again in this case.

English? Missing words?

~~~

37.

+ /*
+ * Increment the number of message waiting to be processed by
+ * parallel apply worker.
+ */

"message" -> "messages"

~~~

38.

+ /*
+ * If there is no message left, wait for the leader to release the
+ * lock and send more messages.
+ */
+ if (xid != subxid &&
+ pg_atomic_sub_fetch_u32(&(MyParallelShared->left_message), 1) == 0)
+ parallel_apply_lock(MyParallelShared->stream_lock_id);

The comment says "wait for the leader"... but the comment seems
misleading - there is no waiting happening here.

~~~

39. apply_spooled_messages

+
 /*
  * Common spoolfile processing.
  */
-static void
-apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
+void
+apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
+    XLogRecPtr lsn)

Spurious extra blank line above this function.

~~~

40.

- fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path, O_RDONLY,
+ fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY,
  false);

Unnecessary wrapping.

~~~

41.

+ fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY,
  false);
+ stream_fd = fd;

Is it still meaningful to have the local 'fd' variable? Might as well
just use 'stream_fd' instead now, right?

~~~

42.

+ /*
+ * Break the loop if parallel apply worker have finished applying the
+ * transaction. The parallel apply worker should have close the file
+ * before committing.
+ */

English?

"if parallel" -> "if the parallel"

"have finished" -> "has finished"

"should have close" -> "should have closed"

~~~

43. apply_handle_stream_commit

  LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+ StringInfoData origin_msg = *s

I'm unsure about that 'origin_msg' variable. Should that be called
'original_msg' ?

~~~


44. stream_write_message

+ * stream_write_message
+ *   Serialize the message that are not in a streaming block to a file.
+ */
+static void
+stream_write_message(TransactionId xid, char action, StringInfo s,
+ bool create_file)


44a.
This logic seems new, but the function comment sounds strange
(English/typos?) and it is not giving enough detail about when this
file is written, and for what purpose we are writing to it.

~

44b.
If this is always written to a file, then wouldn't a better function
name be something including the word "serialize" - e.g.
serialize_message()?


======

src/backend/replication/logical/launcher.c

45. logicalrep_worker_onexit

+ /*
+ * Release all the session level lock that could be held in parallel apply
+ * mode.
+ */
+ LockReleaseAll(DEFAULT_LOCKMETHOD, true);

"the session level lock" -> "session level locks"

======

src/include/replication/worker_internal.h

46. ParallelApplyWorkerShared

+ /*
+ * Flag used to ensure commit ordering.
+ *
+ * The parallel apply worker will set it to false after handling the
+ * transaction finish commands while the apply leader will wait for it to
+ * become false before proceeding in transaction finish commands (e.g.
+ * STREAM_COMMIT/STREAM_ABORT/STREAM_PREPARE).
+ */
+ ParallelTransState xact_state;

The comment has gone stale because this member is not a boolean flag
anymore, so saying "will set it to false" is wrong...

~~~

47.

+ /* Unique identifiers in the current subscription that used to lock. */
+ uint16 stream_lock_id;
+ uint16 transaction_lock_id;

Comment English?

~~~

48.

+ pg_atomic_uint32 left_message;

Needs explanatory comment.

~~~

49.

+ /* Whether there is partially sent message left in the queue. */
+ bool partial_sent_message;

Comment English?

~~~

50.

+ /*
+ * Don't use SharedFileSet here because the fileset is shared by the leader
+ * worker and the fileset in leader need to survive after releasing the
+ * shared memory so that the leader can re-use the fileset for next
+ * streaming transaction.
+ */
+ bool fileset_valid;
+ FileSet fileset;

The comment here seems to need some more work because it is saying
more about what it *isn't*, rather than what it *is*.

Something like:

The 'fileset' is used for....
The 'fileset' is only valid to use when the accompanying fileset_valid
flag is true...
NOTE - We cannot use a SharedFileSet here because....

Also, fix typos "need to survive" -> "needs to survive".

Also, it may be better to refer to the "leader apply worker" by its
full name instead of just "leader".

~~~

51. typedef struct ParallelApplyWorkerInfo

+ bool serialize_changes;

Needs explanatory comment.

~~

52.

+ /*
+ * Used to save the message that was only partially sent to parallel apply
+ * worker.
+ */
+ char *pending_msg;


Some information seems missing because this comment does not have
enough detail to know what it means - e.g. what is a partially sent
message?


------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Nov 3, 2022 at 10:06 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > > > >
> > > > > > About your point that having different partition structures for
> > > > > > publisher and subscriber, I don't know how common it will be once we
> > > > > > have DDL replication. Also, the default value of
> > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > that this is a quite common case.
> > > > >
> > > > > So how can we consider these concurrent issues that could happen only
> > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > the problem or can we have a safeguard against these conflicts?
> > > > >
> > > >
> > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > cases as you can see in *0003* patch.
> > >
> > > Tightening the restrictions could work in some cases but there might
> > > still be corner cases and it could reduce the usability. I'm not really
> > > sure that we can ensure such a deadlock won't happen with the current
> > > restrictions. I think we need something safeguard just in case. For
> > > example, if the leader apply worker is waiting for a lock acquired by
> > > its parallel worker, it cancels the parallel worker's transaction,
> > > commits its transaction, and restarts logical replication. Or the
> > > leader can log the deadlock to let the user know.
> > >
> >
> > As another direction, we could make the parallel apply feature robust
> > if we can detect deadlocks that happen among the leader worker and
> > parallel workers. I'd like to summarize the idea discussed off-list
> > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > that when the leader worker or parallel worker needs to wait for
> > something (eg. transaction completion, messages) we use lmgr
> > functionality so that we can create wait-for edges and detect
> > deadlocks in lmgr.
> >
> > For example, a scenario where a deadlock occurs is the following:
> >
> > [Publisher]
> > create table tab1(a int);
> > create publication pub for table tab1;
> >
> > [Subscriber]
> > create table tab1(a int primary key);
> > create subscription sub connection 'port=10000 dbname=postgres'
> > publication pub with (streaming = parallel);
> >
> > TX1:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> >     Tx2:
> >     BEGIN;
> >     INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> >     COMMIT;
> > COMMIT;
> >
> > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > leader apply worker (LA) is executing TX-2 concurrently on the
> > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > deadlock between PA-1 and LA but lmgr cannot detect it.
> >
> > One idea to resolve this issue is that we have LA acquire a session
> > lock on a shared object (by LockSharedObjectForSession()) and have
> > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > acquires the lock before sending STREAM_STOP and releases it if
> > already acquired before sending STREAM_START, STREAM_PREPARE and
> > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > processing STREAM_STOP and then release immediately after acquiring
> > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> >
> > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > object) -> LA
> >
> > We would need the shared objects per parallel apply worker.
> >
> > After detecting a deadlock, we can restart logical replication with
> > temporarily disabling the parallel apply, which is done by 0005 patch.
> >
> > Another scenario is similar to the previous case but TX-1 and TX-2 are
> > executed by two parallel apply workers (PA-1 and PA-2 respectively).
> > In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> > while PA-1 is waiting for subsequent input from LA. Also, LA is
> > waiting for PA-2 to complete its transaction in order to preserve the
> > commit order. There is a deadlock among three processes but it cannot
> > be detected in lmgr because the fact that LA is waiting for PA-2 to
> > complete its transaction doesn't appear in lmgr (see
> > parallel_apply_wait_for_xact_finish()). To fix it, we can use
> > XactLockTableWait() instead.
> >
> > However, since XactLockTableWait() considers PREPARED TRANSACTION as
> > still in progress, probably we need a similar trick as above in case
> > where a transaction is prepared. For example, suppose that TX-2 was
> > prepared instead of committed in the above scenario, PA-2 acquires
> > another shared lock at START_STREAM and releases it at
> > STREAM_COMMIT/PREPARE. LA can wait on the lock.
> >
> > Yet another scenario where LA has to wait is the case where the shm_mq
> > buffer is full. In the above scenario (ie. PA-1 and PA-2 are executing
> > transactions concurrently), if  the shm_mq buffer between LA and PA-2
> > is full, LA has to wait to send messages, and this wait doesn't appear
> > in lmgr. To fix it, probably we have to use non-blocking write and
> > wait with a timeout. If timeout is exceeded, the LA will write to file
> > and indicate PA-2 that it needs to read file for remaining messages.
> > Then LA will start waiting for commit which will detect deadlock if
> > any.
> >
> > If we can detect deadlocks by having such a functionality or some
> > other way then we don't need to tighten the restrictions of subscribed
> > tables' schemas etc.
>
> Thanks for the analysis and summary !
>
> I tried to implement the above idea and here is the patch set. I have done some
> basic tests for the new code and it works fine.

Thank you for updating the patches!

Here are comments on v42-0001:

We have the following three similar name functions regarding to
starting a new parallel apply worker:

parallel_apply_start_worker()
parallel_apply_setup_worker()
parallel_apply_setup_dsm()

It seems to me that we can somewhat merge them since
parallel_apply_setup_worker() and parallel_apply_setup_dsm() have only
one caller.

---
+/*
+ * Extract the streaming mode value from a DefElem.  This is like
+ * defGetBoolean() but also accepts the special value of "parallel".
+ */
+char
+defGetStreamingMode(DefElem *def)

It's a bit unnatural to have this function in define.c since the other
functions in this file are for primitive data types. How about having it
in subscription.c?

---
         /*
          * Exit if any parameter that affects the remote connection
was changed.
-         * The launcher will start a new worker.
+         * The launcher will start a new worker, but note that the
parallel apply
+         * worker may or may not restart depending on the value of
the streaming
+         * option and whether there will be a streaming transaction.

In which case does the parallel apply worker not restart even if the
streaming option has been changed?

---
I think we should explain somewhere the idea of using locks for
synchronization between leader and worker. Maybe we can do that with a
sample workload in a new README file?

---
in parallel_apply_send_data():

+                result = shm_mq_send(winfo->mq_handle, nbytes, data,
true, true);
+
+                if (result == SHM_MQ_SUCCESS)
+                        break;
+                else if (result == SHM_MQ_DETACHED)
+                        ereport(ERROR,
+
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                         errmsg("could not send data
to shared-memory queue")))
+
+                Assert(result == SHM_MQ_WOULD_BLOCK);
+
+                if (++retry >= CHANGES_THRESHOLD)
+                {
+                        MemoryContext oldcontext;
+                        StringInfoData msg;
+                        TimestampTz now = GetCurrentTimestamp();
+
+                        if (startTime == 0)
+                                startTime = now;
+
+                        if (!TimestampDifferenceExceeds(startTime,
now, SHM_SEND_TIMEOUT_MS))
+                                continue;

IIUC since the parallel worker retries to send data without waiting,
the 'retry' count will exceed CHANGES_THRESHOLD in a very short time.
But the worker waits at least SHM_SEND_TIMEOUT_MS before spooling data,
regardless of the 'retry' count. Don't we need to nap somewhat, and why
do we need CHANGES_THRESHOLD?

---
+/*
+ * Wait until the parallel apply worker's xact_state flag becomes
+ * the same as in_xact.
+ */
+static void
+parallel_apply_wait_for_in_xact(ParallelApplyWorkerShared *wshared,
+
ParallelTransState xact_state)
+{
+        for (;;)
+        {
+                /* Stop if the flag becomes the same as in_xact. */

What do you mean by 'in_xact' here?

---
I got the error "ERROR:  invalid logical replication message type ""
with the following scenario:

1. Stop the PA by sending SIGSTOP signal.
2. Stream a large transaction so that the LA spools changes to the file for PA.
3. Resume the PA by sending SIGCONT signal.
4. Stream another large transaction.

---
* On publisher (with logical_decoding_work_mem = 64kB)
begin;
insert into t select generate_series(1, 1000);
rollback;
begin;
insert into t select generate_series(1, 1000);
rollback;

I got the following error:

ERROR:  hash table corrupted
CONTEXT:  processing remote data for replication origin "pg_16393"
during message type "STREAM START" in transaction 734

---
IIUC the changes for worker.c in 0001 patch includes both changes:

1. apply worker takes action based on the apply_action returned by
get_transaction_apply_action() per message (or streamed chunk).
2. apply worker supports handling parallel apply workers.

It seems to me that (1) is rather a refactoring patch, so probably we
can do that in a separate patch so that we can make the patches
smaller.

---
postgres(1:2831190)=# \dRs+ test_sub1
List of subscriptions
-[ RECORD 1 ]------+--------------------------
Name               | test_sub1
Owner              | masahiko
Enabled            | t
Publication        | {test_pub1}
Binary             | f
Streaming          | p
Two-phase commit   | d
Disable on error   | f
Origin             | any
Synchronous commit | off
Conninfo           | port=5551 dbname=postgres
Skip LSN           | 0/0

It's better to show 'on', 'off' or 'streaming' rather than one character.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

The following are my comments. I want to consider the patch further, but I am sending them now.

===
worker.c

01. typedef enum TransApplyAction

```
/*
 * What action to take for the transaction.
 *
 * TRANS_LEADER_APPLY means that we are in the leader apply worker and changes
 * of the transaction are applied directly in the worker.
 *
 * TRANS_LEADER_SERIALIZE means that we are in the leader apply worker or table
 * sync worker. Changes are written to temporary files and then applied when
 * the final commit arrives.
 *
 * TRANS_LEADER_SEND_TO_PARALLEL means that we are in the leader apply worker
 * and need to send the changes to the parallel apply worker.
 *
 * TRANS_PARALLEL_APPLY means that we are in the parallel apply worker and
 * changes of the transaction are applied directly in the worker.
 */
```

TRANS_LEADER_PARTIAL_SERIALIZE should be listed here as well.

02. handle_streamed_transaction()

```
+       StringInfoData  origin_msg;
...
+       origin_msg = *s;
...
+                               /* Write the change to the current file */
+                               stream_write_change(action,
+                                                                       apply_action == TRANS_LEADER_SERIALIZE ?
+                                                                       s : &origin_msg);
```

I'm not sure why origin_msg is needed. Can we remove the conditional operator?


03. apply_handle_stream_start()

```
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether any optimization is worthwhile
+ * because these messages are sent only when the logical_decoding_work_mem
+ * threshold is exceeded.
```

This comment should be modified because PA must acquire and release locks at that time.


04. apply_handle_stream_prepare()

```
+                       /*
+                        * After sending the data to the parallel apply worker, wait for
+                        * that worker to finish. This is necessary to maintain commit
+                        * order which avoids failures due to transaction dependencies and
+                        * deadlocks.
+                        */
+                       parallel_apply_wait_for_xact_finish(winfo->shared);
```

This does not seem correct here: LA may not send data but instead spill changes to a file.

05. apply_handle_stream_commit()

```
+                       if (apply_action == TRANS_LEADER_PARTIAL_SERIALIZE)
+                               stream_cleanup_files(MyLogicalRepWorker->subid, xid);
```

I'm not sure whether the stream files should be removed by LA or PAs. Could you tell me the reason why you chose LA?

===
applyparallelworker.c

05. parallel_apply_can_start()

```
+       if (switching_to_serialize)
+               return false;
```

Could you add a comment like:
Don't start a new parallel apply worker if the leader apply worker has been spilling changes to the disk temporarily.

06. parallel_apply_start_worker()

```
+       /*
+        * Set the xact_state flag in the leader instead of the
+        * parallel apply worker to avoid the race condition where the leader has
+        * already started waiting for the parallel apply worker to finish
+        * processing the transaction while the child process has not yet
+        * processed the first STREAM_START and has not set the
+        * xact_state to true.
+        */
```

I think the word "flag" should only be used for booleans, so the comment should be modified.
(There are many such code comments; all of them should be modified.)


07. parallel_apply_get_unique_id()

```
+/*
+ * Returns the unique id among all parallel apply workers in the subscriber.
+ */
+static uint16
+parallel_apply_get_unique_id()
```

I think this function is inefficient: the computational complexity will increase linearly as the number of PAs
increases. I think the Bitmapset data structure could be used.
 
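Just to illustrate the idea (a rough sketch only, not actual code), the ids in use could be tracked in a static Bitmapset so that the membership check is cheap:

```
/* Hypothetical sketch: track ids currently in use in a Bitmapset. */
static Bitmapset *used_lock_ids = NULL;

static uint16
parallel_apply_get_unique_id(void)
{
	int		id = 1;

	/* Find the lowest id not yet in use. */
	while (bms_is_member(id, used_lock_ids))
		id++;

	used_lock_ids = bms_add_member(used_lock_ids, id);

	/* bms_del_member() would be used when the worker is freed. */
	return (uint16) id;
}
```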

08. parallel_apply_send_data()

```
#define CHANGES_THRESHOLD    1000
#define SHM_SEND_TIMEOUT_MS    10000
```

I think the timeout may be too long. Could you tell me the background behind this value?


09. parallel_apply_send_data()

```
            /*
             * Close the stream file if not in a streaming block, the file will
             * be reopened later.
             */
            if (!stream_apply_worker)
                serialize_stream_stop(winfo->shared->xid);
```

a.
IIUC the timings when LA tries to send data but stream_apply_worker is NULL are:
* apply_handle_stream_prepare, 
* apply_handle_stream_start, 
* apply_handle_stream_abort, and
* apply_handle_stream_commit.
And at that time the state of TransApplyAction may be TRANS_LEADER_SEND_TO_PARALLEL. When should the file be closed?

b.
Even if this is needed, I think the name of the called function should be modified, since here LA may not be handling a
STREAM_STOP message. close_stream_file() or something?
 


10. parallel_apply_send_data()

```
            /* Initialize the stream fileset. */
            serialize_stream_start(winfo->shared->xid, true);
```

I think the name of the called function should be modified, since here LA may not be handling a STREAM_START message.
open_stream_file() or something?
 

11. parallel_apply_send_data()

```
        if (++retry >= CHANGES_THRESHOLD)
        {
            MemoryContext oldcontext;
            StringInfoData msg;
...
            initStringInfo(&msg);
            appendBinaryStringInfo(&msg, data, nbytes);
...
            switching_to_serialize = true;
            apply_dispatch(&msg);
            switching_to_serialize = false;

            break;
        }
```

pfree(msg.data) may be needed.

===
12. worker_internal.h

```
+       pg_atomic_uint32        left_message;
```


ParallelApplyWorkerShared is already protected by a mutex lock.  Why did you add an atomic variable to the data
structure?

===
13. typedefs.list

ParallelTransState should be added.

===
14. General

I have already mentioned this directly, but I point it out here to notify the other members as well.
I caused a deadlock with two PAs. It was indeed resolved by the lmgr, but the output did not seem very helpful.
The following was copied from the log, and we can see that the commands executed by the apply workers are not shown.
Can we extend this, or is it out of scope?
 


```
2022-11-07 11:11:27.449 UTC [11262] ERROR:  deadlock detected
2022-11-07 11:11:27.449 UTC [11262] DETAIL:  Process 11262 waits for AccessExclusiveLock on object 16393 of class 6100 of database 0; blocked by process 11320.
        Process 11320 waits for ShareLock on transaction 742; blocked by process 11266.
        Process 11266 waits for AccessShareLock on object 16393 of class 6100 of database 0; blocked by process 11262.
        Process 11262: <command string not enabled>
        Process 11320: <command string not enabled>
        Process 11266: <command string not enabled>
```


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, November 4, 2022 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Nov 4, 2022 at 1:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Thanks for the analysis and summary !
> > >
> > > I tried to implement the above idea and here is the patch set.
> > >
> >
> > Few comments on v42-0001
> > ===========================
> >

Thanks for the comments.

> Few more comments on v42-0001
> ===============================
> 1. In parallel_apply_send_data(), it seems winfo->serialize_changes
> and switching_to_serialize are set to indicate that we have changed
> parallel to serialize mode. Isn't using just the
> switching_to_serialize sufficient? Also, it would be better to name
> switching_to_serialize as parallel_to_serialize or something like
> that.

I slightly changed the logic to serialize the message directly on timeout
instead of invoking apply_dispatch again, so that we don't need
switching_to_serialize.

> 
> 2. In parallel_apply_send_data(), the patch has already initialized
> the fileset, and then again in apply_handle_stream_start(), it will do
> the same if we fail while sending stream_start message to the parallel
> worker. It seems we don't need to initialize fileset again for
> TRANS_LEADER_PARTIAL_SERIALIZE state in apply_handle_stream_start()
> unless I am missing something.

Fixed.

> 3.
> apply_handle_stream_start(StringInfo s)
> {
> ...
> + if (!first_segment)
> + {
> + /*
> + * Unlock the shared object lock so that parallel apply worker
> + * can continue to receive and apply changes.
> + */
> + parallel_apply_unlock(winfo->shared->stream_lock_id);
> ...
> }
> 
> Can we have an assert before this unlock call that the lock must be
> held? Similarly, if there are other places then we can have assert
> there as well.

It seems we don't have a standard API that can be used outside a transaction.
Maybe we can use the ParallelApplyLockids list to check that?
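
For example, something like the following (illustration only, assuming
ParallelApplyLockids is maintained as a list of the lock ids currently
held):

```
/* Hypothetical sketch: assert that the stream lock is currently held. */
Assert(list_member_int(ParallelApplyLockids, winfo->shared->stream_lock_id));
```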

> 4. It is not very clear to me how maintaining ParallelApplyLockids
> list is helpful.

I will think about this and remove the list in the next version if possible.

> 
> 5.
> /*
> + * Handle STREAM START message when the transaction was spilled to disk.
> + *
> + * Inintialize fileset if not yet and open the file.
> + */
> +void
> +serialize_stream_start(TransactionId xid, bool first_segment)
> +{
> + /*
> + * Start a transaction on stream start,
> 
> This function's name and comments seem to indicate that it is to
> handle stream_start message. Is that really the case? It is being
> called from parallel_apply_send_data() which made me think it can be
> used from other places as well.

Adjusted the comment.

Here is the new version patch set which addresses the comments received as of last Friday.
I also added some comments for the newly introduced code in this version.

And thanks a lot for the comments that Sawada-san, Peter and Kuroda-san posted today.
I will handle them in the next version soon.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
> Fair point. I think if the user wants, she can join with
> pg_stat_subscription based on PID and find the corresponding
> subscription. However, if we want to identify everything via pg_locks
> then I think we should also mention classid or database id as field1.
> So, it would look like: field1: (pg_subscription's oid or current db
> id); field2: OID of subscription in pg_subscription; field3: local or
> remote xid; field4: 0/1 to differentiate between remote and local xid.

Sorry, I missed the discussion related to LOCKTAG.
+1 for adding a new tag like LOCKTAG_PARALLEL_APPLY, and
I prefer that field1 be the dbid because it is more useful for reporting a lock in DescribeLockTag().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 7, 2022 9:19 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Friday, November 4, 2022 7:45 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Fri, Nov 4, 2022 at 1:36 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Thu, Nov 3, 2022 at 6:36 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thanks for the analysis and summary !
> > > >
> > > > I tried to implement the above idea and here is the patch set.
> > > >
> > >
> > > Few comments on v42-0001
> > > ===========================
> > >
> 
> Thanks for the comments.
> 
> > Few more comments on v42-0001
> > ===============================
> > 1. In parallel_apply_send_data(), it seems winfo->serialize_changes
> > and switching_to_serialize are set to indicate that we have changed
> > parallel to serialize mode. Isn't using just the
> > switching_to_serialize sufficient? Also, it would be better to name
> > switching_to_serialize as parallel_to_serialize or something like
> > that.
> 
> I slightly change the logic to let serialize the message directly when timeout
> instead of invoking apply_dispatch again so that we don't need the
> switching_to_serialize.
> 
> >
> > 2. In parallel_apply_send_data(), the patch has already initialized
> > the fileset, and then again in apply_handle_stream_start(), it will do
> > the same if we fail while sending stream_start message to the parallel
> > worker. It seems we don't need to initialize fileset again for
> > TRANS_LEADER_PARTIAL_SERIALIZE state in apply_handle_stream_start()
> > unless I am missing something.
> 
> Fixed.
> 
> > 3.
> > apply_handle_stream_start(StringInfo s) { ...
> > + if (!first_segment)
> > + {
> > + /*
> > + * Unlock the shared object lock so that parallel apply worker
> > + * can continue to receive and apply changes.
> > + */
> > + parallel_apply_unlock(winfo->shared->stream_lock_id);
> > ...
> > }
> >
> > Can we have an assert before this unlock call that the lock must be
> > held? Similarly, if there are other places then we can have assert
> > there as well.
> 
> It seems we don't have a standard API can be used without a transaction.
> Maybe we can use the list ParallelApplyLockids to check that ?
> 
> > 4. It is not very clear to me how maintaining ParallelApplyLockids
> > list is helpful.
> 
> I will think about this and remove this in next version list if possible.
> 
> >
> > 5.
> > /*
> > + * Handle STREAM START message when the transaction was spilled to disk.
> > + *
> > + * Inintialize fileset if not yet and open the file.
> > + */
> > +void
> > +serialize_stream_start(TransactionId xid, bool first_segment) {
> > + /*
> > + * Start a transaction on stream start,
> >
> > This function's name and comments seem to indicate that it is to
> > handle stream_start message. Is that really the case? It is being
> > called from parallel_apply_send_data() which made me think it can be
> > used from other places as well.
> 
> Adjusted the comment.
> 
> Here is the new version patch set which addressed comments as of last Friday.
> I also added some comments for the newly introduced codes in this version.
>

Sorry, I posted the wrong patch for V43, which lacked some changes.
Attached is the correct patch set.

Best regards,
Hou zj

Attachment
On Mon, Nov 7, 2022 at 6:49 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, November 4, 2022 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 3.
> > apply_handle_stream_start(StringInfo s)
> > {
> > ...
> > + if (!first_segment)
> > + {
> > + /*
> > + * Unlock the shared object lock so that parallel apply worker
> > + * can continue to receive and apply changes.
> > + */
> > + parallel_apply_unlock(winfo->shared->stream_lock_id);
> > ...
> > }
> >
> > Can we have an assert before this unlock call that the lock must be
> > held? Similarly, if there are other places then we can have assert
> > there as well.
>
> It seems we don't have a standard API can be used without a transaction.
> Maybe we can use the list ParallelApplyLockids to check that ?
>

Yeah, that occurred to me as well but I am not sure if it is a good
idea to maintain this list just for assertion but if it turns out that
we need to maintain it for a different purpose then we can probably
use it for assert as well.

Few other comments/questions:
=========================
1.
apply_handle_stream_start(StringInfo s)
{
...

+ case TRANS_PARALLEL_APPLY:
...
...
+ /*
+ * Unlock the shared object lock so that the leader apply worker
+ * can continue to send changes.
+ */
+ parallel_apply_unlock(MyParallelShared->stream_lock_id, AccessShareLock);

As per the design in the email [1], this lock needs to be released by
the leader worker during stream start which means it should be
released under the state TRANS_LEADER_SEND_TO_PARALLEL. From the
comments as well, it is not clear to me why at this time leader is
supposed to be blocked. Is there a reason for doing differently than
what is proposed in the original design?

2. Similar to above, it is not clear why the parallel worker needs to
release the stream_lock_id lock at stream_commit and stream_prepare?

3. Am I understanding correctly that you need to lock/unlock in
apply_handle_stream_abort() for the parallel worker because after
rollback to savepoint, there could be another set of stream or
transaction end commands for which you want to wait? If so, maybe an
additional comment would serve the purpose.

4.
The leader may have sent multiple streaming blocks in the queue
+ * When the child is processing a streaming block. So only try to
+ * lock if there is no message left in the queue.

Let's slightly reword this to: "By the time child is processing the
changes in the current streaming block, the leader may have sent
multiple streaming blocks. So, try to lock only if there is no message
left in the queue."

5.
+parallel_apply_unlock(uint16 lockid, LOCKMODE lockmode)
+{
+ if (!list_member_int(ParallelApplyLockids, lockid))
+ return;
+
+ UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
+ lockid, am_leader_apply_worker() ?
+ AccessExclusiveLock:
+ AccessShareLock);

This function should use lockmode argument passed rather than deciding
based on am_leader_apply_worker. I think this is anyway going to
change if we start using a different locktag as discussed in one of
the above emails.

6.
+
 /*
  * Common spoolfile processing.
  */
-static void
-apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
+void
+apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,

Seems like a spurious line addition.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Hi all,

I have tested the patch set in two cases and want to share the results.

====
Case 1. deadlock caused by leader worker, parallel worker, and backend.

Case 2. deadlock caused by non-immutable trigger
===

It worked well in both cases. PSA the reports of what I did.
I can investigate more if anyone wants further checks.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Mon, Nov 7, 2022 at 1:46 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for v42-0001
...
...
>
> 8.
>
> + /*
> + * Resend the pending message to parallel apply worker to cleanup the
> + * queue. Note that parallel apply worker will just ignore this message
> + * as it has already handled this message while applying spooled
> + * messages.
> + */
> + result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
> + winfo->pending_msg, false, true);
>
> If I understand this logic it seems a bit hacky. From the comment, it
> seems you are resending a message that you know/expect to be ignored
> simply to make it disappear. (??). Isn't there some other way to clear
> the pending message without requiring a bogus send?
>

IIUC, this handling is required for the case when we are not able to
send a message to parallel apply worker and switch to serialize mode
(write remaining data to file). Basically, it is possible that the
message is only partially sent and there is no way to clean the queue. I
feel we can directly free the worker in this case even if there is
space in the worker pool. The other idea could be that we detach from
shm_mq and then invent a way to re-attach it after we try to reuse the
same worker.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 7, 2022 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
> 
> On Thu, Nov 3, 2022 at 10:06 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > >
> > > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > About your point that having different partition structures for
> > > > > > > publisher and subscriber, I don't know how common it will be once
> we
> > > > > > > have DDL replication. Also, the default value of
> > > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > > that this is a quite common case.
> > > > > >
> > > > > > So how can we consider these concurrent issues that could happen
> only
> > > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > > the problem or can we have a safeguard against these conflicts?
> > > > > >
> > > > >
> > > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > > cases as you can see in *0003* patch.
> > > >
> > > > Tightening the restrictions could work in some cases but there might
> > > > still be coner cases and it could reduce the usability. I'm not really
> > > > sure that we can ensure such a deadlock won't happen with the current
> > > > restrictions. I think we need something safeguard just in case. For
> > > > example, if the leader apply worker is waiting for a lock acquired by
> > > > its parallel worker, it cancels the parallel worker's transaction,
> > > > commits its transaction, and restarts logical replication. Or the
> > > > leader can log the deadlock to let the user know.
> > > >
> > >
> > > As another direction, we could make the parallel apply feature robust
> > > if we can detect deadlocks that happen among the leader worker and
> > > parallel workers. I'd like to summarize the idea discussed off-list
> > > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > > that when the leader worker or parallel worker needs to wait for
> > > something (eg. transaction completion, messages) we use lmgr
> > > functionality so that we can create wait-for edges and detect
> > > deadlocks in lmgr.
> > >
> > > For example, a scenario where a deadlock occurs is the following:
> > >
> > > [Publisher]
> > > create table tab1(a int);
> > > create publication pub for table tab1;
> > >
> > > [Subcriber]
> > > creat table tab1(a int primary key);
> > > create subscription sub connection 'port=10000 dbname=postgres'
> > > publication pub with (streaming = parallel);
> > >
> > > TX1:
> > > BEGIN;
> > > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > >     Tx2:
> > >     BEGIN;
> > >     INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); --
> streamed
> > >     COMMIT;
> > > COMMIT;
> > >
> > > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > > leader apply worker (LA) is executing TX-2 concurrently on the
> > > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > > deadlock between PA-1 and LA but lmgr cannot detect it.
> > >
> > > One idea to resolve this issue is that we have LA acquire a session
> > > lock on a shared object (by LockSharedObjectForSession()) and have
> > > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > > acquires the lock before sending STREAM_STOP and releases it if
> > > already acquired before sending STREAM_START, STREAM_PREPARE and
> > > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > > processing STREAM_STOP and then release immediately after acquiring
> > > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> > >
> > > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > > object) -> LA
> > >
> > > We would need the shared objects per parallel apply worker.
> > >
> > > After detecting a deadlock, we can restart logical replication with
> > > temporarily disabling the parallel apply, which is done by 0005 patch.
> > >
> > > Another scenario is similar to the previous case but TX-1 and TX-2 are
> > > executed by two parallel apply workers (PA-1 and PA-2 respectively).
> > > In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> > > while PA-1 is waiting for subsequent input from LA. Also, LA is
> > > waiting for PA-2 to complete its transaction in order to preserve the
> > > commit order. There is a deadlock among three processes but it cannot
> > > be detected in lmgr because the fact that LA is waiting for PA-2 to
> > > complete its transaction doesn't appear in lmgr (see
> > > parallel_apply_wait_for_xact_finish()). To fix it, we can use
> > > XactLockTableWait() instead.
> > >
> > > However, since XactLockTableWait() considers PREPARED TRANSACTION as
> > > still in progress, probably we need a similar trick as above in case
> > > where a transaction is prepared. For example, suppose that TX-2 was
> > > prepared instead of committed in the above scenario, PA-2 acquires
> > > another shared lock at START_STREAM and releases it at
> > > STREAM_COMMIT/PREPARE. LA can wait on the lock.
> > >
> > > Yet another scenario where LA has to wait is the case where the shm_mq
> > > buffer is full. In the above scenario (ie. PA-1 and PA-2 are executing
> > > transactions concurrently), if  the shm_mq buffer between LA and PA-2
> > > is full, LA has to wait to send messages, and this wait doesn't appear
> > > in lmgr. To fix it, probably we have to use non-blocking write and
> > > wait with a timeout. If timeout is exceeded, the LA will write to file
> > > and indicate PA-2 that it needs to read file for remaining messages.
> > > Then LA will start waiting for commit which will detect deadlock if
> > > any.
> > >
> > > If we can detect deadlocks by having such a functionality or some
> > > other way then we don't need to tighten the restrictions of subscribed
> > > tables' schemas etc.
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set. I have done some
> > basic tests for the new codes and it work fine.
> 
> Thank you for updating the patches!
> 
> Here are comments on v42-0001:

Thanks for the comments.

> We have the following three similar name functions regarding to
> starting a new parallel apply worker:
> 
> parallel_apply_start_worker()
> parallel_apply_setup_worker()
> parallel_apply_setup_dsm()
> 
> It seems to me that we can somewhat merge them since
> parallel_apply_setup_worker() and parallel_apply_setup_dsm() have only
> one caller.

Since these functions are doing different tasks (exposed entry point, worker launch, DSM setup), I
personally feel it's OK to keep them split. But if others also feel the split is unnecessary, I will
merge them.

> ---
> +/*
> + * Extract the streaming mode value from a DefElem.  This is like
> + * defGetBoolean() but also accepts the special value of "parallel".
> + */
> +char
> +defGetStreamingMode(DefElem *def)
> 
> It's a bit unnatural to have this function in define.c since other
> functions in this file for primitive data types. How about having it
> in subscription.c?

Changed.

> ---
>          /*
>           * Exit if any parameter that affects the remote connection
> was changed.
> -         * The launcher will start a new worker.
> +         * The launcher will start a new worker, but note that the
> parallel apply
> +         * worker may or may not restart depending on the value of
> the streaming
> +         * option and whether there will be a streaming transaction.
> 
> In which case does the parallel apply worker don't restart even if the
> streaming option has been changed?
> 
> ---
> I think we should explain somewhere the idea of using locks for
> synchronization between leader and worker. Maybe can we do that with
> sample workload in new README file?

Having a README sounds like a good idea. Besides the lock design, we might also need to
move some of the other existing design comments atop worker.c into it, so maybe it's
better to do that as a separate patch? For now, I added comments atop applyparallelworker.c.

> ---
> in parallel_apply_send_data():
> 
> +                result = shm_mq_send(winfo->mq_handle, nbytes, data,
> true, true);
> +
> +                if (result == SHM_MQ_SUCCESS)
> +                        break;
> +                else if (result == SHM_MQ_DETACHED)
> +                        ereport(ERROR,
> +
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                         errmsg("could not send data
> to shared-memory queue")))
> +
> +                Assert(result == SHM_MQ_WOULD_BLOCK);
> +
> +                if (++retry >= CHANGES_THRESHOLD)
> +                {
> +                        MemoryContext oldcontext;
> +                        StringInfoData msg;
> +                        TimestampTz now = GetCurrentTimestamp();
> +
> +                        if (startTime == 0)
> +                                startTime = now;
> +
> +                        if (!TimestampDifferenceExceeds(startTime,
> now, SHM_SEND_TIMEOUT_MS))
> +                                continue;
> 
> IIUC since the parallel worker retries to send data without waits the
> 'retry' will get larger than CHANGES_THRESHOLD in a very short time.
> But the worker waits at least for SHM_SEND_TIMEOUT_MS to spool data
> regardless of 'retry' count. Don't we need to nap somewhat and why do
> we need CHANGES_THRESHOLD?

Oh, I intended to check for the timeout only after continuously retrying XX times, to
reduce the cost of getting the system time and calculating the time difference.
I added some comments about this in the code.
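
To make the intent clearer, this is roughly the shape of the loop I have in mind (a sketch only, not the exact patch code; resetting the retry counter here is illustrative):

```
int			retry = 0;
TimestampTz startTime = 0;

for (;;)
{
	shm_mq_result result;

	result = shm_mq_send(winfo->mq_handle, nbytes, data, true, true);

	if (result == SHM_MQ_SUCCESS)
		break;
	else if (result == SHM_MQ_DETACHED)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("could not send data to shared-memory queue")));

	Assert(result == SHM_MQ_WOULD_BLOCK);

	/*
	 * Look at the clock only once per CHANGES_THRESHOLD failed attempts,
	 * so that GetCurrentTimestamp() is not called for every retry.
	 */
	if (++retry >= CHANGES_THRESHOLD)
	{
		TimestampTz now = GetCurrentTimestamp();

		if (startTime == 0)
			startTime = now;
		else if (TimestampDifferenceExceeds(startTime, now,
											SHM_SEND_TIMEOUT_MS))
			break;				/* give up and serialize the remaining
								 * changes to a file */

		retry = 0;
	}
}
```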

> ---
> +/*
> + * Wait until the parallel apply worker's xact_state flag becomes
> + * the same as in_xact.
> + */
> +static void
> +parallel_apply_wait_for_in_xact(ParallelApplyWorkerShared *wshared,
> +
> ParallelTransState xact_state)
> +{
> +        for (;;)
> +        {
> +                /* Stop if the flag becomes the same as in_xact. */
> 
> What do you mean by 'in_xact' here?

Changed.

> ---
> I got the error "ERROR:  invalid logical replication message type ""
> with the following scenario:
> 
> 1. Stop the PA by sending SIGSTOP signal.
> 2. Stream a large transaction so that the LA spools changes to the file for PA.
> 3. Resume the PA by sending SIGCONT signal.
> 4. Stream another large transaction.
> 
> ---
> * On publisher (with logical_decoding_work_mem = 64kB)
> begin;
> insert into t select generate_series(1, 1000);
> rollback;
> begin;
> insert into t select generate_series(1, 1000);
> rollback;
> 
> I got the following error:
> 
> ERROR:  hash table corrupted
> CONTEXT:  processing remote data for replication origin "pg_16393"
> during message type "STREAM START" in transaction 734

Thanks! I think I have fixed them in the new version.

> ---
> IIUC the changes for worker.c in 0001 patch includes both changes:
> 
> 1. apply worker takes action based on the apply_action returned by
> get_transaction_apply_action() per message (or streamed chunk).
> 2. apply worker supports handling parallel apply workers.
> 
> It seems to me that (1) is a rather refactoring patch, so probably we
> can do that in a separate patch so that we can make the patches
> smaller.

I tried it, but it seems the amount of code for apply_action is quite small,
because we only have two actions (LEADER_APPLY/LEADER_SERIALIZE) on the HEAD branch
and only handle_streamed_transaction uses it. I will think about whether there are other
ways to split the patch.

> ---
> postgres(1:2831190)=# \dRs+ test_sub1
> List of subscriptions
> -[ RECORD 1 ]------+--------------------------
> Name               | test_sub1
> Owner              | masahiko
> Enabled            | t
> Publication        | {test_pub1}
> Binary             | f
> Streaming          | p
> Two-phase commit   | d
> Disable on error   | f
> Origin             | any
> Synchronous commit | off
> Conninfo           | port=5551 dbname=postgres
> Skip LSN           | 0/0
> 
> It's better to show 'on', 'off' or 'streaming' rather than one character.

Changed.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 7, 2022 7:43 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> The followings are my comments. I want to consider the patch more, but I sent
> it once.

Thanks for the comments.

> 
> ===
> worker.c
> 
> 01. typedef enum TransApplyAction
> 
> ```
> /*
>  * What action to take for the transaction.
>  *
>  * TRANS_LEADER_APPLY means that we are in the leader apply worker and
> changes
>  * of the transaction are applied directly in the worker.
>  *
>  * TRANS_LEADER_SERIALIZE means that we are in the leader apply worker or
> table
>  * sync worker. Changes are written to temporary files and then applied when
>  * the final commit arrives.
>  *
>  * TRANS_LEADER_SEND_TO_PARALLEL means that we are in the leader apply
> worker
>  * and need to send the changes to the parallel apply worker.
>  *
>  * TRANS_PARALLEL_APPLY means that we are in the parallel apply worker and
>  * changes of the transaction are applied directly in the worker.
>  */
> ```
> 
> TRANS_LEADER_PARTIAL_SERIALIZE should be listed in.
> 

Added.

> 02. handle_streamed_transaction()
> 
> ```
> +       StringInfoData  origin_msg;
> ...
> +       origin_msg = *s;
> ...
> +                               /* Write the change to the current file */
> +                               stream_write_change(action,
> +
> apply_action == TRANS_LEADER_SERIALIZE ?
> +
> + s : &origin_msg);
> ```
> 
> I'm not sure why origin_msg is needed. Can we remove the conditional
> operator?

Currently, the parallel apply worker needs the transaction xid of this change to
define a savepoint, so it needs to write the original message to the file.
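
As a rough illustration of why the PA needs the xid from the original message (the savepoint name here is made up; the patch may construct it differently):

```
/*
 * When the parallel apply worker sees the first change of a new
 * subtransaction, it can open a savepoint named after the remote xids so
 * that a later STREAM ABORT of that subtransaction can be turned into a
 * rollback to this savepoint.
 */
char		spname[NAMEDATALEN];

snprintf(spname, sizeof(spname), "pg_sp_%u_%u",
		 MyParallelShared->xid, current_xid);
BeginInternalSubTransaction(spname);
```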

> 
> 03. apply_handle_stream_start()
> 
> ```
> + * XXX We can avoid sending pairs of the START/STOP messages to the
> + parallel
> + * worker because unlike apply worker it will process only one
> + transaction at a
> + * time. However, it is not clear whether any optimization is
> + worthwhile
> + * because these messages are sent only when the
> + logical_decoding_work_mem
> + * threshold is exceeded.
> ```
> 
> This comment should be modified because PA must acquire and release locks at
> that time.
> 
> 
> 04. apply_handle_stream_prepare()
> 
> ```
> +                       /*
> +                        * After sending the data to the parallel apply worker,
> wait for
> +                        * that worker to finish. This is necessary to maintain
> commit
> +                        * order which avoids failures due to transaction
> dependencies and
> +                        * deadlocks.
> +                        */
> +
> + parallel_apply_wait_for_xact_finish(winfo->shared);
> ```
> 
> Here seems not to be correct. LA may not send data but spill changes to file.

Changed.

> 05. apply_handle_stream_commit()
> 
> ```
> +                       if (apply_action ==
> TRANS_LEADER_PARTIAL_SERIALIZE)
> +
> + stream_cleanup_files(MyLogicalRepWorker->subid, xid);
> ```
> 
> I'm not sure whether the stream files should be removed by LA or PAs. Could
> you tell me the reason why you choose LA?

I think it is a more natural design that only the LA can create/write/delete the file and
the PAs only need to read from it.

> ===
> applyparallelworker.c
> 
> 05. parallel_apply_can_start()
> 
> ```
> +       if (switching_to_serialize)
> +               return false;
> ```
> 
> Could you add a comment like:
> Don't start a new parallel apply worker if the leader apply worker has been
> spilling changes to the disk temporarily.

That code has been removed.

> 06. parallel_apply_start_worker()
> 
> ```
> +       /*
> +        * Set the xact_state flag in the leader instead of the
> +        * parallel apply worker to avoid the race condition where the leader
> has
> +        * already started waiting for the parallel apply worker to finish
> +        * processing the transaction while the child process has not yet
> +        * processed the first STREAM_START and has not set the
> +        * xact_state to true.
> +        */
> ```
> 
> I thinkg the word "flag" should be used for boolean, so the comment should be
> modified.
> (There are so many such code-comments, all of them should be modified.)

Changed.

> 
> 07. parallel_apply_get_unique_id()
> 
> ```
> +/*
> + * Returns the unique id among all parallel apply workers in the subscriber.
> + */
> +static uint16
> +parallel_apply_get_unique_id()
> ```
> 
> I think this function is inefficient: the computational complexity will be increased
> linearly when the number of PAs is increased. I think the Bitmapset data
> structure may be used.

This function is removed.

> 08. parallel_apply_send_data()
> 
> ```
> #define CHANGES_THRESHOLD    1000
> #define SHM_SEND_TIMEOUT_MS    10000
> ```
> 
> I think the timeout may be too long. Could you tell me the background about it?

Serializing data to a file would affect performance, so I tried to make it hard to trigger unless the
PA is really blocked by another PA or a backend.

> 09. parallel_apply_send_data()
> 
> ```
>             /*
>              * Close the stream file if not in a streaming block, the
> file will
>              * be reopened later.
>              */
>             if (!stream_apply_worker)
>                 serialize_stream_stop(winfo->shared->xid);
> ```
> 
> a.
> IIUC the timings when LA tries to send data but stream_apply_worker is NULL
> are:
> * apply_handle_stream_prepare,
> * apply_handle_stream_start,
> * apply_handle_stream_abort, and
> * apply_handle_stream_commit.
> And at that time the state of TransApplyAction may be
> TRANS_LEADER_SEND_TO_PARALLEL. When should be close the file?

Changed to use another condition to check.

> b.
> Even if this is needed, I think the name of the called function should be modified.
> Here LA may not handle STREAM_STOP message. close_stream_file() or
> something?
> 
> 
> 10. parallel_apply_send_data()
> 
> ```
>             /* Initialize the stream fileset. */
>             serialize_stream_start(winfo->shared->xid, true); ```
> 
> I think the name of the called function should be modified. Here LA may not
> handle STREAM_START message. open_stream_file() or something?
> 
> 11. parallel_apply_send_data()
> 
> ```
>         if (++retry >= CHANGES_THRESHOLD)
>         {
>             MemoryContext oldcontext;
>             StringInfoData msg;
> ...
>             initStringInfo(&msg);
>             appendBinaryStringInfo(&msg, data, nbytes); ...
>             switching_to_serialize = true;
>             apply_dispatch(&msg);
>             switching_to_serialize = false;
> 
>             break;
>         }
> ```
> 
> pfree(msg.data) may be needed.
> 
> ===
> 12. worker_internal.h
> 
> ```
> +       pg_atomic_uint32        left_message;
> ```
> 
> 
> ParallelApplyWorkerShared has been already controlled by mutex locks.  Why
> did you add an atomic variable to the data structure?

I personally feel this value is modified more frequently than the other fields, so I used an atomic
variable here.
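
For reference, a minimal sketch of how the counter is meant to be used (the field name left_message is from the patch; the rest is only illustrative):

```
/* Leader side: account for one more message queued for the worker. */
pg_atomic_add_fetch_u32(&winfo->shared->left_message, 1);

/* Parallel apply worker side: one message has been consumed. */
pg_atomic_sub_fetch_u32(&MyParallelShared->left_message, 1);

/*
 * Parallel apply worker side: only try to take the per-transaction stream
 * lock once the queue has been drained.
 */
if (pg_atomic_read_u32(&MyParallelShared->left_message) == 0)
{
	/* ... acquire the stream lock here ... */
}
```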

> ===
> 13. typedefs.list
> 
> ParallelTransState should be added.

Added.

> ===
> 14. General
> 
> I have already said old about it directly, but I point it out to notify other members
> again.
> I have caused a deadlock with two PAs. Indeed it could be solved by the lmgr, but
> the output seemed not to be kind. Followings were copied from the log and we
> could see that commands executed by apply workers were not output. Can we
> extend it, or is it the out of scope?
> 
> 
> ```
> 2022-11-07 11:11:27.449 UTC [11262] ERROR:  deadlock detected
> 2022-11-07 11:11:27.449 UTC [11262] DETAIL:  Process 11262 waits for
> AccessExclusiveLock on object 16393 of class 6100 of database 0; blocked by
> process 11320.
>         Process 11320 waits for ShareLock on transaction 742; blocked by
> process 11266.
>         Process 11266 waits for AccessShareLock on object 16393 of class 6100 of
> database 0; blocked by process 11262.
>         Process 11262: <command string not enabled>
>         Process 11320: <command string not enabled>
>         Process 11266: <command string not enabled> ```

On HEAD, an apply worker can also cause a deadlock with a user backend. For example:
Tx1 (backend)
begin;
insert into tbl1 values (100);
        Tx2 (replaying streaming transaction)
        begin;
        insert into tbl1 values (1);
        delete from tbl2;
insert into tbl1 values (1);
        insert into tbl1 values (100);

logical replication worker ERROR:  deadlock detected
logical replication worker DETAIL:  Process 2158391 waits for ShareLock on transaction 749; blocked by process
2158410.
        Process 2158410 waits for ShareLock on transaction 750; blocked by process 2158391.
        Process 2158391: <command string not enabled>
        Process 2158410: insert into tbl1 values (1);

So, it looks like the existing behavior. I agree that it would be better to
show something, but maybe we can do that as a separate patch.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 7, 2022 4:17 PM Peter Smith <smithpb2250@gmail.com>
> 
> Here are my review comments for v42-0001

Thanks for the comments.
> ======
> 
> 28. handle_streamed_transaction
> 
>  static bool
>  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)  {
> - TransactionId xid;
> + TransactionId current_xid;
> + ParallelApplyWorkerInfo *winfo;
> + TransApplyAction apply_action;
> + StringInfoData origin_msg;
> +
> + apply_action = get_transaction_apply_action(stream_xid, &winfo);
> 
>   /* not in streaming mode */
> - if (!in_streamed_transaction)
> + if (apply_action == TRANS_LEADER_APPLY)
>   return false;
> 
> - Assert(stream_fd != NULL);
>   Assert(TransactionIdIsValid(stream_xid));
> 
> + origin_msg = *s;
> 
> ~
> 
> 28b.
> Why not assign it at the declaration, the same as apply_handle_stream_prepare
> does?

The assignment is unnecessary for non-streaming transactions, so I delayed it.
> ~
> 
> 44b.
> If this is always written to a file, then wouldn't a better function name be
> something including the word "serialize" - e.g.
> serialize_message()?

I feel it would be better to be consistent with the existing style stream_xxx_xx().

I think I have addressed all the comments, but since quite a bit of the logic has
changed in the new version, I might have missed something. Some code wrapping also needs to
be adjusted; I plan to run pgindent for the next version.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, November 8, 2022 7:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Nov 7, 2022 at 6:49 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, November 4, 2022 7:45 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > 3.
> > > apply_handle_stream_start(StringInfo s) { ...
> > > + if (!first_segment)
> > > + {
> > > + /*
> > > + * Unlock the shared object lock so that parallel apply worker
> > > + * can continue to receive and apply changes.
> > > + */
> > > + parallel_apply_unlock(winfo->shared->stream_lock_id);
> > > ...
> > > }
> > >
> > > Can we have an assert before this unlock call that the lock must be
> > > held? Similarly, if there are other places then we can have assert
> > > there as well.
> >
> > It seems we don't have a standard API can be used without a transaction.
> > Maybe we can use the list ParallelApplyLockids to check that ?
> >
> 
> Yeah, that occurred to me as well but I am not sure if it is a good
> idea to maintain this list just for assertion but if it turns out that
> we need to maintain it for a different purpose then we can probably
> use it for assert as well.
> 
> Few other comments/questions:
> =========================
> 1.
> apply_handle_stream_start(StringInfo s)
> {
> ...
> 
> + case TRANS_PARALLEL_APPLY:
> ...
> ...
> + /*
> + * Unlock the shared object lock so that the leader apply worker
> + * can continue to send changes.
> + */
> + parallel_apply_unlock(MyParallelShared->stream_lock_id,
> AccessShareLock);
> 
> As per the design in the email [1], this lock needs to be released by
> the leader worker during stream start which means it should be
> released under the state TRANS_LEADER_SEND_TO_PARALLEL. From the
> comments as well, it is not clear to me why at this time leader is
> supposed to be blocked. Is there a reason for doing differently than
> what is proposed in the original design?
> 2. Similar to above, it is not clear why the parallel worker needs to
> release the stream_lock_id lock at stream_commit and stream_prepare?

Sorry, these were due to my mistake. Changed.

> 3. Am, I understanding correctly that you need to lock/unlock in
> apply_handle_stream_abort() for the parallel worker because after
> rollback to savepoint, there could be another set of stream or
> transaction end commands for which you want to wait? If so, maybe an
> additional comment would serve the purpose.

I think you are right. I will think about this more in case I missed something and
add some comments in the next version.

> 4.
> The leader may have sent multiple streaming blocks in the queue
> + * When the child is processing a streaming block. So only try to
> + * lock if there is no message left in the queue.
> 
> Let's slightly reword this to: "By the time child is processing the
> changes in the current streaming block, the leader may have sent
> multiple streaming blocks. So, try to lock only if there is no message
> left in the queue."

Changed.

> 5.
> +parallel_apply_unlock(uint16 lockid, LOCKMODE lockmode)
> +{
> + if (!list_member_int(ParallelApplyLockids, lockid))
> + return;
> +
> + UnlockSharedObjectForSession(SubscriptionRelationId,
> MySubscription->oid,
> + lockid, am_leader_apply_worker() ?
> + AccessExclusiveLock:
> + AccessShareLock);
> 
> This function should use lockmode argument passed rather than deciding
> based on am_leader_apply_worker. I think this is anyway going to
> change if we start using a different locktag as discussed in one of
> the above emails.

Changed.

> 6.
> +
>  /*
>   * Common spoolfile processing.
>   */
> -static void
> -apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
> +void
> +apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
> 
> Seems like a spurious line addition.

Removed.

> Fair point. I think if the user wants, she can join with pg_stat_subscription
> based on PID and find the corresponding subscription. However, if we want to
> identify everything via pg_locks then I think we should also mention classid
> or database id as field1. So, it would look like: field1: (pg_subscription's
> oid or current db id); field2: OID of subscription in pg_subscription;
> field3: local or remote xid; field4: 0/1 to differentiate between remote and
> local xid.

I tried to use the local xid to lock the transaction, but we can currently only get
the local xid after applying the first change. And it's possible that the first
change in a parallel apply worker is blocked by another parallel apply worker, which
means the parallel apply worker might not get a chance to share the local xid
with the leader.

To resolve this, I tried to use the remote xid for both the stream lock and the
transaction lock, and use field4 (0/1) to differentiate between them. Like:

field1: (current db id); field2: OID of subscription in pg_subscription;
field3: remote xid; field4: 0/1 to differentiate between stream_lock and
transaction_lock.
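
To make that layout concrete, the tag could look roughly like this (the tag name and macro are hypothetical; the shape follows the existing SET_LOCKTAG_* macros in lock.h):

```
/* objid: e.g. 0 for the stream lock, 1 for the transaction lock */
#define SET_LOCKTAG_APPLY_TRANSACTION(locktag, dboid, suboid, remote_xid, objid) \
	((locktag).locktag_field1 = (dboid), \
	 (locktag).locktag_field2 = (suboid), \
	 (locktag).locktag_field3 = (remote_xid), \
	 (locktag).locktag_field4 = (objid), \
	 (locktag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
```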


> IIUC, this handling is required for the case when we are not able to send a
> message to parallel apply worker and switch to serialize mode (write
> remaining data to file). Basically, it is possible that the message is only
> partially sent and there is no way clean the queue. I feel we can directly
> free the worker in this case even if there is a space in the worker pool. The
> other idea could be that we detach from shm_mq and then invent a way to
> re-attach it after we try to reuse the same worker.

For now, I directly stop the worker in this case. But I will think more about
this.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 7, 2022 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Thu, Nov 3, 2022 at 10:06 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > > wrote:
> > > > >
> > > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > About your point that having different partition structures for
> > > > > > > publisher and subscriber, I don't know how common it will be once
> we
> > > > > > > have DDL replication. Also, the default value of
> > > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > > that this is a quite common case.
> > > > > >
> > > > > > So how can we consider these concurrent issues that could happen
> only
> > > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > > the problem or can we have a safeguard against these conflicts?
> > > > > >
> > > > >
> > > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > > cases as you can see in *0003* patch.
> > > >
> > > > Tightening the restrictions could work in some cases but there might
> > > > still be coner cases and it could reduce the usability. I'm not really
> > > > sure that we can ensure such a deadlock won't happen with the current
> > > > restrictions. I think we need something safeguard just in case. For
> > > > example, if the leader apply worker is waiting for a lock acquired by
> > > > its parallel worker, it cancels the parallel worker's transaction,
> > > > commits its transaction, and restarts logical replication. Or the
> > > > leader can log the deadlock to let the user know.
> > > >
> > >
> > > As another direction, we could make the parallel apply feature robust
> > > if we can detect deadlocks that happen among the leader worker and
> > > parallel workers. I'd like to summarize the idea discussed off-list
> > > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > > that when the leader worker or parallel worker needs to wait for
> > > something (eg. transaction completion, messages) we use lmgr
> > > functionality so that we can create wait-for edges and detect
> > > deadlocks in lmgr.
> > >
> > > For example, a scenario where a deadlock occurs is the following:
> > >
> > > [Publisher]
> > > create table tab1(a int);
> > > create publication pub for table tab1;
> > >
> > > [Subcriber]
> > > creat table tab1(a int primary key);
> > > create subscription sub connection 'port=10000 dbname=postgres'
> > > publication pub with (streaming = parallel);
> > >
> > > TX1:
> > > BEGIN;
> > > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > >     Tx2:
> > >     BEGIN;
> > >     INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); --
> streamed
> > >     COMMIT;
> > > COMMIT;
> > >
> > > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > > leader apply worker (LA) is executing TX-2 concurrently on the
> > > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > > deadlock between PA-1 and LA but lmgr cannot detect it.
> > >
> > > One idea to resolve this issue is that we have LA acquire a session
> > > lock on a shared object (by LockSharedObjectForSession()) and have
> > > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > > acquires the lock before sending STREAM_STOP and releases it if
> > > already acquired before sending STREAM_START, STREAM_PREPARE and
> > > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > > processing STREAM_STOP and then release immediately after acquiring
> > > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> > >
> > > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > > object) -> LA
> > >
> > > We would need the shared objects per parallel apply worker.
> > >
> > > After detecting a deadlock, we can restart logical replication with
> > > temporarily disabling the parallel apply, which is done by 0005 patch.
> > >
> > > Another scenario is similar to the previous case but TX-1 and TX-2 are
> > > executed by two parallel apply workers (PA-1 and PA-2 respectively).
> > > In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> > > while PA-1 is waiting for subsequent input from LA. Also, LA is
> > > waiting for PA-2 to complete its transaction in order to preserve the
> > > commit order. There is a deadlock among three processes but it cannot
> > > be detected in lmgr because the fact that LA is waiting for PA-2 to
> > > complete its transaction doesn't appear in lmgr (see
> > > parallel_apply_wait_for_xact_finish()). To fix it, we can use
> > > XactLockTableWait() instead.
> > >
> > > However, since XactLockTableWait() considers PREPARED TRANSACTION
> as
> > > still in progress, probably we need a similar trick as above in case
> > > where a transaction is prepared. For example, suppose that TX-2 was
> > > prepared instead of committed in the above scenario, PA-2 acquires
> > > another shared lock at START_STREAM and releases it at
> > > STREAM_COMMIT/PREPARE. LA can wait on the lock.
> > >
> > > Yet another scenario where LA has to wait is the case where the shm_mq
> > > buffer is full. In the above scenario (ie. PA-1 and PA-2 are executing
> > > transactions concurrently), if  the shm_mq buffer between LA and PA-2
> > > is full, LA has to wait to send messages, and this wait doesn't appear
> > > in lmgr. To fix it, probably we have to use non-blocking write and
> > > wait with a timeout. If timeout is exceeded, the LA will write to file
> > > and indicate PA-2 that it needs to read file for remaining messages.
> > > Then LA will start waiting for commit which will detect deadlock if
> > > any.
> > >
> > > If we can detect deadlocks by having such a functionality or some
> > > other way then we don't need to tighten the restrictions of subscribed
> > > tables' schemas etc.
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set. I have done
> some
> > basic tests for the new codes and it work fine.
> 
> Thank you for updating the patches!
> 
> Here are comments on v42-0001:
> 
> We have the following three similar name functions regarding to
> starting a new parallel apply worker:
> ---
>          /*
>           * Exit if any parameter that affects the remote connection
> was changed.
> -         * The launcher will start a new worker.
> +         * The launcher will start a new worker, but note that the
> parallel apply
> +         * worker may or may not restart depending on the value of
> the streaming
> +         * option and whether there will be a streaming transaction.
> 
> In which case does the parallel apply worker don't restart even if the
> streaming option has been changed?

Sorry, I forgot to reply to this comment. If the user changes the streaming option from
'parallel' to 'on' or 'off', the parallel apply workers won't be restarted.

Best regards,
Hou zj

On Fri, Nov 11, 2022 at 7:57 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, November 7, 2022 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Here are comments on v42-0001:
> >
> > We have the following three similar name functions regarding to
> > starting a new parallel apply worker:
> > ---
> >          /*
> >           * Exit if any parameter that affects the remote connection
> > was changed.
> > -         * The launcher will start a new worker.
> > +         * The launcher will start a new worker, but note that the
> > parallel apply
> > +         * worker may or may not restart depending on the value of
> > the streaming
> > +         * option and whether there will be a streaming transaction.
> >
> > In which case does the parallel apply worker don't restart even if the
> > streaming option has been changed?
>
> Sorry, I forgot to reply to this comment. If user change the streaming option from
> 'parallel' to 'on' or 'off', the parallel apply workers won't be restarted.
>

How about something like the below so as to be more explicit about
this in the comments?
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index bfe326bf0c..74cd5565bd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3727,9 +3727,10 @@ maybe_reread_subscription(void)

        /*
         * Exit if any parameter that affects the remote connection was changed.
-        * The launcher will start a new worker, but note that the
parallel apply
-        * worker may or may not restart depending on the value of the streaming
-        * option and whether there will be a streaming transaction.
+        * The launcher will start a new worker but note that the parallel apply
+        * worker won't restart if the streaming option's value is changed from
+        * 'parallel' to any other value or the server decides not to stream the
+        * in-progress transaction.
         */

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
Hi,

I noticed a CFbot failure and here is the new version patch set which should fix that.
I also ran pgindent and made some cosmetic changes in the new version patch.

Best regards,
Hou zj

Attachment
On Thu, Nov 10, 2022 at 8:41 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, November 7, 2022 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
> >
> > Here are comments on v42-0001:
>
> Thanks for the comments.
>
> > We have the following three similar name functions regarding to
> > starting a new parallel apply worker:
> >
> > parallel_apply_start_worker()
> > parallel_apply_setup_worker()
> > parallel_apply_setup_dsm()
> >
> > It seems to me that we can somewhat merge them since
> > parallel_apply_setup_worker() and parallel_apply_setup_dsm() have only
> > one caller.
>
> Since these functions are doing different tasks(external function, Launch, DSM), so I
> personally feel it's OK to split them. But if others also feel it's unnecessary I will
> merge them.
>

I think it is fine either way, but if you want to keep the
functionality of parallel_apply_setup_worker() separate then let's
name it something like parallel_apply_init_and_launch_worker, which
makes the function name a bit long but is clearer. I am
thinking that instead of using parallel_apply in front of each
function, shall we use PA? Then we could name this function
PAInitializeAndLaunchWorker().

I feel you can even move the functionality to get the worker from the pool
in parallel_apply_start_worker() to a separate function.

Another related comment:
+ /* Try to get a free parallel apply worker. */
+ foreach(lc, ParallelApplyWorkersList)
+ {
+ ParallelApplyWorkerInfo *tmp_winfo;
+
+ tmp_winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
+
+ /* Check if the transaction in the worker has finished. */
+ if (parallel_apply_free_worker(tmp_winfo, tmp_winfo->shared->xid, false))
+ {
+ /*
+ * Clean up the woker information if the parallel apply woker has
+ * been stopped.
+ */
+ ParallelApplyWorkersList =
foreach_delete_current(ParallelApplyWorkersList, lc);
+ parallel_apply_free_worker_info(tmp_winfo);
+ continue;
+ }

I find it a bit odd that even though parallel_apply_free_worker() has the
functionality to free the worker info, we are still doing it outside.
Is there a specific reason for that? I think we can add a comment
atop parallel_apply_free_worker() saying that, on success, it will free the
passed winfo. In addition to that, we can write some comments before
trying to free the worker, explaining that this is possible for
rollback cases because after a rollback we don't wait for workers to
finish and so can't perform the cleanup there.

> > ---
> > in parallel_apply_send_data():
> >
> > +                result = shm_mq_send(winfo->mq_handle, nbytes, data,
> > true, true);
> > +
> > +                if (result == SHM_MQ_SUCCESS)
> > +                        break;
> > +                else if (result == SHM_MQ_DETACHED)
> > +                        ereport(ERROR,
> > +
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                         errmsg("could not send data
> > to shared-memory queue")))
> > +
> > +                Assert(result == SHM_MQ_WOULD_BLOCK);
> > +
> > +                if (++retry >= CHANGES_THRESHOLD)
> > +                {
> > +                        MemoryContext oldcontext;
> > +                        StringInfoData msg;
> > +                        TimestampTz now = GetCurrentTimestamp();
> > +
> > +                        if (startTime == 0)
> > +                                startTime = now;
> > +
> > +                        if (!TimestampDifferenceExceeds(startTime,
> > now, SHM_SEND_TIMEOUT_MS))
> > +                                continue;
> >
> > IIUC since the parallel worker retries to send data without waits the
> > 'retry' will get larger than CHANGES_THRESHOLD in a very short time.
> > But the worker waits at least for SHM_SEND_TIMEOUT_MS to spool data
> > regardless of 'retry' count. Don't we need to nap somewhat and why do
> > we need CHANGES_THRESHOLD?
>
> Oh, I intended to only check for timeout after continuously retrying XX times to
> reduce the cost of getting the system time and calculating the time difference.
> I added some comments in the code.
>

Sure, but the patch assumes that an immediate retry will help, which I am
not sure is correct. IIUC, the patch has an overall wait time of 10s; if so,
I guess you can retry after 1s, which will ameliorate the cost of
getting the system time.
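
For example, something along these lines (just a sketch; the wait event shown is an existing generic one picked for illustration):

```
/* Nap for a second between send attempts instead of busy-retrying. */
(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 1000L,			/* 1s */
				 WAIT_EVENT_MQ_SEND);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
```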

-- 
With Regards,
Amit Kapila.



On Fri, Nov 11, 2022 at 2:12 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Few comments on v46-0001:
======================
1.
+static void
+apply_handle_stream_abort(StringInfo s)
{
...
+ /* Send STREAM ABORT message to the parallel apply worker. */
+ parallel_apply_send_data(winfo, s->len, s->data);
+
+ if (abort_toplevel_transaction)
+ {
+ parallel_apply_unlock_stream(xid, AccessExclusiveLock);

Shouldn't we release this lock before sending the message, as
we do for stream_prepare and stream_commit? If there is a
reason for doing it differently here then let's add some comments
explaining it.

2. It seems once the patch makes the file state as busy
(LEADER_FILESET_BUSY), it will only be accessible after the leader
apply worker receives a transaction end message like stream_commit. Is
my understanding correct? If yes, then why can't we make it accessible
after the stream_stop message? Are you worried about the concurrency
handling for reading and writing the file? If so, we can probably deal
with it via some lock for reading and writing to file for each change.
I think after this we may not need additional stream level lock/unlock
in parallel_apply_spooled_messages. I understand that you probably
want to keep the code simple so I am not suggesting changing it
immediately but just wanted to know whether you have considered
alternatives here.

3. Don't we need to release the transaction lock at stream_abort in
parallel apply worker? I understand that we are not waiting for it in
the leader worker but still parallel apply worker should release it if
acquired at stream_start by it.

4. A minor comment change as below:
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index 43f09b7e9a..c771851d1f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1851,6 +1851,9 @@ apply_handle_stream_abort(StringInfo s)
                        parallel_apply_stream_abort(&abort_data);

                        /*
+                        * We need to wait after processing rollback
to savepoint for the next set
+                        * of changes.
+                        *
                         * By the time parallel apply worker is
processing the changes in
                         * the current streaming block, the leader
apply worker may have
                         * sent multiple streaming blocks. So, try to
lock only if there


-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Saturday, November 12, 2022 7:06 PM Amit Kapila <amit.kapila16@gmail.com>
> 
> On Fri, Nov 11, 2022 at 2:12 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> 
> Few comments on v46-0001:
> ======================
>

Thanks for the comments.

> 1.
> +static void
> +apply_handle_stream_abort(StringInfo s)
> {
> ...
> + /* Send STREAM ABORT message to the parallel apply worker. */
> + parallel_apply_send_data(winfo, s->len, s->data);
> +
> + if (abort_toplevel_transaction)
> + {
> + parallel_apply_unlock_stream(xid, AccessExclusiveLock);
> 
> Shouldn't we need to release this lock before sending the message as
> we are doing for streap_prepare and stream_commit? If there is a
> reason for doing it differently here then let's add some comments for
> the same.

Changed.

> 2. It seems once the patch makes the file state as busy
> (LEADER_FILESET_BUSY), it will only be accessible after the leader
> apply worker receives a transaction end message like stream_commit. Is
> my understanding correct? If yes, then why can't we make it accessible
> after the stream_stop message? Are you worried about the concurrency
> handling for reading and writing the file? If so, we can probably deal
> with it via some lock for reading and writing to file for each change.
> I think after this we may not need additional stream level lock/unlock
> in parallel_apply_spooled_messages. I understand that you probably
> want to keep the code simple so I am not suggesting changing it
> immediately but just wanted to know whether you have considered
> alternatives here.

I thought about this, but it seems the current buffile design doesn't allow two
processes to open the same buffile at the same time (refer to the comment atop
BufFileOpenFileSet()). This means the LA needs to make sure the PA has closed
the buffile before writing more changes into it. Although we could let the LA
wait for that, it could cause another kind of deadlock. Suppose the PA opened
the file and is blocked when applying the just-read change, and the LA starts
to wait when trying to write the next set of streaming changes into the file
because the file is still opened by the PA. Then the lock edge is like:

LA (wait for file to be closed) -> PA1 (wait for unique lock in PA2) -> PA2
(wait for stream lock held in LA)

We could introduce another lock for this, but that doesn't seem very great as
we already have two kinds of locks here.

Another solution which doesn't need a new lock could be to create a different
filename for each streaming block so that the leader doesn't need to reopen
the same file after writing changes into it, but that would largely increase
the number of temp files and looks a bit hacky. Or we could let the PA open
the file, then read and close the file for each change, but that seems to
bring some overhead of opening and closing the file.

Based on the above, how about keeping the current approach? (i.e. the PA will
open the file only after the leader apply worker receives a transaction end
message like stream_commit.) Ideally, it will enter partial serialize mode
only when the PA is blocked by a backend or another PA, which seems not that
common.

> 3. Don't we need to release the transaction lock at stream_abort in
> parallel apply worker? I understand that we are not waiting for it in
> the leader worker but still parallel apply worker should release it if
> acquired at stream_start by it.

I thought that the lock would be automatically released on rollback. But after testing, I find
it's possible that the lock won't be released if it's an empty streaming transaction. So, I
added the code to release the lock in the new version patch.

> 
> 4. A minor comment change as below:
> diff --git a/src/backend/replication/logical/worker.c
> b/src/backend/replication/logical/worker.c
> index 43f09b7e9a..c771851d1f 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -1851,6 +1851,9 @@ apply_handle_stream_abort(StringInfo s)
>                         parallel_apply_stream_abort(&abort_data);
> 
>                         /*
> +                        * We need to wait after processing rollback
> to savepoint for the next set
> +                        * of changes.
> +                        *
>                          * By the time parallel apply worker is
> processing the changes in
>                          * the current streaming block, the leader
> apply worker may have
>                          * sent multiple streaming blocks. So, try to
> lock only if there

Merged.

Attached is the new version patch set which addresses the above comments and the comments from [1].

In the new version patch, I renamed parallel_apply_xxx functions to pa_xxx to
make the name shorter according to the suggestion in [1]. Besides, I split the
codes related to partial serialize to 0002 patch to make the patch easier to
review.

[1] https://www.postgresql.org/message-id/CAA4eK1LGyQ%2BS-jCMnYSz_hvoqiNA0Of%3D%2BMksY%3DXTUaRc5XzXJQ%40mail.gmail.com

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, November 15, 2022 7:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Saturday, November 12, 2022 7:06 PM Amit Kapila
> <amit.kapila16@gmail.com>
> >
> > On Fri, Nov 11, 2022 at 2:12 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > Few comments on v46-0001:
> > ======================
> >
> 
> Thanks for the comments.
> 
> > 1.
> > +static void
> > +apply_handle_stream_abort(StringInfo s)
> > {
> > ...
> > + /* Send STREAM ABORT message to the parallel apply worker. */
> > + parallel_apply_send_data(winfo, s->len, s->data);
> > +
> > + if (abort_toplevel_transaction)
> > + {
> > + parallel_apply_unlock_stream(xid, AccessExclusiveLock);
> >
> > Shouldn't we need to release this lock before sending the message as
> > we are doing for streap_prepare and stream_commit? If there is a
> > reason for doing it differently here then let's add some comments for
> > the same.
> 
> Changed.
> 
> > 2. It seems once the patch makes the file state as busy
> > (LEADER_FILESET_BUSY), it will only be accessible after the leader
> > apply worker receives a transaction end message like stream_commit. Is
> > my understanding correct? If yes, then why can't we make it accessible
> > after the stream_stop message? Are you worried about the concurrency
> > handling for reading and writing the file? If so, we can probably deal
> > with it via some lock for reading and writing to file for each change.
> > I think after this we may not need additional stream level lock/unlock
> > in parallel_apply_spooled_messages. I understand that you probably
> > want to keep the code simple so I am not suggesting changing it
> > immediately but just wanted to know whether you have considered
> > alternatives here.
> 
> I thought about this, but it seems the current buffile design doesn't allow two
> processes to open the same buffile at the same time(refer to the comment
> atop of BufFileOpenFileSet()). This means the LA needs to make sure the PA has
> closed the buffile before writing more changes into it. Although we could let
> the LA wait for that, but it could cause another kind of deadlock. Suppose the
> PA opened the file and is blocked when applying the just read change. And the
> LA starts to wait when trying to write the next set of streaming changes into file
> because the file is still opened by PA. Then the lock edge is like:
> 
> LA (wait for file to be closed) -> PA1 (wait for unique lock in PA2) -> PA2 (wait
> for stream lock held in LA)
> 
> We could introduce another lock for this, but that seems not very great as we
> already had two kinds of locks here.
> 
> Another solution which doesn't need a new lock could be that we create
> different filename for each streaming block so that the leader doesn't need to
> reopen the same file after writing changes into it, but that seems largely
> increase the number of temp files and looks a bit hacky. Or we could let PA
> open the file, then read and close the file for each change, but it seems bring
> some overhead of opening and closing file.
> 
> Based on above, how about keep the current approach ?(i.e. PA will open the
> file only after the leader apply worker receives a transaction end message like
> stream_commit). Ideally, it will enter partial serialize mode only when PA is
> blocked by a backend or another PA which seems not that common.
> 
> > 3. Don't we need to release the transaction lock at stream_abort in
> > parallel apply worker? I understand that we are not waiting for it in
> > the leader worker but still parallel apply worker should release it if
> > acquired at stream_start by it.
> 
> I thought that the lock will be automatically released on rollback. But after
> testing, I find It’s possible that the lock won't be released if it's a empty
> streaming transaction. So, I add the code to release the lock in the new version
> patch.
> 
> >
> > 4. A minor comment change as below:
> > diff --git a/src/backend/replication/logical/worker.c
> > b/src/backend/replication/logical/worker.c
> > index 43f09b7e9a..c771851d1f 100644
> > --- a/src/backend/replication/logical/worker.c
> > +++ b/src/backend/replication/logical/worker.c
> > @@ -1851,6 +1851,9 @@ apply_handle_stream_abort(StringInfo s)
> >                         parallel_apply_stream_abort(&abort_data);
> >
> >                         /*
> > +                        * We need to wait after processing rollback
> > to savepoint for the next set
> > +                        * of changes.
> > +                        *
> >                          * By the time parallel apply worker is
> > processing the changes in
> >                          * the current streaming block, the leader
> > apply worker may have
> >                          * sent multiple streaming blocks. So, try to
> > lock only if there
> 
> Merged.
> 
> Attach the new version patch set which addressed above comments and
> comments from [1].
> 
> In the new version patch, I renamed parallel_apply_xxx functions to pa_xxx to
> make the name shorter according to the suggestion in [1]. Besides, I split the
> codes related to partial serialize to 0002 patch to make the patch easier to
> review.
> 
> [1]
> https://www.postgresql.org/message-id/CAA4eK1LGyQ%2BS-jCMnYSz_hvoq
> iNA0Of%3D%2BMksY%3DXTUaRc5XzXJQ%40mail.gmail.com

I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying to send a
message. So, attached is the new version which adds that. Also attached is the
0004 patch that restarts logical replication, temporarily disabling parallel
apply, if a transaction failed to be applied in a parallel apply worker.

Best regards,
Hou zj

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Tue, Nov 15, 2022 at 8:57 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, November 12, 2022 7:06 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > On Fri, Nov 11, 2022 at 2:12 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > Few comments on v46-0001:
> > ======================
> >
>
> Thanks for the comments.
>
> > 1.
> > +static void
> > +apply_handle_stream_abort(StringInfo s)
> > {
> > ...
> > + /* Send STREAM ABORT message to the parallel apply worker. */
> > + parallel_apply_send_data(winfo, s->len, s->data);
> > +
> > + if (abort_toplevel_transaction)
> > + {
> > + parallel_apply_unlock_stream(xid, AccessExclusiveLock);
> >
> > Shouldn't we need to release this lock before sending the message as
> > we are doing for streap_prepare and stream_commit? If there is a
> > reason for doing it differently here then let's add some comments for
> > the same.
>
> Changed.
>
> > 2. It seems once the patch makes the file state as busy
> > (LEADER_FILESET_BUSY), it will only be accessible after the leader
> > apply worker receives a transaction end message like stream_commit. Is
> > my understanding correct? If yes, then why can't we make it accessible
> > after the stream_stop message? Are you worried about the concurrency
> > handling for reading and writing the file? If so, we can probably deal
> > with it via some lock for reading and writing to file for each change.
> > I think after this we may not need additional stream level lock/unlock
> > in parallel_apply_spooled_messages. I understand that you probably
> > want to keep the code simple so I am not suggesting changing it
> > immediately but just wanted to know whether you have considered
> > alternatives here.
>
> I thought about this, but it seems the current buffile design doesn't allow two
> processes to open the same buffile at the same time(refer to the comment atop
> of BufFileOpenFileSet()). This means the LA needs to make sure the PA has
> closed the buffile before writing more changes into it. Although we could let
> the LA wait for that, but it could cause another kind of deadlock. Suppose the
> PA opened the file and is blocked when applying the just read change. And the
> LA starts to wait when trying to write the next set of streaming changes into
> file because the file is still opened by PA. Then the lock edge is like:
>
> LA (wait for file to be closed) -> PA1 (wait for unique lock in PA2) -> PA2
> (wait for stream lock held in LA)
>
> We could introduce another lock for this, but that seems not very great as we
> already had two kinds of locks here.
>
> Another solution which doesn't need a new lock could be that we create
> different filename for each streaming block so that the leader doesn't need to
> reopen the same file after writing changes into it, but that seems largely
> increase the number of temp files and looks a bit hacky. Or we could let PA
> open the file, then read and close the file for each change, but it seems bring
> some overhead of opening and closing file.
>
> Based on above, how about keep the current approach ?(i.e. PA
> will open the file only after the leader apply worker receives a transaction
> end message like stream_commit). Ideally, it will enter partial serialize mode
> only when PA is blocked by a backend or another PA which seems not that common.

+1. We can improve this area later in a separate patch.

Here are review comments on v47-0001 and v47-0002 patches:

When the parallel apply worker exited, I got the following server log.
I think this log is not appropriate since the worker was not
terminated by administrator command but exited by itself. Also,
probably it should exit with exit code 0?

FATAL:  terminating logical replication worker due to administrator command
LOG:  background worker "logical replication parallel worker" (PID
3594918) exited with exit code 1

---
/*
 * Stop the worker if there are enough workers in the pool or the leader
 * apply worker serialized part of the transaction data to a file due to
 * send timeout.
 */
if (winfo->serialize_changes ||
napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

Why do we need to stop the worker if the leader serializes changes?

---
+        /*
+         * Release all session level locks that could be held in parallel apply
+         * mode.
+         */
+        LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+

I think we call LockReleaseAll() at the process exit (in ProcKill()),
but do we really need to do LockReleaseAll() here too?

---

+                elog(ERROR, "could not find replication state slot
for replication"
+                         "origin with OID %u which was acquired by
%d", node, acquired_by);

Let's not break the error log message in the middle so that the user
can search the message by grep easily.
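
As a sketch, that would just mean keeping the whole format string on one line
(same message text, only the line break removed):

elog(ERROR, "could not find replication state slot for replication origin with OID %u which was acquired by %d",
     node, acquired_by);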

---
+        {
+                {"max_parallel_apply_workers_per_subscription",
+                        PGC_SIGHUP,
+                        REPLICATION_SUBSCRIBERS,
+                        gettext_noop("Maximum number of parallel
apply workers per subscription."),
+                        NULL,
+                },
+                &max_parallel_apply_workers_per_subscription,
+                2, 0, MAX_BACKENDS,
+                NULL, NULL, NULL
+        },
+

I think we should use MAX_PARALLEL_WORKER_LIMIT as the max value
instead. MAX_BACKENDS is too high.

---
+        /*
+         * Indicates whether there are pending messages in the queue.
The parallel
+         * apply worker will check it before starting to wait.
+         */
+        pg_atomic_uint32       pending_message_count;

The "pending messages" sounds like individual logical replication
messages such as LOGICAL_REP_MSG_INSERT. But IIUC what this value
actually shows is how many streamed chunks are pending to process,
right?

---
The streaming parameter has the new value "parallel" for "streaming"
option to enable the parallel apply. It fits so far but I think the
parallel apply feature doesn't necessarily need to be tied up with
streaming replication. For example, we might want to support parallel
apply also for non-streaming transactions in the future. It might be
better to have another option, say "parallel", to control parallel
apply behavior. The "parallel" option can be a boolean option and
setting parallel = on requires streaming = on.

Another variant is to have a new subscription parameter for example
"parallel_workers" parameter that specifies the number of parallel
workers. That way, users can specify the number of parallel workers
per subscription.

---
When the parallel apply worker raises an error, I got the same error
twice from the leader worker and parallel worker as follows. Can we
suppress either one?

2022-11-17 17:30:23.490 JST [3814552] LOG:  logical replication
parallel apply worker for subscription "test_sub1" has started
2022-11-17 17:30:23.490 JST [3814552] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814552] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814552] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
2022-11-17 17:30:23.490 JST [3814550] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814550] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814550] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
        parallel apply worker

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Nov 16, 2022 at 1:50 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, November 15, 2022 7:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying send message.
> So, attach the new version which adds that. Also attach the 0004 patch that
> restarts logical replication with temporarily disabling the parallel apply if
> failed to apply a transaction in parallel apply worker.
>

Few comments on v48-0001
======================
1. The variable name pending_message_count seems to indicate a number
of pending messages but normally it is pending start/stop streams
except for probably rollback to savepoint case. Shall we name it
pending_stream_count and change the comments accordingly?

2. The variable name abort_toplevel_transaction seems unnecessarily
long. Shall we rename it to toplevel_xact or something like that?

3.
+ /*
+ * Increment the number of messages waiting to be processed by
+ * parallel apply worker.
+ */
+ if (!abort_toplevel_transaction)
+ pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
+ else
+ pa_unlock_stream(xid, AccessExclusiveLock);

It is better to explain here why different actions are required for
subtransaction and transaction rather than the current comment.

4.
+
+ if (abort_toplevel_transaction)
+ {
+ (void) pa_free_worker(winfo, xid);
+ }

{} is not required here.

5.
/*
+ * Although the lock can be automatically released during transaction
+ * rollback, but we still release the lock here as we may not in a
+ * transaction.
+ */
+ pa_unlock_transaction(xid, AccessShareLock);
+

It is better to explain for which case (I think it is for empty xacts)
it will be useful to release it explicitly.

6.
+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether any optimization is worthwhile
+ * because these messages are sent only when the logical_decoding_work_mem
+ * threshold is exceeded.
  */
 static void
 apply_handle_stream_start(StringInfo s)

I think this comment is no longer valid as now we need to wait for the
next stream at stream_stop message and also need to acquire the lock
in stream_start message. So, I think it is better to remove it unless
I am missing something.

7. I am able to compile applyparallelworker.c after commenting out a few of
the header includes. Please check if those are really required.
#include "libpq/pqformat.h"
#include "libpq/pqmq.h"
//#include "mb/pg_wchar.h"
#include "pgstat.h"
#include "postmaster/interrupt.h"
#include "replication/logicallauncher.h"
//#include "replication/logicalworker.h"
#include "replication/origin.h"
//#include "replication/walreceiver.h"
#include "replication/worker_internal.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
//#include "storage/procarray.h"
#include "tcop/tcopprot.h"
#include "utils/inval.h"
#include "utils/memutils.h"
//#include "utils/resowner.h"
#include "utils/syscache.h"

8.
+/*
+ * Is there a message sent by parallel apply worker which we need to receive?
+ */
+volatile sig_atomic_t ParallelApplyMessagePending = false;

This comment and variable are placed in applyparallelworker.c, so 'we'
in the above sentence is not clear. I think you need to use leader
apply worker instead.

9.
+static ParallelApplyWorkerInfo *pa_get_free_worker(void);

Will it be better if we name this function pa_get_available_worker()?

-- 
With Regards,
Amit Kapila.



On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
...
> ---
> The streaming parameter has the new value "parallel" for "streaming"
> option to enable the parallel apply. It fits so far but I think the
> parallel apply feature doesn't necessarily need to be tied up with
> streaming replication. For example, we might want to support parallel
> apply also for non-streaming transactions in the future. It might be
> better to have another option, say "parallel", to control parallel
> apply behavior. The "parallel" option can be a boolean option and
> setting parallel = on requires streaming = on.
>

FWIW, I tend to agree with this idea but for a different reason. In
this patch, the 'streaming' parameter had become a kind of hybrid
boolean/enum. AFAIK there are no other parameters anywhere that use a
hybrid pattern like this so I was thinking it may be better not to be
different.

But I didn't think that parallel_apply=on should *require*
streaming=on. It might be better for parallel_apply=on to just be the
*default*, where it simply achieves nothing unless streaming=on too.
That way users would not need to change anything at all to get the
benefits of parallel streaming.

> Another variant is to have a new subscription parameter for example
> "parallel_workers" parameter that specifies the number of parallel
> workers. That way, users can specify the number of parallel workers
> per subscription.
>

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



On Fri, Nov 18, 2022 at 8:01 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> ...
> > ---
> > The streaming parameter has the new value "parallel" for "streaming"
> > option to enable the parallel apply. It fits so far but I think the
> > parallel apply feature doesn't necessarily need to be tied up with
> > streaming replication. For example, we might want to support parallel
> > apply also for non-streaming transactions in the future. It might be
> > better to have another option, say "parallel", to control parallel
> > apply behavior. The "parallel" option can be a boolean option and
> > setting parallel = on requires streaming = on.
> >

If we do that then how will the user be able to use streaming
serialize mode (write to file for streaming transactions) as we have
now? Because after we introduce parallelism for non-streaming
transactions, the user would want parallel = on irrespective of the
streaming mode. Also, users may wish to only parallelize large
transactions because of additional overhead for non-streaming
transactions for transaction dependency tracking, etc. So, the user
may wish to have a separate knob for large transactions as the patch
has now.

>
> FWIW, I tend to agree with this idea but for a different reason. In
> this patch, the 'streaming' parameter had become a kind of hybrid
> boolean/enum. AFAIK there are no other parameters anywhere that use a
> hybrid pattern like this so I was thinking it may be better not to be
> different.
>

I think we have a similar pattern for GUC parameters like
constraint_exclusion (see constraint_exclusion_options),
backslash_quote (see backslash_quote_options), etc.

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 18, 2022 at 8:01 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > ...
> > > ---
> > > The streaming parameter has the new value "parallel" for "streaming"
> > > option to enable the parallel apply. It fits so far but I think the
> > > parallel apply feature doesn't necessarily need to be tied up with
> > > streaming replication. For example, we might want to support parallel
> > > apply also for non-streaming transactions in the future. It might be
> > > better to have another option, say "parallel", to control parallel
> > > apply behavior. The "parallel" option can be a boolean option and
> > > setting parallel = on requires streaming = on.
> > >
>
> If we do that then how will the user be able to use streaming
> serialize mode (write to file for streaming transactions) as we have
> now? Because after we introduce parallelism for non-streaming
> transactions, the user would want parallel = on irrespective of the
> streaming mode. Also, users may wish to only parallelize large
> transactions because of additional overhead for non-streaming
> transactions for transaction dependency tracking, etc. So, the user
> may wish to have a separate knob for large transactions as the patch
> has now.

One idea for that would be to make it an enum. For example, setting
parallel = "streaming" works for that.

>
> >
> > FWIW, I tend to agree with this idea but for a different reason. In
> > this patch, the 'streaming' parameter had become a kind of hybrid
> > boolean/enum. AFAIK there are no other parameters anywhere that use a
> > hybrid pattern like this so I was thinking it may be better not to be
> > different.
> >
>
> I think we have a similar pattern for GUC parameters like
> constraint_exclusion (see constraint_exclusion_options),
> backslash_quote (see backslash_quote_options), etc.

Right. The vacuum_index_cleanup and buffering storage parameters (which
accept 'on', 'off', or 'auto') are other examples.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Nov 18, 2022 at 10:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 18, 2022 at 8:01 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > ...
> > > > ---
> > > > The streaming parameter has the new value "parallel" for "streaming"
> > > > option to enable the parallel apply. It fits so far but I think the
> > > > parallel apply feature doesn't necessarily need to be tied up with
> > > > streaming replication. For example, we might want to support parallel
> > > > apply also for non-streaming transactions in the future. It might be
> > > > better to have another option, say "parallel", to control parallel
> > > > apply behavior. The "parallel" option can be a boolean option and
> > > > setting parallel = on requires streaming = on.
> > > >
> >
> > If we do that then how will the user be able to use streaming
> > serialize mode (write to file for streaming transactions) as we have
> > now? Because after we introduce parallelism for non-streaming
> > transactions, the user would want parallel = on irrespective of the
> > streaming mode. Also, users may wish to only parallelize large
> > transactions because of additional overhead for non-streaming
> > transactions for transaction dependency tracking, etc. So, the user
> > may wish to have a separate knob for large transactions as the patch
> > has now.
>
> One idea for that would be to make it enum. For example, setting
> parallel = "streaming" works for that.
>

Yeah, but then we will have two different parameters (parallel and
streaming) to control streaming behavior. This will be confusing, say,
when the user sets parallel = 'streaming' and streaming = off; we would
probably need to disallow such settings, but I am not sure that would be
any better than allowing parallelism for large xacts via the streaming
parameter.

-- 
With Regards,
Amit Kapila.



Here are some review comments for v47-0001

(This review is a WIP - I will post more comments for this patch next week)

======

.../replication/logical/applyparallelworker.c

1.


+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION src/backend/replication/logical/applyparallelworker.c
+ *

This IDENTIFICATION should be on 2 lines like it previously was
instead of wrapped into one line. For consistency with all other file
headers.

~~~

2. File header comment

+ * Since the database structure (schema of subscription tables, etc.) of
+ * publisher and subscriber may be different.

Incomplete sentence?

~~~

3.

+ * When the following two scenarios occur, a deadlock occurs.

Actually, you described three scenarios in this comment. Not two.

SUGGESTION
The following scenarios can cause a deadlock.

~~~

4.

+ * LA (waiting to acquire the local transaction lock) -> PA1 (waiting to
+ * acquire the lock on the unique index) -> PA2 (waiting to acquire the lock on
+ * the remote transaction) -> LA

"PA1" -> "PA-1"
"PA2" -> "PA-2"

~~~

5.

+ * To resolve this issue, we use non-blocking write and wait with a timeout. If
+ * timeout is exceeded, the LA report an error and restart logical replication.

"report" --> "reports"
"restart" -> "restarts"

OR

"LA report" -> "LA will report"

~~~

6. pa_wait_for_xact_state

+/*
+ * Wait until the parallel apply worker's transaction state reach or exceed the
+ * given xact_state.
+ */
+static void
+pa_wait_for_xact_state(ParallelApplyWorkerShared *wshared,
+    ParallelTransState xact_state)

"reach or exceed" -> "reaches or exceeds"

~~~

7. pa_stream_abort

+ /*
+ * Although the lock can be automatically released during transaction
+ * rollback, but we still release the lock here as we may not in a
+ * transaction.
+ */
+ pa_unlock_transaction(xid, AccessShareLock);

"but we still" -> "we still"
"we may not in a" -> "we may not be in a"

~~~

8.

+ pa_savepoint_name(MySubscription->oid, subxid, spname,
+   sizeof(spname));
+

Unnecessary wrapping

~~~

9.

+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+ if (xid_tmp == subxid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);
+ }

This code logic does not seem to require the 'found' flag. You can do
the RollbackToSavepoint/CommitTransactionCommand/list_truncate before
the break.
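
For example, roughly (a sketch only, using the same variables as in the patch):

for (i = list_length(subxactlist) - 1; i >= 0; i--)
{
	TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));

	if (xid_tmp == subxid)
	{
		RollbackToSavepoint(spname);
		CommitTransactionCommand();
		subxactlist = list_truncate(subxactlist, i + 1);
		break;
	}
}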

~~~

10. pa_lock/unlock _stream/_transaction

+/*
+ * Helper functions to acquire and release a lock for each stream block.
+ *
+ * Set locktag_field4 to 0 to indicate that it's a stream lock.
+ */

+/*
+ * Helper functions to acquire and release a lock for each local transaction.
+ *
+ * Set locktag_field4 to 1 to indicate that it's a transaction lock.

Should constants/defines/enums replace those magic numbers 0 and 1?
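
For example, something like the following (the names here are only a suggestion):

/* Values used in locktag_field4 to distinguish the two lock kinds. */
#define PARALLEL_APPLY_LOCK_STREAM	0
#define PARALLEL_APPLY_LOCK_XACT	1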

~~~

11. pa_lock_transaction

+ * Note that all the callers are passing remote transaction ID instead of local
+ * transaction ID as xid. This is because the local transaction ID will only be
+ * assigned while applying the first change in the parallel apply, but it's
+ * possible that the first change in parallel apply worker is blocked by a
+ * concurrently executing transaction in another parallel apply worker causing
+ * the leader cannot get local transaction ID.

"causing the leader cannot" -> "which means the leader cannot" (??)

======

src/backend/replication/logical/worker.c

12. TransApplyAction

+/*
+ * What action to take for the transaction.
+ *
+ * TRANS_LEADER_APPLY:
+ * The action means that we are in the leader apply worker and changes of the
+ * transaction are applied directly in the worker.
+ *
+ * TRANS_LEADER_SERIALIZE:
+ * It means that we are in the leader apply worker or table sync worker.
+ * Changes are written to temporary files and then applied when the final
+ * commit arrives.
+ *
+ * TRANS_LEADER_SEND_TO_PARALLEL:
+ * The action means that we are in the leader apply worker and need to send the
+ * changes to the parallel apply worker.
+ *
+ * TRANS_PARALLEL_APPLY:
+ * The action that we are in the parallel apply worker and changes of the
+ * transaction are applied directly in the worker.
+ */
+typedef enum

12a
Too many various ways of saying the same thing:

"The action means that we..."
"It means that we..."
"The action that we..." (typo?)

Please word all these comments consistently

~

12b.
"directly in the worker" -> "directly by the worker" (??) 2x

~~~

13. get_worker_name

+/*
+ * Return the name of the logical replication worker.
+ */
+static const char *
+get_worker_name(void)
+{
+ if (am_tablesync_worker())
+ return _("logical replication table synchronization worker");
+ else if (am_parallel_apply_worker())
+ return _("logical replication parallel apply worker");
+ else
+ return _("logical replication apply worker");
+}

This function belongs nearer the top of the module (above all the
error messages that are using it).

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Nov 18, 2022 at 7:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 1:50 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, November 15, 2022 7:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> >
> > I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying send message.
> > So, attach the new version which adds that. Also attach the 0004 patch that
> > restarts logical replication with temporarily disabling the parallel apply if
> > failed to apply a transaction in parallel apply worker.
> >
>
> Few comments on v48-0001
> ======================
>

I have made quite a few changes in the comments, added some new
comments, and made other cosmetic changes in the attached patch. This
is atop v48-0001*. If these look okay to you, please include them in
the next version. Apart from these, I have a few more comments on
v48-0001*

1.
+static bool
+pa_can_start(TransactionId xid)
+{
+ if (!TransactionIdIsValid(xid))
+ return false;

The caller (see caller of pa_start_worker) already has a check that
xid passed here is valid, so I think this should be an Assert unless I
am missing something in which case it is better to add a comment here.

2. Will it be better to rename pa_start_worker() as
pa_allocate_worker() because it sometimes gets the worker from the
pool and also allocate the hash entry for worker info? That will even
match the corresponding pa_free_worker().

3.
+pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
...
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ subxactlist = lappend_xid(subxactlist, current_xid);
+ MemoryContextSwitchTo(oldctx);
...

Why do we need to allocate this list in a permanent context? IIUC, we
need to use this to maintain subxacts so that it can be later used to
find the given subxact at the time of rollback to savepoint in the
current in-progress transaction, so why do we need it beyond the
transaction being applied? If there is a reason for the same, it would
be better to add some comments for the same.

4.
+pa_stream_abort(LogicalRepStreamAbortData *abort_data)
{
...
+
+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+ if (xid_tmp == subxid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);
+ }

I was thinking whether we can have an Assert(false) for the not found
case but it seems if all the changes of a subxact have been skipped
then probably subxid corresponding to "rollback to savepoint" won't be
found in subxactlist and we don't need to do anything for it. If that
is the case, then probably adding a comment for it would be a good
idea, otherwise, we can probably have Assert(false) in the else case.

-- 
With Regards,
Amit Kapila.

Attachment
On Fri, Nov 18, 2022 at 6:03 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for v47-0001
>
> (This review is a WIP - I will post more comments for this patch next week)
>

Here are the rest of my comments for v47-0001

======

doc/src/sgml/monitoring.

1.

@@ -1851,6 +1851,11 @@ postgres   27093  0.0  0.0  30096  2752 ?
 Ss   11:34   0:00 postgres: ser
       <entry>Waiting to acquire an advisory user lock.</entry>
      </row>
      <row>
+      <entry><literal>applytransaction</literal></entry>
+      <entry>Waiting to acquire acquire a lock on a remote transaction being
+      applied on the subscriber side.</entry>
+     </row>
+     <row>

1a.
Typo "acquire acquire"

~

1b.
Maybe "on the subscriber side" does not mean much without any context.
Maybe better to word it as below.

SUGGESTION
Waiting to acquire a lock on a remote transaction being applied by a
logical replication subscriber.

======

doc/src/sgml/system-views.sgml

2.

@@ -1361,8 +1361,9 @@
        <literal>virtualxid</literal>,
        <literal>spectoken</literal>,
        <literal>object</literal>,
-       <literal>userlock</literal>, or
-       <literal>advisory</literal>.
+       <literal>userlock</literal>,
+       <literal>advisory</literal> or
+       <literal>applytransaction</literal>.

This change removed the Oxford comma that was there before. I assume
it was unintended.

======

.../replication/logical/applyparallelworker.c

3. globals

The parallel_apply_XXX functions were all shortened to pa_XXX.

I wondered if the same simplification should be done also to the
global statics...

e.g.
ParallelApplyWorkersHash -> PAWorkerHash
ParallelApplyWorkersList -> PAWorkerList
ParallelApplyMessagePending -> PAMessagePending
etc...

~~~

4. pa_get_free_worker

+ foreach(lc, active_workers)
+ {
+ ParallelApplyWorkerInfo *winfo = NULL;
+
+ winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

No need to assign NULL because the next line just overwrites that anyhow.

~

5.

+ /*
+ * Try to free the worker first, because we don't wait for the rollback
+ * command to finish so the worker may not be freed at the end of the
+ * transaction.
+ */
+ if (pa_free_worker(winfo, winfo->shared->xid))
+ continue;
+
+ if (!winfo->in_use)
+ return winfo;

Shouldn't the (!winfo->in_use) check be done first as well -- e.g. why
are we trying to free a worker which is maybe not even in_use?

SUGGESTION (this will need some comment to explain what it is doing)
if (!winfo->in_use || !pa_free_worker(winfo, winfo->shared->xid) &&
!winfo->in_use)
return winfo;

~~~

6. pa_free_worker

+/*
+ * Remove the parallel apply worker entry from the hash table. Stop the work if
+ * there are enough workers in the pool.
+ *

Typo? "work" -> "worker"

~

7.

+ /* Are there enough workers in the pool? */
+ if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+ {

IMO that comment should be something more like "Don't detach/stop the
worker unless..."

~~~

8. pa_send_data

+ /*
+ * Retry after 1s to reduce the cost of getting the system time and
+ * calculating the time difference.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 1000L,
+ WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);

8a.
I am not sure you need to explain the reason in the comment. Just
saying "Wait before retrying." seems sufficient to me.

~

8b.
Instead of the hardwired "1s" in the comment, and 1000L in the code,
maybe better to just have another constant.

SUGGESTION
#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 10000

~

9.

+ if (startTime == 0)
+ startTime = GetCurrentTimestamp();
+ else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),

IMO the initial startTime should be at top of the function otherwise
the timeout calculation seems wrong.
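
Putting 8b and 9 together, the structure I have in mind for pa_send_data is
roughly as follows (just a sketch; pa_try_send() is only a placeholder for the
existing non-blocking shm_mq send logic):

#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 10000

static bool
pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
{
	TimestampTz startTime = GetCurrentTimestamp();

	for (;;)
	{
		if (pa_try_send(winfo, nbytes, data))	/* placeholder for the nowait send */
			return true;

		/* Give up once the overall timeout has been exceeded. */
		if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
									   SHM_SEND_TIMEOUT_MS))
			return false;

		/* Wait before retrying. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 SHM_SEND_RETRY_INTERVAL_MS,
						 WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}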

======

src/backend/replication/logical/worker.c

10. handle_streamed_transaction

+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to parallel
+ * apply workers (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes
+ * will be applied by both leader apply worker and parallel apply workers).

I'm not sure this function comment should be referring to SUBSTREAM_ON
and SUBSTREAM_PARALLEL because the function body does not use those
anywhere in the logic.

~~~

11. apply_handle_stream_start

+ /*
+ * Increment the number of messages waiting to be processed by
+ * parallel apply worker.
+ */
+ pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
+

The &() parens are not needed. Just write &winfo->shared->pending_message_count.

Also, search/replace others like this -- there are a few of them.

~~~

12. apply_handle_stream_stop

+ if (!abort_toplevel_transaction &&
+ pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_message_count), 1) == 0)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }

That lock/unlock seems like it is done just as a way of
testing/waiting for an exclusive lock held on the xid to be released.
But the code is too tricky -- IMO it needs a big comment saying how
this trick works, or maybe better to have a wrapper function for this
for clarity. e.g. pa_wait_nolock_stream(xid); (or some better name)
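
Something like the following is what I mean (just a sketch built from the
patch's existing helpers, using the name suggested above):

static void
pa_wait_nolock_stream(TransactionId xid)
{
	/*
	 * Briefly acquire and release the stream lock in AccessShareLock mode.
	 * This blocks until the exclusive lock held on the xid by the leader is
	 * released, i.e. until the next streaming block is available.
	 */
	pa_lock_stream(xid, AccessShareLock);
	pa_unlock_stream(xid, AccessShareLock);
}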

~~~

13. apply_handle_stream_abort

+ if (abort_toplevel_transaction)
+ {
+ (void) pa_free_worker(winfo, xid);
+ }

Unnecessary { }

~~~

14. maybe_reread_subscription

@@ -3083,8 +3563,9 @@ maybe_reread_subscription(void)
  if (!newsub)
  {
  ereport(LOG,
- (errmsg("logical replication apply worker for subscription \"%s\" will "
- "stop because the subscription was removed",
+ /* translator: first %s is the name of logical replication worker */
+ (errmsg("%s for subscription \"%s\" will stop because the "
+ "subscription was removed", get_worker_name(),
  MySubscription->name)));

  proc_exit(0);
@@ -3094,8 +3575,9 @@ maybe_reread_subscription(void)
  if (!newsub->enabled)
  {
  ereport(LOG,
- (errmsg("logical replication apply worker for subscription \"%s\" will "
- "stop because the subscription was disabled",
+ /* translator: first %s is the name of logical replication worker */
+ (errmsg("%s for subscription \"%s\" will stop because the "
+ "subscription was disabled", get_worker_name(),
  MySubscription->name)));

IMO better to avoid splitting the string literals over multiple line like this.

Please check the rest of the patch too -- there may be many more just like this.

~~~

15. ApplyWorkerMain

@@ -3726,7 +4236,7 @@ ApplyWorkerMain(Datum main_arg)
  }
  else
  {
- /* This is main apply worker */
+ /* This is leader apply worker */
  RepOriginId originid;
"This is leader" -> "This is the leader"

======

src/bin/psql/describe.c

16. describeSubscriptions

+ if (pset.sversion >= 160000)
+ appendPQExpBuffer(&buf,
+   ", (CASE substream\n"
+   "    WHEN 'f' THEN 'off'\n"
+   "    WHEN 't' THEN 'on'\n"
+   "    WHEN 'p' THEN 'parallel'\n"
+   "   END) AS \"%s\"\n",
+   gettext_noop("Streaming"));
+ else
+ appendPQExpBuffer(&buf,
+   ", substream AS \"%s\"\n",
+   gettext_noop("Streaming"));

I'm not sure it is an improvement to change the output "t/f/p" to
"on/off/parallel".

IMO "t/f/parallel" would be better. Then the t/f is consistent with
- how it used to display, and
- all the other boolean fields

======

src/include/replication/worker_internal.h

17. ParallelTransState

+/*
+ * State of the transaction in parallel apply worker.
+ *
+ * These enum values are ordered by the order of transaction state changes in
+ * parallel apply worker.
+ */
+typedef enum ParallelTransState

"ordered by the order" ??

SUGGESTION
The enum values must have the same order as the transaction state transitions.

======

src/include/storage/lock.h

18.

@@ -149,10 +149,12 @@ typedef enum LockTagType
  LOCKTAG_SPECULATIVE_TOKEN, /* speculative insertion Xid and token */
  LOCKTAG_OBJECT, /* non-relation database object */
  LOCKTAG_USERLOCK, /* reserved for old contrib/userlock code */
- LOCKTAG_ADVISORY /* advisory user locks */
+ LOCKTAG_ADVISORY, /* advisory user locks */
+ LOCKTAG_APPLY_TRANSACTION /* transaction being applied on the subscriber
+ * side */
 } LockTagType;

-#define LOCKTAG_LAST_TYPE LOCKTAG_ADVISORY
+#define LOCKTAG_LAST_TYPE LOCKTAG_APPLY_TRANSACTION

 extern PGDLLIMPORT const char *const LockTagTypeNames[];

@@ -278,6 +280,17 @@ typedef struct LOCKTAG
  (locktag).locktag_type = LOCKTAG_ADVISORY, \
  (locktag).locktag_lockmethodid = USER_LOCKMETHOD)

+/*
+ * ID info for a remote transaction on the subscriber side is:
+ * DB OID + SUBSCRIPTION OID + TRANSACTION ID + OBJID
+ */
+#define SET_LOCKTAG_APPLY_TRANSACTION(locktag,dboid,suboid,xid,objid) \
+ ((locktag).locktag_field1 = (dboid), \
+ (locktag).locktag_field2 = (suboid), \
+ (locktag).locktag_field3 = (xid), \
+ (locktag).locktag_field4 = (objid), \
+ (locktag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
+ (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)

Maybe "on the subscriber side" (2 places above) has no meaning here
because there is no context this is talking about logical replication.
Maybe those comments need to say something more like  "on a logical
replication subscriber"

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Saturday, November 19, 2022 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Nov 18, 2022 at 7:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:50 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, November 15, 2022 7:58 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > >
> > > I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying send
> message.
> > > So, attach the new version which adds that. Also attach the 0004
> > > patch that restarts logical replication with temporarily disabling
> > > the parallel apply if failed to apply a transaction in parallel apply worker.
> > >
> >
> > Few comments on v48-0001

Thanks for the comments !

> > ======================
> >
> 
> I have made quite a few changes in the comments, added some new comments,
> and made other cosmetic changes in the attached patch. The is atop v48-0001*.
> If these look okay to you, please include them in the next version. Apart from
> these, I have a few more comments on
> v48-0001*

Thanks, I have checked and merged them.

> 1.
> +static bool
> +pa_can_start(TransactionId xid)
> +{
> + if (!TransactionIdIsValid(xid))
> + return false;
> 
> The caller (see caller of pa_start_worker) already has a check that xid passed
> here is valid, so I think this should be an Assert unless I am missing something in
> which case it is better to add a comment here.

Changed to an Assert().

> 2. Will it be better to rename pa_start_worker() as
> pa_allocate_worker() because it sometimes gets the worker from the pool and
> also allocate the hash entry for worker info? That will even match the
> corresponding pa_free_worker().

Agreed and changed.

> 3.
> +pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
> {
> ...
> +
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + subxactlist = lappend_xid(subxactlist, current_xid);
> + MemoryContextSwitchTo(oldctx);
> ...
> 
> Why do we need to allocate this list in a permanent context? IIUC, we need to
> use this to maintain subxacts so that it can be later used to find the given
> subxact at the time of rollback to savepoint in the current in-progress
> transaction, so why do we need it beyond the transaction being applied? If
> there is a reason for the same, it would be better to add some comments for
> the same.

I think you are right; I changed it to use TopTransactionContext here.

> 4.
> +pa_stream_abort(LogicalRepStreamAbortData *abort_data)
> {
> ...
> +
> + for (i = list_length(subxactlist) - 1; i >= 0; i--) { TransactionId
> + xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
> +
> + if (xid_tmp == subxid)
> + {
> + found = true;
> + break;
> + }
> + }
> +
> + if (found)
> + {
> + RollbackToSavepoint(spname);
> + CommitTransactionCommand();
> + subxactlist = list_truncate(subxactlist, i + 1); }
> 
> I was thinking whether we can have an Assert(false) for the not found case but it
> seems if all the changes of a subxact have been skipped then probably subxid
> corresponding to "rollback to savepoint" won't be found in subxactlist and we
> don't need to do anything for it. If that is the case, then probably adding a
> comment for it would be a good idea, otherwise, we can probably have
> Assert(false) in the else case.

Yes, we might not find the xid for an empty subtransaction. I added some comments
here for the same.

Apart from the above, I also addressed the comments in [1] and fixed a bug where
a parallel worker exits silently and the leader cannot detect that. In the
latest patch, the parallel apply worker will send a notify ('X') message to the
leader so that the leader can detect the exit.

Here is the new version patch.

[1] https://www.postgresql.org/message-id/CAA4eK1KWgReYbpwEMh1H1ohHoYirv4Aa%3D6v13MutCF9NvHTc5A%40mail.gmail.com

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, November 18, 2022 8:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Here are review comments on v47-0001 and v47-0002 patches:

Thanks for the comments!

> When the parallel apply worker exited, I got the following server log.
> I think this log is not appropriate since the worker was not terminated by
> administrator command but exited by itself. Also, probably it should exit with
> exit code 0?
> 
> FATAL:  terminating logical replication worker due to administrator command
> LOG:  background worker "logical replication parallel worker" (PID
> 3594918) exited with exit code 1

Changed to report a LOG and exit with exit code 0.

> ---
> /*
>  * Stop the worker if there are enough workers in the pool or the leader
>  * apply worker serialized part of the transaction data to a file due to
>  * send timeout.
>  */
> if (winfo->serialize_changes ||
> napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> 
> Why do we need to stop the worker if the leader serializes changes?

Because there might be a partially sent message left in the memory queue if the
send times out. We need to either re-send the same message until it succeeds or
detach from the memory queue. To keep the logic simple, the patch directly stops
the worker in this case.


> ---
> +        /*
> +         * Release all session level locks that could be held in parallel apply
> +         * mode.
> +         */
> +        LockReleaseAll(DEFAULT_LOCKMETHOD, true);
> +
> 
> I think we call LockReleaseAll() at the process exit (in ProcKill()), but do we
> really need to do LockReleaseAll() here too?

If we don't release locks before ProcKill, we might break an Assert check at
the beginning of ProcKill which is used to ensure all the locks are released.
Also, it seems ProcKill doesn't release session-level locks after the assert
check. So I think we'd better release them here.
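
To illustrate the ordering being described (a sketch only; where exactly this
runs in the worker's exit path is not important here):

/*
 * On parallel apply worker exit, release session-level locks taken in
 * parallel apply mode before ProcKill() runs, since ProcKill() asserts that
 * no locks remain but does not release session-level locks itself at that
 * point.
 */
LockReleaseAll(DEFAULT_LOCKMETHOD, true);

proc_exit(0);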

> ---
> 
> +                elog(ERROR, "could not find replication state slot
> for replication"
> +                         "origin with OID %u which was acquired by
> %d", node, acquired_by);
> 
> Let's not break the error log message in the middle so that the user can search
> the message by grep easily.

Changed.

> ---
> +        {
> +                {"max_parallel_apply_workers_per_subscription",
> +                        PGC_SIGHUP,
> +                        REPLICATION_SUBSCRIBERS,
> +                        gettext_noop("Maximum number of parallel
> apply workers per subscription."),
> +                        NULL,
> +                },
> +                &max_parallel_apply_workers_per_subscription,
> +                2, 0, MAX_BACKENDS,
> +                NULL, NULL, NULL
> +        },
> +
> 
> I think we should use MAX_PARALLEL_WORKER_LIMIT as the max value instead.
> MAX_BACKENDS is too high.

Changed.

> ---
> +        /*
> +         * Indicates whether there are pending messages in the queue.
> The parallel
> +         * apply worker will check it before starting to wait.
> +         */
> +        pg_atomic_uint32       pending_message_count;
> 
> The "pending messages" sounds like individual logical replication messages
> such as LOGICAL_REP_MSG_INSERT. But IIUC what this value actually shows is
> how many streamed chunks are pending to process, right?

Yes, renamed this.

> ---
> When the parallel apply worker raises an error, I got the same error twice from
> the leader worker and parallel worker as follows. Can we suppress either one?
> 
> 2022-11-17 17:30:23.490 JST [3814552] LOG:  logical replication parallel apply
> worker for subscription "test_sub1" has started
> 2022-11-17 17:30:23.490 JST [3814552] ERROR:  duplicate key value violates
> unique constraint "test1_c_idx"
> 2022-11-17 17:30:23.490 JST [3814552] DETAIL:  Key (c)=(1) already exists.
> 2022-11-17 17:30:23.490 JST [3814552] CONTEXT:  processing remote data for
> replication origin "pg_16390" during message type "INSERT" for replication
> target relatio n "public.test1" in transaction 731
> 2022-11-17 17:30:23.490 JST [3814550] ERROR:  duplicate key value violates
> unique constraint "test1_c_idx"
> 2022-11-17 17:30:23.490 JST [3814550] DETAIL:  Key (c)=(1) already exists.
> 2022-11-17 17:30:23.490 JST [3814550] CONTEXT:  processing remote data for
> replication origin "pg_16390" during message type "INSERT" for replication
> target relatio n "public.test1" in transaction 731
>         parallel apply worker

It seems similar to the behavior of a parallel query, which will report the same
error twice. But I agree it might be better for the leader to report something
different. So, I changed the error message reported by the leader in the new
version patch.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On  Monday, November 21, 2022 2:26 PM Peter Smith <smithpb2250@gmail.com> wrote:
> On Fri, Nov 18, 2022 at 6:03 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > Here are some review comments for v47-0001
> >
> > (This review is a WIP - I will post more comments for this patch next
> > week)
> >
> 
> Here are the rest of my comments for v47-0001

Thanks for the comments!

> ======
> 
> doc/src/sgml/monitoring.
> 
> 1.
> 
> @@ -1851,6 +1851,11 @@ postgres   27093  0.0  0.0  30096  2752 ?
>  Ss   11:34   0:00 postgres: ser
>        <entry>Waiting to acquire an advisory user lock.</entry>
>       </row>
>       <row>
> +      <entry><literal>applytransaction</literal></entry>
> +      <entry>Waiting to acquire acquire a lock on a remote transaction being
> +      applied on the subscriber side.</entry>
> +     </row>
> +     <row>
> 
> 1a.
> Typo "acquire acquire"

Fixed.

> ~
> 
> 1b.
> Maybe "on the subscriber side" does not mean much without any context.
> Maybe better to word it as below.
> 
> SUGGESTION
> Waiting to acquire a lock on a remote transaction being applied by a logical
> replication subscriber.

Changed.

> ======
> 
> doc/src/sgml/system-views.sgml
> 
> 2.
> 
> @@ -1361,8 +1361,9 @@
>         <literal>virtualxid</literal>,
>         <literal>spectoken</literal>,
>         <literal>object</literal>,
> -       <literal>userlock</literal>, or
> -       <literal>advisory</literal>.
> +       <literal>userlock</literal>,
> +       <literal>advisory</literal> or
> +       <literal>applytransaction</literal>.
> 
> This change removed the Oxford comma that was there before. I assume it was
> unintended.

Changed.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 3. globals
> 
> The parallel_apply_XXX functions were all shortened to pa_XXX.
> 
> I wondered if the same simplification should be done also to the global
> statics...
> 
> e.g.
> ParallelApplyWorkersHash -> PAWorkerHash ParallelApplyWorkersList ->
> PAWorkerList ParallelApplyMessagePending -> PAMessagePending etc...

I personally feel these names look fine as they are.

> ~~~
> 
> 4. pa_get_free_worker
> 
> + foreach(lc, active_workers)
> + {
> + ParallelApplyWorkerInfo *winfo = NULL;
> +
> + winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
> 
> No need to assign NULL because the next line just overwrites that anyhow.

Changed.

> ~
> 
> 5.
> 
> + /*
> + * Try to free the worker first, because we don't wait for the rollback
> + * command to finish so the worker may not be freed at the end of the
> + * transaction.
> + */
> + if (pa_free_worker(winfo, winfo->shared->xid)) continue;
> +
> + if (!winfo->in_use)
> + return winfo;
> 
> Shouldn't the (!winfo->in_use) check be done first as well -- e.g. why are we
> trying to free a worker which is maybe not even in_use?
> 
> SUGGESTION (this will need some comment to explain what it is doing) if
> (!winfo->in_use || !pa_free_worker(winfo, winfo->shared->xid) &&
> !winfo->in_use)
> return winfo;

Since pa_free_worker checks the in_use flag as well, and the current style
looks clean to me, I didn't change this.

But it seems we need to first call pa_free_worker for every worker and only
then choose a free one, otherwise a stopped worker's info (shared memory, etc.)
might be left around for a long time. I will think about this and try to fix it
in the next version.

> ~~~
> 
> 6. pa_free_worker
> 
> +/*
> + * Remove the parallel apply worker entry from the hash table. Stop the
> +work if
> + * there are enough workers in the pool.
> + *
> 
> Typo? "work" -> "worker"
> 

Fixed.

> 
> 7.
> 
> + /* Are there enough workers in the pool? */ if (napplyworkers >
> + (max_parallel_apply_workers_per_subscription / 2)) {
> 
> IMO that comment should be something more like "Don't detach/stop the
> worker unless..."
> 

Improved.

> 
> 8. pa_send_data
> 
> + /*
> + * Retry after 1s to reduce the cost of getting the system time and
> + * calculating the time difference.
> + */
> + (void) WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 1000L,
> + WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
> 
> 8a.
> I am not sure you need to explain the reason in the comment. Just saying "Wait
> before retrying." seems sufficient to me.

Changed.

> ~
> 
> 8b.
> Instead of the hardwired "1s" in the comment, and 1000L in the code, maybe
> better to just have another constant.
> 
> SUGGESTION
> #define SHM_SEND_RETRY_INTERVAL_MS 1000
> #define SHM_SEND_TIMEOUT_MS 10000

Changed.

> ~
> 
> 9.
> 
> + if (startTime == 0)
> + startTime = GetCurrentTimestamp();
> + else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
> 
> IMO the initial startTime should be at top of the function otherwise the timeout
> calculation seems wrong.

Setting startTime at the beginning would add unnecessary cost when we don't
need to retry, and starting the count from the first failure looks fine to me.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 10. handle_streamed_transaction
> 
> + * In streaming case (receiving a block of streamed transaction), for
> + * SUBSTREAM_ON mode, simply redirect it to a file for the proper
> + toplevel
> + * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to
> + parallel
> + * apply workers (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE
> + changes
> + * will be applied by both leader apply worker and parallel apply workers).
> 
> I'm not sure this function comment should be referring to SUBSTREAM_ON
> and SUBSTREAM_PARALLEL because the function body does not use those
> anywhere in the logic.

Improved.

> ~~~
> 
> 11. apply_handle_stream_start
> 
> + /*
> + * Increment the number of messages waiting to be processed by
> + * parallel apply worker.
> + */
> + pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
> +
> 
> The &() parens are not needed. Just write
> &winfo->shared->pending_message_count.
> 
> Also, search/replace others like this -- there are a few of them.

Changed.

> ~~~
> 
> 12. apply_handle_stream_stop
> 
> + if (!abort_toplevel_transaction &&
> + pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_message_count),
> 1)
> + == 0) { pa_lock_stream(MyParallelShared->xid, AccessShareLock);
> + pa_unlock_stream(MyParallelShared->xid, AccessShareLock); }
> 
> That lock/unlock seems like it is done just as a way of testing/waiting for an
> exclusive lock held on the xid to be released.
> But the code is too tricky -- IMO it needs a big comment saying how this trick
> works, or maybe better to have a wrapper function for this for clarity. e.g.
> pa_wait_nolock_stream(xid); (or some better name)

I think the comments atop applyparallelworker.c already explain the usage of
the stream/transaction lock.

```
...
* In order for lmgr to detect this, we have LA acquire a session lock on the
 * remote transaction (by pa_lock_stream()) and have PA wait on the lock before
 * trying to receive messages. In other words, LA acquires the lock before
 * sending STREAM_STOP and releases it if already acquired before sending
 * STREAM_START, STREAM_ABORT(for toplevel transaction), STREAM_PREPARE and
 * STREAM_COMMIT. For PA, it always needs to acquire the lock after processing
 * STREAM_STOP and then release immediately after acquiring it. That way, when
 * PA is waiting for LA, we can have a wait-edge from PA to LA in lmgr, which
 * will make a deadlock in lmgr like:
...
```
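
Put as a sketch (the helper calls on the sending side and the leader's lock mode are assumptions, not quotes from the patch), the handshake described there is:

```
/* Leader apply worker (LA), around one streamed chunk of transaction xid: */
pa_lock_stream(xid, AccessExclusiveLock);	/* assumed lock mode */
/* ... send STREAM_STOP for this chunk ... */

/* ... later, before sending STREAM_START / STREAM_COMMIT etc.: */
pa_unlock_stream(xid, AccessExclusiveLock);

/*
 * Parallel apply worker (PA), after processing STREAM_STOP: block until LA
 * releases the lock, creating the PA -> LA wait-edge that lmgr can see.
 */
pa_lock_stream(MyParallelShared->xid, AccessShareLock);
pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
```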

> ~~~
> 
> 13. apply_handle_stream_abort
> 
> + if (abort_toplevel_transaction)
> + {
> + (void) pa_free_worker(winfo, xid);
> + }
> 
> Unnecessary { }

Removed.

> ~~~
> 
> 14. maybe_reread_subscription
> 
> @@ -3083,8 +3563,9 @@ maybe_reread_subscription(void)
>   if (!newsub)
>   {
>   ereport(LOG,
> - (errmsg("logical replication apply worker for subscription \"%s\" will "
> - "stop because the subscription was removed",
> + /* translator: first %s is the name of logical replication worker */
> + (errmsg("%s for subscription \"%s\" will stop because the "
> + "subscription was removed", get_worker_name(),
>   MySubscription->name)));
> 
>   proc_exit(0);
> @@ -3094,8 +3575,9 @@ maybe_reread_subscription(void)
>   if (!newsub->enabled)
>   {
>   ereport(LOG,
> - (errmsg("logical replication apply worker for subscription \"%s\" will "
> - "stop because the subscription was disabled",
> + /* translator: first %s is the name of logical replication worker */
> + (errmsg("%s for subscription \"%s\" will stop because the "
> + "subscription was disabled", get_worker_name(),
>   MySubscription->name)));
> 
> IMO better to avoid splitting the string literals over multiple line like this.
> 
> Please check the rest of the patch too -- there may be many more just like this.

Changed.

> ~~~
> 
> 15. ApplyWorkerMain
> 
> @@ -3726,7 +4236,7 @@ ApplyWorkerMain(Datum main_arg)
>   }
>   else
>   {
> - /* This is main apply worker */
> + /* This is leader apply worker */
>   RepOriginId originid;
> "This is leader" -> "This is the leader"

Changed.

> ======
> 
> src/bin/psql/describe.c
> 
> 16. describeSubscriptions
> 
> + if (pset.sversion >= 160000)
> + appendPQExpBuffer(&buf,
> +   ", (CASE substream\n"
> +   "    WHEN 'f' THEN 'off'\n"
> +   "    WHEN 't' THEN 'on'\n"
> +   "    WHEN 'p' THEN 'parallel'\n"
> +   "   END) AS \"%s\"\n",
> +   gettext_noop("Streaming"));
> + else
> + appendPQExpBuffer(&buf,
> +   ", substream AS \"%s\"\n",
> +   gettext_noop("Streaming"));
> 
> I'm not sure it is an improvement to change the output "t/f/p" to
> "on/off/parallel"
> 
> IMO "t/f/parallel" would be better. Then the t/f is consistent with
> - how it used to display, and
> - all the other boolean fields

I think the current style is consistent with the "Synchronous commit" parameter,
which also shows "on/off/remote_apply/...", so I didn't change this.

Name | ... | Synchronous commit
------+-----+-------------------
sub  | ... | on    

> ======
> 
> src/include/replication/worker_internal.h
> 
> 17. ParallelTransState
> 
> +/*
> + * State of the transaction in parallel apply worker.
> + *
> + * These enum values are ordered by the order of transaction state
> +changes in
> + * parallel apply worker.
> + */
> +typedef enum ParallelTransState
> 
> "ordered by the order" ??
> 
> SUGGESTION
> The enum values must have the same order as the transaction state transitions.

Changed.

> ======
> 
> src/include/storage/lock.h
> 
> 18.
> 
> @@ -149,10 +149,12 @@ typedef enum LockTagType
>   LOCKTAG_SPECULATIVE_TOKEN, /* speculative insertion Xid and token */
>   LOCKTAG_OBJECT, /* non-relation database object */
>   LOCKTAG_USERLOCK, /* reserved for old contrib/userlock code */
> - LOCKTAG_ADVISORY /* advisory user locks */
> + LOCKTAG_ADVISORY, /* advisory user locks */
> LOCKTAG_APPLY_TRANSACTION
> + /* transaction being applied on the subscriber
> + * side */
>  } LockTagType;
> 
> -#define LOCKTAG_LAST_TYPE LOCKTAG_ADVISORY
> +#define LOCKTAG_LAST_TYPE LOCKTAG_APPLY_TRANSACTION
> 
>  extern PGDLLIMPORT const char *const LockTagTypeNames[];
> 
> @@ -278,6 +280,17 @@ typedef struct LOCKTAG
>   (locktag).locktag_type = LOCKTAG_ADVISORY, \
>   (locktag).locktag_lockmethodid = USER_LOCKMETHOD)
> 
> +/*
> + * ID info for a remote transaction on the subscriber side is:
> + * DB OID + SUBSCRIPTION OID + TRANSACTION ID + OBJID  */ #define
> +SET_LOCKTAG_APPLY_TRANSACTION(locktag,dboid,suboid,xid,objid) \
> + ((locktag).locktag_field1 = (dboid), \
> + (locktag).locktag_field2 = (suboid), \
> + (locktag).locktag_field3 = (xid), \
> + (locktag).locktag_field4 = (objid), \
> + (locktag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
> +(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
> 
> Maybe "on the subscriber side" (2 places above) has no meaning here because
> there is no context this is talking about logical replication.
> Maybe those comments need to say something more like  "on a logical
> replication subscriber"
> 
Changed.

I also addressed all the comments from [1].

[1] https://www.postgresql.org/message-id/CAHut%2BPs7TzqqDnuH8r_ct1W_zSBCnuo3wodMt4Y8_Gw7rSRAaw%40mail.gmail.com

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, November 21, 2022 8:34 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Saturday, November 19, 2022 6:49 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 18, 2022 at 7:56 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 1:50 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, November 15, 2022 7:58 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying
> > > > send
> > message.
> > > > So, attach the new version which adds that. Also attach the 0004
> > > > patch that restarts logical replication with temporarily disabling
> > > > the parallel apply if failed to apply a transaction in parallel apply worker.
> > > >
> > >
> > > Few comments on v48-0001
> 
> Thanks for the comments !
> 
> > > ======================
> > >
> >
> > I have made quite a few changes in the comments, added some new
> > comments, and made other cosmetic changes in the attached patch. The is
> atop v48-0001*.
> > If these look okay to you, please include them in the next version.
> > Apart from these, I have a few more comments on
> > v48-0001*
> 
> Thanks, I have checked and merge them.
> 
> > 1.
> > +static bool
> > +pa_can_start(TransactionId xid)
> > +{
> > + if (!TransactionIdIsValid(xid))
> > + return false;
> >
> > The caller (see caller of pa_start_worker) already has a check that
> > xid passed here is valid, so I think this should be an Assert unless I
> > am missing something in which case it is better to add a comment here.
> 
> Changed to an Assert().
> 
> > 2. Will it be better to rename pa_start_worker() as
> > pa_allocate_worker() because it sometimes gets the worker from the
> > pool and also allocate the hash entry for worker info? That will even
> > match the corresponding pa_free_worker().
> 
> Agreed and changed.
> 
> > 3.
> > +pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
> > {
> > ...
> > +
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + subxactlist = lappend_xid(subxactlist, current_xid);
> > + MemoryContextSwitchTo(oldctx);
> > ...
> >
> > Why do we need to allocate this list in a permanent context? IIUC, we
> > need to use this to maintain subxacts so that it can be later used to
> > find the given subxact at the time of rollback to savepoint in the
> > current in-progress transaction, so why do we need it beyond the
> > transaction being applied? If there is a reason for the same, it would
> > be better to add some comments for the same.
> 
> I think you are right, I changed to use TopTransactionContext here.
> 
> > 4.
> > +pa_stream_abort(LogicalRepStreamAbortData *abort_data)
> > {
> > ...
> > +
> > + for (i = list_length(subxactlist) - 1; i >= 0; i--) { TransactionId
> > + xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
> > +
> > + if (xid_tmp == subxid)
> > + {
> > + found = true;
> > + break;
> > + }
> > + }
> > +
> > + if (found)
> > + {
> > + RollbackToSavepoint(spname);
> > + CommitTransactionCommand();
> > + subxactlist = list_truncate(subxactlist, i + 1); }
> >
> > I was thinking whether we can have an Assert(false) for the not found
> > case but it seems if all the changes of a subxact have been skipped
> > then probably subxid corresponding to "rollback to savepoint" won't be
> > found in subxactlist and we don't need to do anything for it. If that
> > is the case, then probably adding a comment for it would be a good
> > idea, otherwise, we can probably have
> > Assert(false) in the else case.
> 
> Yes, we might not find the xid for an empty subtransaction. I added some
> comments here for the same.
> 
> Apart from above, I also addressed the comments in [1] and fixed a bug that
> parallel worker exits silently while the leader cannot detect that. In the latest
> patch, the parallel apply worker will send a notify('X') message to leader so that
> leader can detect the exit.
> 
> Here is the new version patch.

I noticed that I missed a header file, causing CFbot to complain.
Attached is a new version of the patch set which fixes that.

Best regards,
Hou zj



Attachment
Thanks for addressing my review comments on v47-0001.

Here are my review comments for v49-0001.

======

src/backend/replication/logical/applyparallelworker.c

1. GENERAL - NULL checks

There is inconsistent NULL checking in the patch.

Sometimes it is like (!winfo)
Sometimes explicit NULL checks like  (winfo->mq_handle != NULL)

(That is just one example -- there are differences in many places)

It would be better to use a consistent style everywhere.

~

2. GENERAL - Error message worker name

2a.
In worker.c all the messages are now "logical replication apply
worker" or "logical replication parallel apply worker" etc, but in the
applyparallel.c sometimes the "logical replication" part is missing.
IMO all the messages in this patch should be consistently worded.

I've reported some of them in the following comment below, but please
search the whole patch for any I might have missed.

2b.
Consider if maybe all of these ought to be calling get_worker_name()
which is currently static in worker.c. Doing this means any future
changes to get_worker_name won't cause more inconsistencies.

~~~

3. File header comment

+ * IDENTIFICATION src/backend/replication/logical/applyparallelworker.c

The word "IDENTIFICATION" should be on a separate line (for
consistency with every other PG source file)

~

4.

+ * In order for lmgr to detect this, we have LA acquire a session lock on the
+ * remote transaction (by pa_lock_stream()) and have PA wait on the lock before
+ * trying to receive messages. In other words, LA acquires the lock before
+ * sending STREAM_STOP and releases it if already acquired before sending
+ * STREAM_START, STREAM_ABORT(for toplevel transaction), STREAM_PREPARE and
+ * STREAM_COMMIT. For PA, it always needs to acquire the lock after processing
+ * STREAM_STOP and STREAM_ABORT(for subtransaction) and then release
+ * immediately after acquiring it. That way, when PA is waiting for LA, we can
+ * have a wait-edge from PA to LA in lmgr, which will make a deadlock in lmgr
+ * like:

Missing spaces before '(' deliberate?

~~~

5. globals

+/*
+ * Is there a message sent by parallel apply worker which the leader apply
+ * worker need to receive?
+ */
+volatile sig_atomic_t ParallelApplyMessagePending = false;

SUGGESTION
Is there a message sent by a parallel apply worker that the leader
apply worker needs to receive?

~~~

6. pa_get_available_worker

+/*
+ * get an available parallel apply worker from the worker pool.
+ */
+static ParallelApplyWorkerInfo *
+pa_get_available_worker(void)

Uppercase comment

~

7.

+ /*
+ * We first try to free the worker to improve our chances of getting
+ * the worker. Normally, we free the worker after ensuring that the
+ * transaction is committed by parallel worker but for rollbacks, we
+ * don't wait for the transaction to finish so can't free the worker
+ * information immediately.
+ */

7a.
"We first try to free the worker to improve our chances of getting the worker."

SUGGESTION
We first try to free the worker to improve our chances of finding one
that is not in use.

~

7b.
"parallel worker" -> "the parallel worker"

~~~

8. pa_allocate_worker

+ /* Try to get a free parallel apply worker. */
+ winfo = pa_get_available_worker();
+

SUGGESTION
First, try to get a parallel apply worker from the pool.

~~~

9. pa_free_worker

+ * This removes the parallel apply worker entry from the hash table so that it
+ * can't be used. This either stops the worker and free the corresponding info,
+ * if there are enough workers in the pool or just marks it available for
+ * reuse.

BEFORE
This either stops the worker and free the corresponding info, if there
are enough workers in the pool or just marks it available for reuse.

SUGGESTION
If there are enough workers in the pool it stops the worker and frees
the corresponding info, otherwise it just marks the worker as
available for reuse.

~

10.

+ /* Free the corresponding info if the worker exited cleanly. */
+ if (winfo->error_mq_handle == NULL)
+ {
+ pa_free_worker_info(winfo);
+ return true;
+ }

Is it correct that this bypasses the removal from the hash table?

~

11.

+
+ /* Worker is already available for reuse. */
+ if (!winfo->in_use)
+ return false;

Should this quick-exit check for in_use come first?

~~

12. HandleParallelApplyMessage

+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("parallel apply worker exited abnormally"),
+ errcontext("%s", edata.context)));

Maybe "parallel apply worker" -> "logical replication parallel apply
worker" (for consistency with the other error messages)

~

13.


+ default:
+ elog(ERROR, "unrecognized message type received from parallel apply
worker: %c (message length %d bytes)",
+ msgtype, msg->len);
+ }

ditto #12 above.

~

14.

+ case 'X': /* Terminate, indicating clean exit. */
+ {
+ shm_mq_detach(winfo->error_mq_handle);
+ winfo->error_mq_handle = NULL;
+ break;
+ }
+ default:


No need for the { } here.

~~~

15. HandleParallelApplyMessage

+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("lost connection to the parallel apply worker")));
+ }

"parallel apply worker" -> "logical replication parallel apply worker"

~~~

16. pa_init_and_launch_worker

+ /* Setup shared memory. */
+ if (!pa_setup_dsm(winfo))
+ {
+ MemoryContextSwitchTo(oldcontext);
+ pfree(winfo);
+ return NULL;
+ }


Wouldn't it be better to do the pfree before switching back to the oldcontext?

~~~

17. pa_send_data

+ /* Wait before retrying. */
+ rc = WaitLatch(MyLatch,
+    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+    SHM_SEND_RETRY_INTERVAL_MS,
+    WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
+
+ if (rc & WL_LATCH_SET)
+ {
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }


Instead of CHECK_FOR_INTERRUPTS, should this be calling your new
function ProcessParallelApplyInterrupts?

~

18.

+ if (startTime == 0)
+ startTime = GetCurrentTimestamp();
+ else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
+ SHM_SEND_TIMEOUT_MS))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("terminating logical replication parallel apply worker due to
timeout")));


I'd previously commented that the timeout calculation seemed wrong.
Hou-san replied [1,#9] "start counting from the first failure looks
fine to me." but I am not so sure - e.g. If the timeout is 10s then I
expect it to fail ~10s after the function is called, not 11s after. I
know it's pedantic, but where's the harm in making the calculation
right instead of just nearly right?

IMO probably an easy fix for this is like:

#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS (10000 - SHM_SEND_RETRY_INTERVAL_MS)

~~~

19. pa_wait_for_xact_state

+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();

Instead of CHECK_FOR_INTERRUPTS, should this be calling your new
function ProcessParallelApplyInterrupts?

~~~

20. pa_savepoint_name

+static void
+pa_savepoint_name(Oid suboid, TransactionId xid, char *spname,
+   Size szsp)

Unnecessary wrapping?

======

src/backend/replication/logical/origin.c

21. replorigin_session_setup

+ * However, we do allow multiple processes to point to the same origin slot
+ * if requested by the caller by passing PID of the process that has already
+ * acquired it. This is to allow using the same origin by multiple parallel
+ * apply processes the provided they maintain commit order, for example, by
+ * allowing only one process to commit at a time.

21a.
I thought the comment should mention this is optional and the special
value acquired_by=0 means don't do this.

~

21b.
"the provided they" ?? typo?

======

src/backend/replication/logical/tablesync.c

22. process_syncing_tables

 process_syncing_tables(XLogRecPtr current_lsn)
 {
+ /*
+ * Skip for parallel apply workers as they don't operate on tables that
+ * are not in ready state. See pa_can_start() and
+ * should_apply_changes_for_rel().
+ */
+ if (am_parallel_apply_worker())
+ return;

SUGGESTION (remove the double negative)
Skip for parallel apply workers because they only operate on tables
that are in a READY state. See pa_can_start() and
should_apply_changes_for_rel().

======

src/backend/replication/logical/worker.c

23. apply_handle_stream_stop


Previously I suggested that this lock/unlock seems too tricky and
needed a comment. The reply [1,#12] was that this is already described
atop parallelapplyworker.c. OK, but in that case maybe here the
comment can just refer to that explanation:

SUGGESTION
Refer to the comments atop applyparallelworker.c for what this lock
and immediate unlock is doing.

~~~

24. apply_handle_stream_abort

+ if (pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_stream_count),
1) == 0)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }

ditto comment #23

~~~

25. apply_worker_clean_exit

+void
+apply_worker_clean_exit(void)
+{
+ /* Notify the leader apply worker that we have exited cleanly. */
+ if (am_parallel_apply_worker())
+ pq_putmessage('X', NULL, 0);
+
+ proc_exit(0);
+}

Somehow it doesn't seem right that the PA worker sending 'X' is here
in worker.c, while the LA worker receipt of this 'X' is in the other
applyparallelworker.c module. Maybe that other function
HandleParallelApplyMessage should also be here in worker.c?

======

src/backend/utils/misc/guc_tables.c

26.

@@ -2957,6 +2957,18 @@ struct config_int ConfigureNamesInt[] =
  NULL,
  },
  &max_sync_workers_per_subscription,
+ 2, 0, MAX_PARALLEL_WORKER_LIMIT,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"max_parallel_apply_workers_per_subscription",
+ PGC_SIGHUP,
+ REPLICATION_SUBSCRIBERS,
+ gettext_noop("Maximum number of parallel apply workers per subscription."),
+ NULL,
+ },
+ &max_parallel_apply_workers_per_subscription,
  2, 0, MAX_BACKENDS,
  NULL, NULL, NULL

Is this correct? Did you mean to change
max_sync_workers_per_subscription? My first impression is that there has
been some mixup with the MAX_PARALLEL_WORKER_LIMIT and MAX_BACKENDS or
that this change was accidentally made to the wrong GUC.

======

src/include/replication/worker_internal.h

27. ParallelApplyWorkerShared

+ /*
+ * Indicates whether there are pending streaming blocks in the queue. The
+ * parallel apply worker will check it before starting to wait.
+ */
+ pg_atomic_uint32 pending_stream_count;

A better name might be 'n_pending_stream_blocks'.

~

28. function names

 extern void logicalrep_worker_stop(Oid subid, Oid relid);
+extern void logicalrep_parallel_apply_worker_stop(int slot_no, uint16
generation);
 extern void logicalrep_worker_wakeup(Oid subid, Oid relid);
 extern void logicalrep_worker_wakeup_ptr(LogicalRepWorker *worker);

 extern int logicalrep_sync_worker_count(Oid subid);
+extern int logicalrep_parallel_apply_worker_count(Oid subid);

Would it be better to call those new functions using similar shorter
names as done elsewhere?

logicalrep_parallel_apply_worker_stop -> logicalrep_pa_worker_stop
logicalrep_parallel_apply_worker_count -> logicalrep_pa_worker_count

------
[1] Hou-san's reply to my review v47-0001.
https://www.postgresql.org/message-id/OS0PR01MB571680391393F3CB63469F3E940A9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tues, November 22, 2022 13:20 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Thanks for addressing my review comments on v47-0001.
> 
> Here are my review comments for v49-0001.

Thanks for your comments.

> ======
> 
> src/backend/replication/logical/applyparallelworker.c
> 
> 1. GENERAL - NULL checks
> 
> There is inconsistent NULL checking in the patch.
> 
> Sometimes it is like (!winfo)
> Sometimes explicit NULL checks like  (winfo->mq_handle != NULL)
> 
> (That is just one example -- there are differences in many places)
> 
> It would be better to use a consistent style everywhere.

Changed.

> ~
> 
> 2. GENERAL - Error message worker name
> 
> 2a.
> In worker.c all the messages are now "logical replication apply 
> worker" or "logical replication parallel apply worker" etc, but in the 
> applyparallel.c sometimes the "logical replication" part is missing.
> IMO all the messages in this patch should be consistently worded.
> 
> I've reported some of them in the following comment below, but please 
> search the whole patch for any I might have missed.

Renamed LA and PA to the following styles:
```
LA -> logical replication apply worker
PA -> logical replication parallel apply worker
```

> 2b.
> Consider if maybe all of these ought to be calling get_worker_name() 
> which is currently static in worker.c. Doing this means any future 
> changes to get_worker_name won't cause more inconsistencies.

Most error messages in applyparallelworker.c can only use the "parallel apply
worker" wording, so I think it's fine not to call get_worker_name() there.

> ~~~
> 
> 3. File header comment
> 
> + * IDENTIFICATION 
> + src/backend/replication/logical/applyparallelworker.c
> 
> The word "IDENTIFICATION" should be on a separate line (for 
> consistency with every other PG source file)

Fixed.

> ~
> 
> 4.
> 
> + * In order for lmgr to detect this, we have LA acquire a session 
> + lock on the
> + * remote transaction (by pa_lock_stream()) and have PA wait on the 
> + lock
> before
> + * trying to receive messages. In other words, LA acquires the lock 
> + before
> + * sending STREAM_STOP and releases it if already acquired before 
> + sending
> + * STREAM_START, STREAM_ABORT(for toplevel transaction),
> STREAM_PREPARE and
> + * STREAM_COMMIT. For PA, it always needs to acquire the lock after
> processing
> + * STREAM_STOP and STREAM_ABORT(for subtransaction) and then release
> + * immediately after acquiring it. That way, when PA is waiting for 
> + LA, we can
> + * have a wait-edge from PA to LA in lmgr, which will make a deadlock 
> + in lmgr
> + * like:
> 
> Missing spaces before '(' deliberate?

Added.

> ~~~
> 
> 5. globals
> 
> +/*
> + * Is there a message sent by parallel apply worker which the leader 
> +apply
> + * worker need to receive?
> + */
> +volatile sig_atomic_t ParallelApplyMessagePending = false;
> 
> SUGGESTION
> Is there a message sent by a parallel apply worker that the leader 
> apply worker needs to receive?

Changed.

> ~~~
> 
> 6. pa_get_available_worker
> 
> +/*
> + * get an available parallel apply worker from the worker pool.
> + */
> +static ParallelApplyWorkerInfo *
> +pa_get_available_worker(void)
> 
> Uppercase comment

Changed.

> ~
> 
> 7.
> 
> + /*
> + * We first try to free the worker to improve our chances of getting
> + * the worker. Normally, we free the worker after ensuring that the
> + * transaction is committed by parallel worker but for rollbacks, we
> + * don't wait for the transaction to finish so can't free the worker
> + * information immediately.
> + */
> 
> 7a.
> "We first try to free the worker to improve our chances of getting the worker."
> 
> SUGGESTION
> We first try to free the worker to improve our chances of finding one 
> that is not in use.
> 
> ~
> 
> 7b.
> "parallel worker" -> "the parallel worker"

Changed.

> ~~~
> 
> 8. pa_allocate_worker
> 
> + /* Try to get a free parallel apply worker. */ winfo = 
> + pa_get_available_worker();
> +
> 
> SUGGESTION
> First, try to get a parallel apply worker from the pool.

Changed.

> ~~~
> 
> 9. pa_free_worker
> 
> + * This removes the parallel apply worker entry from the hash table 
> + so that it
> + * can't be used. This either stops the worker and free the 
> + corresponding info,
> + * if there are enough workers in the pool or just marks it available 
> + for
> + * reuse.
> 
> BEFORE
> This either stops the worker and free the corresponding info, if there 
> are enough workers in the pool or just marks it available for reuse.
> 
> SUGGESTION
> If there are enough workers in the pool it stops the worker and frees 
> the corresponding info, otherwise it just marks the worker as 
> available for reuse.

Changed.

> ~
> 
> 10.
> 
> + /* Free the corresponding info if the worker exited cleanly. */ if 
> + (winfo->error_mq_handle == NULL) { pa_free_worker_info(winfo); 
> + return true; }
> 
> Is it correct that this bypasses the removal from the hash table?

I rethought this; it seems unnecessary to free the information here, as we
don't expect the worker to stop unless the leader asks it to. So, I temporarily
removed this and will think about it more for the next version.

> ~
> 
> 14.
> 
> + case 'X': /* Terminate, indicating clean exit. */ { 
> + shm_mq_detach(winfo->error_mq_handle);
> + winfo->error_mq_handle = NULL;
> + break;
> + }
> + default:
> 
> 
> No need for the { } here.

Changed.

> ~~~
> 
> 16. pa_init_and_launch_worker
> 
> + /* Setup shared memory. */
> + if (!pa_setup_dsm(winfo))
> + {
> + MemoryContextSwitchTo(oldcontext);
> + pfree(winfo);
> + return NULL;
> + }
> 
> 
> Wouldn't it be better to do the pfree before switching back to the oldcontext?

I think either style is fine.

> ~~~
> 
> 17. pa_send_data
> 
> + /* Wait before retrying. */
> + rc = WaitLatch(MyLatch,
> +    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +    SHM_SEND_RETRY_INTERVAL_MS,
> +    WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
> +
> + if (rc & WL_LATCH_SET)
> + {
> + ResetLatch(MyLatch);
> + CHECK_FOR_INTERRUPTS();
> + }
> 
> 
> Instead of CHECK_FOR_INTERRUPTS, should this be calling your new 
> function ProcessParallelApplyInterrupts?

I thought ProcessParallelApplyInterrupts() was intended to be invoked only in
the main loop (LogicalParallelApplyLoop) to make the parallel apply worker exit
cleanly.

> ~
> 
> 18.
> 
> + if (startTime == 0)
> + startTime = GetCurrentTimestamp();
> + else if (TimestampDifferenceExceeds(startTime, 
> + GetCurrentTimestamp(),
> + SHM_SEND_TIMEOUT_MS))
> + ereport(ERROR,
> + (errcode(ERRCODE_CONNECTION_FAILURE),
> + errmsg("terminating logical replication parallel apply worker due to
> timeout")));
> 
> 
> I'd previously commented that the timeout calculation seemed wrong.
> Hou-san replied [1,#9] "start counting from the first failure looks 
> fine to me." but I am not so sure - e.g. If the timeout is 10s then I 
> expect it to fail ~10s after the function is called, not 11s after. I 
> know it's pedantic, but where's the harm in making the calculation 
> right instead of just nearly right?
> 
> IMO probably an easy fix for this is like:
> 
> #define SHM_SEND_RETRY_INTERVAL_MS 1000 #define SHM_SEND_TIMEOUT_MS 
> (10000 - SHM_SEND_RETRY_INTERVAL_MS)

OK, I moved the setting of startTime to before the WaitLatch call.
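
For reference, the resulting retry logic might look roughly like the sketch below; pa_try_send() is a made-up stand-in for the real non-blocking shared-memory write, and the surrounding function is illustrative only:

```
#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 10000

static bool
pa_send_data_sketch(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
{
	TimestampTz startTime = 0;

	for (;;)
	{
		int			rc;

		/* Hypothetical non-blocking write into the worker's queue. */
		if (pa_try_send(winfo, nbytes, data))
			return true;

		/* Start the timeout clock before the first wait. */
		if (startTime == 0)
			startTime = GetCurrentTimestamp();
		else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
											SHM_SEND_TIMEOUT_MS))
			ereport(ERROR,
					(errcode(ERRCODE_CONNECTION_FAILURE),
					 errmsg("terminating logical replication parallel apply worker due to timeout")));

		/* Wait before retrying. */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   SHM_SEND_RETRY_INTERVAL_MS,
					   WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);

		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
		}
	}
}
```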

> ~~~
> 
> 20. pa_savepoint_name
> 
> +static void
> +pa_savepoint_name(Oid suboid, TransactionId xid, char *spname,
> +   Size szsp)
> 
> Unnecessary wrapping?

Changed.

> ======
> 
> src/backend/replication/logical/origin.c
> 
> 21. replorigin_session_setup
> 
> + * However, we do allow multiple processes to point to the same 
> + origin slot
> + * if requested by the caller by passing PID of the process that has 
> + already
> + * acquired it. This is to allow using the same origin by multiple 
> + parallel
> + * apply processes the provided they maintain commit order, for 
> + example, by
> + * allowing only one process to commit at a time.
> 
> 21a.
> I thought the comment should mention this is optional and the special 
> value acquired_by=0 means don't do this.

Added.

> ~
> 
> 21b.
> "the provided they" ?? typo?

Changed.

> ======
> 
> src/backend/replication/logical/tablesync.c
> 
> 22. process_syncing_tables
> 
>  process_syncing_tables(XLogRecPtr current_lsn)  {
> + /*
> + * Skip for parallel apply workers as they don't operate on tables 
> + that
> + * are not in ready state. See pa_can_start() and
> + * should_apply_changes_for_rel().
> + */
> + if (am_parallel_apply_worker())
> + return;
> 
> SUGGESTION (remove the double negative) Skip for parallel apply 
> workers because they only operate on tables that are in a READY state. 
> See pa_can_start() and should_apply_changes_for_rel().

Changed.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 23. apply_handle_stream_stop
> 
> 
> Previously I suggested that this lock/unlock seems too tricky and 
> needed a comment. The reply [1,#12] was that this is already described 
> atop parallelapplyworker.c. OK, but in that case maybe here the 
> comment can just refer to that explanation:
> 
> SUGGESTION
> Refer to the comments atop applyparallelworker.c for what this lock 
> and immediate unlock is doing.
> 
> ~~~
> 
> 24. apply_handle_stream_abort
> 
> + if 
> + (pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_stream_count),
> 1) == 0)
> + {
> + pa_lock_stream(MyParallelShared->xid, AccessShareLock); 
> + pa_unlock_stream(MyParallelShared->xid, AccessShareLock); }
> 
> ditto comment #23

I feel that atop the definition of the pa_lock_xxx functions is a better place
for these comments, so I added them there. Readers can check them when reading
the lock functions.

> ~~~
> 
> 25. apply_worker_clean_exit
> 
> +void
> +apply_worker_clean_exit(void)
> +{
> + /* Notify the leader apply worker that we have exited cleanly. */  
> +if (am_parallel_apply_worker())  pq_putmessage('X', NULL, 0);
> +
> + proc_exit(0);
> +}
> 
> Somehow it doesn't seem right that the PA worker sending 'X' is here 
> in worker.c, while the LA worker receipt of this 'X' is in the other 
> applyparallelworker.c module. Maybe that other function 
> HandleParallelApplyMessage should also be here in worker.c?

I thought the function apply_worker_clean_exit() is widely used in worker.c and
is common to both the leader and parallel apply workers, so I put it in
worker.c. But HandleParallelApplyMessage() relates only to parallel apply
workers, so it seems better to keep it in applyparallelworker.c.

> ======
> 
> src/backend/utils/misc/guc_tables.c
> 
> 26.
> 
> @@ -2957,6 +2957,18 @@ struct config_int ConfigureNamesInt[] =
>   NULL,
>   },
>   &max_sync_workers_per_subscription,
> + 2, 0, MAX_PARALLEL_WORKER_LIMIT,
> + NULL, NULL, NULL
> + },
> +
> + {
> + {"max_parallel_apply_workers_per_subscription",
> + PGC_SIGHUP,
> + REPLICATION_SUBSCRIBERS,
> + gettext_noop("Maximum number of parallel apply workers per 
> + subscription."), NULL, }, 
> + &max_parallel_apply_workers_per_subscription,
>   2, 0, MAX_BACKENDS,
>   NULL, NULL, NULL
> 
> Is this correct? Did you mean to change 
> max_sync_workers_per_subscription, My 1st impression is that there has 
> been some mixup with the MAX_PARALLEL_WORKER_LIMIT and MAX_BACKENDS or 
> that this change was accidentally made to the wrong GUC.

Fixed.

> ======
> 
> src/include/replication/worker_internal.h
> 
> 27. ParallelApplyWorkerShared
> 
> + /*
> + * Indicates whether there are pending streaming blocks in the queue. 
> + The
> + * parallel apply worker will check it before starting to wait.
> + */
> + pg_atomic_uint32 pending_stream_count;
> 
> A better name might be 'n_pending_stream_blocks'.

I am not sure that name looks better, so I didn't change this.

> ~
> 
> 28. function names
> 
>  extern void logicalrep_worker_stop(Oid subid, Oid relid);
> +extern void logicalrep_parallel_apply_worker_stop(int slot_no, uint16
> generation);
>  extern void logicalrep_worker_wakeup(Oid subid, Oid relid);  extern 
> void logicalrep_worker_wakeup_ptr(LogicalRepWorker *worker);
> 
>  extern int logicalrep_sync_worker_count(Oid subid);
> +extern int logicalrep_parallel_apply_worker_count(Oid subid);
> 
> Would it be better to call those new functions using similar shorter 
> names as done elsewhere?
> 
> logicalrep_parallel_apply_worker_stop -> logicalrep_pa_worker_stop 
> logicalrep_parallel_apply_worker_count -> logicalrep_pa_worker_count

Changed.

Attached is a new version of the patch set which also fixes an invalid shared
memory access bug in the 0002 patch, reported by Kuroda-San offlist.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch!
I tested whether a deadlock caused by a foreign key constraint could be
detected, and it worked well.

The following are my review comments. They mostly relate to 0001, but some
may not. It is taking me some time to understand 0002 correctly...

01. typedefs.list

LeaderFileSetState should be added to typedefs.list.


02. 032_streaming_parallel_apply.pl

As I said in [1], the test name may not match. Is there a reason you reverted
the change?


03. 032_streaming_parallel_apply.pl

The test does not cover the case where a backend process is involved in the
deadlock. IIUC this is another motivation for using a stream/transaction lock,
so I think such a test should be added.

04. log output

While spooled changes are being applied by a PA, there are many messages like
"replayed %d changes from file..." and "applied %u changes...". They come from
apply_handle_stream_stop() and apply_spooled_messages(). They have the same
meaning, so I think one of them can be removed.

05. system_views.sql

In the previous version you modified the pg_stat_subscription system view. Why did you revert that?

06. interrupt.c - SignalHandlerForShutdownRequest()

The comment atop SignalHandlerForShutdownRequest() lists some processes that
assign this handler to signals other than SIGTERM. We may be able to add the
parallel apply worker to that list.

07. proto.c - logicalrep_write_stream_abort()

We may be able to add assertions for abort_lsn and abort_time, as we do for xid and subxid.


08. guc_tables.c - ConfigureNamesInt

```
                &max_sync_workers_per_subscription,
+               2, 0, MAX_PARALLEL_WORKER_LIMIT,
+               NULL, NULL, NULL
+       },
```

The upper limit set here for max_sync_workers_per_subscription seems to be
wrong; it should be used for max_parallel_apply_workers_per_subscription instead.


10. worker.c - maybe_reread_subscription()


```
+               if (am_parallel_apply_worker())
+                       ereport(LOG,
+                       /* translator: first %s is the name of logical replication worker */
+                                       (errmsg("%s for subscription \"%s\" will stop because of a parameter change",
+                                                       get_worker_name(), MySubscription->name)));
```

I am not sure get_worker_name() is needed here. I think "logical replication
apply worker" should be embedded directly.


11. worker.c - ApplyWorkerMain()

```
+                               (errmsg_internal("%s for subscription \"%s\" two_phase is %s",
+                                                                get_worker_name(),
```

A translator comment is needed for this message.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58666A97D40AB8919D106AD5F5709%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Tue, Nov 22, 2022 at 7:23 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
>
> 07. proto.c - logicalrep_write_stream_abort()
>
> We may able to add assertions for abort_lsn and abort_time, like xid and subxid.
>

If you see logicalrep_write_stream_commit(), we have an assertion for
xid but not for LSN and other parameters. I think the current coding
in the patch is consistent with that.

>
> 08. guc_tables.c - ConfigureNamesInt
>
> ```
>                 &max_sync_workers_per_subscription,
> +               2, 0, MAX_PARALLEL_WORKER_LIMIT,
> +               NULL, NULL, NULL
> +       },
> ```
>
> The upper limit for max_sync_workers_per_subscription seems to be wrong, it should
> be used for max_parallel_apply_workers_per_subscription.
>

Right, I don't know why this needs to be changed in the first place.


-- 
With Regards,
Amit Kapila.



On Tue, Nov 22, 2022 at 7:30 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Few minor comments and questions:
============================
1.
+static void
+LogicalParallelApplyLoop(shm_mq_handle *mqh)
{
+ for (;;)
+ {
+ void    *data;
+ Size len;
+
+ ProcessParallelApplyInterrupts();
...
...
+ if (rc & WL_LATCH_SET)
+ {
+ ResetLatch(MyLatch);
+ ProcessParallelApplyInterrupts();
+ }
...
}

Why is ProcessParallelApplyInterrupts() called twice in
LogicalParallelApplyLoop()?

2.
+ * This scenario is similar to the first case but TX-1 and TX-2 are executed by
+ * two parallel apply workers (PA-1 and PA-2 respectively). In this scenario,
+ * PA-2 is waiting for PA-1 to complete its transaction while PA-1 is waiting
+ * for subsequent input from LA. Also, LA is waiting for PA-2 to complete its
+ * transaction in order to preserve the commit order. There is a deadlock among
+ * three processes.
+ *
...
...
+ *
+ * LA (waiting to acquire the local transaction lock) -> PA-1 (waiting to
+ * acquire the lock on the unique index) -> PA-2 (waiting to acquire the lock
+ * on the remote transaction) -> LA
+ *

Isn't the order of PA-1 and PA-2 different in the second paragraph as
compared to the first one?

3.
+ * Deadlock-detection
+ * ------------------

It may be better to keep the title of this section as Locking Considerations.

4. In the section mentioned in Point 3, it would be better to
separately explain why we need session-level locks instead of
transaction level.

5. Add the below comments in the code:
diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 9385afb6d2..56f00defcf 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -431,6 +431,9 @@ pa_free_worker_info(ParallelApplyWorkerInfo *winfo)
        if (winfo->dsm_seg != NULL)
                dsm_detach(winfo->dsm_seg);

+       /*
+        * Ensure this worker information won't be reused during
worker allocation.
+        */
        ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,

                    winfo);

@@ -762,6 +765,10 @@
HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo
msg)
                                 */
                                error_context_stack = apply_error_context_stack;

+                               /*
+                                * The actual error must be already
reported by parallel apply
+                                * worker.
+                                */
                                ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                                                 errmsg("parallel
apply worker exited abnormally"),




-- 
With Regards,
Amit Kapila.



Here are some review comments for v51-0001.

======

.../replication/logical/applyparallelworker.c

1. General - Error messages, get_worker_name()

I previously wrote a comment to ask if the get_worker_name() should be
used in more places but the reply [1, #2b] was:

> 2b.
> Consider if maybe all of these ought to be calling get_worker_name()
> which is currently static in worker.c. Doing this means any future
> changes to get_worker_name won't cause more inconsistencies.

The most error message in applyparallelxx.c can only use "xx parallel
worker", so I think it's fine not to call get_worker_name

~

I thought the reply missed the point I was trying to make -- I meant that
if it was arranged now so *every* message goes via get_worker_name(), then
if in future somebody wanted to change the names (e.g. from "logical
replication parallel apply worker" to "LR PA worker") it would only need to
be changed in one central place instead of hunting down every hardwired
error message.

Anyway, you can do it how you want -- I just was not sure you'd got my
original point.

~~~

2. HandleParallelApplyMessage

+ case 'X': /* Terminate, indicating clean exit. */
+ shm_mq_detach(winfo->error_mq_handle);
+ winfo->error_mq_handle = NULL;
+ break;
+ default:
+ elog(ERROR, "unrecognized message type received from logical
replication parallel apply worker: %c (message length %d bytes)",
+ msgtype, msg->len);

The case 'X' code indentation is too much.

======

src/backend/replication/logical/origin.c

3. replorigin_session_setup(RepOriginId node, int acquired_by)

@@ -1075,12 +1075,20 @@ ReplicationOriginExitCleanup(int code, Datum arg)
  * array doesn't have to be searched when calling
  * replorigin_session_advance().
  *
- * Obviously only one such cached origin can exist per process and the current
+ * Normally only one such cached origin can exist per process and the current
  * cached value can only be set again after the previous value is torn down
  * with replorigin_session_reset().
+ *
+ * However, we do allow multiple processes to point to the same origin slot if
+ * requested by the caller by passing PID of the process that has already
+ * acquired it as acquired_by. This is to allow multiple parallel apply
+ * processes to use the same origin, provided they maintain commit order, for
+ * example, by allowing only one process to commit at a time. For the first
+ * process requesting this origin, the acquired_by parameter needs to be set to
+ * 0.
  */
 void
-replorigin_session_setup(RepOriginId node)
+replorigin_session_setup(RepOriginId node, int acquired_by)

I think the meaning of the acquired_by=0 is not fully described here:
"For the first process requesting this origin, the acquired_by
parameter needs to be set to 0."
IMO that seems to be describing it only from POV that you are always
going to want to allow multiple processes. But really this is an
optional feature so you might pass acquired_by=0, not just because
this is the first of multiple, but also because you *never* want to
allow multiple at all. The comment does not convey this meaning.

Maybe something worded like below is better?

SUGGESTION
Normally only one such cached origin can exist per process so the
cached value can only be set again after the previous value is torn
down with replorigin_session_reset(). For this normal case pass
acquired_by=0 (meaning the slot is not allowed to be already acquired
by another process).

However, sometimes multiple processes can safely re-use the same
origin slot (for example, multiple parallel apply processes can safely
use the same origin, provided they maintain commit order by allowing
only one process to commit at a time). For this case the first process
must pass acquired_by=0, and then the other processes sharing that
same origin can pass acquired_by=PID of the first process.
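
Expressed as a usage sketch of the two cases (leader_pid is just an illustrative variable holding the first process's PID):

```
/* Leader apply worker: the slot must not already be acquired by anyone. */
replorigin_session_setup(originid, 0);

/* Parallel apply worker: share the slot already acquired by the leader. */
replorigin_session_setup(originid, leader_pid);
```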

======

src/backend/replication/logical/worker.c

4. GENERAL - get_worker_name()

If you decide it is OK to hardwire some error messages instead of
unconditionally calling the get_worker_name() -- see my #1 review
comment in this post -- then there are some other messages in this
file that also seem like they can be also hardwired because the type
of worker is already known.

Here are some examples:

4a.

+ else if (am_parallel_apply_worker())
+ {
+ if (rel->state != SUBREL_STATE_READY)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ /* translator: first %s is the name of logical replication worker */
+ errmsg("%s for subscription \"%s\" will stop",
+ get_worker_name(), MySubscription->name),
+ errdetail("Cannot handle streamed replication transactions using
parallel apply workers until all tables have been synchronized.")));
+
+ return true;
+ }

In the above code from should_apply_changes_for_rel we already know
this is a parallel apply worker.

~

4b.

+ if (am_parallel_apply_worker())
+ ereport(LOG,
+ /* translator: first %s is the name of logical replication worker */
+ (errmsg("%s for subscription \"%s\" will stop because of a parameter change",
+ get_worker_name(), MySubscription->name)));
+ else

In the above code from maybe_reread_subscription we already know this
is a parallel apply worker.

4c.

  if (am_tablesync_worker())
  ereport(LOG,
- (errmsg("logical replication table synchronization worker for
subscription \"%s\", table \"%s\" has started",
- MySubscription->name, get_rel_name(MyLogicalRepWorker->relid))));
+ /* translator: first %s is the name of logical replication worker */
+ (errmsg("%s for subscription \"%s\", table \"%s\" has started",
+ get_worker_name(), MySubscription->name,
+ get_rel_name(MyLogicalRepWorker->relid))));

In the above code from ApplyWorkerMain we already know this is a
tablesync worker

~~~

5. get_transaction_apply_action

+
+/*
+ * Return the action to take for the given transaction. *winfo is assigned to
+ * the destination parallel worker info (if the action is
+ * TRANS_LEADER_SEND_TO_PARALLEL, otherwise *winfo is assigned NULL.
+ */
+static TransApplyAction
+get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo)

There is no closing ')' in the function comment.

~~~

6. apply_worker_clean_exit

+ /* Notify the leader apply worker that we have exited cleanly. */
+ if (am_parallel_apply_worker())
+ pq_putmessage('X', NULL, 0);

IMO the comment would be better inside the if block

SUGGESTION
if (am_parallel_apply_worker())
{
    /* Notify the leader apply worker that we have exited cleanly. */
    pq_putmessage('X', NULL, 0);
}

------

[1] Hou-san's reply to my v49-0001 review.
https://www.postgresql.org/message-id/OS0PR01MB5716339FF7CB759E751492CB940D9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, November 23, 2022 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Nov 22, 2022 at 7:30 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> 
> Few minor comments and questions:
> ============================
> 1.
> +static void
> +LogicalParallelApplyLoop(shm_mq_handle *mqh)
> {
> + for (;;)
> + {
> + void    *data;
> + Size len;
> +
> + ProcessParallelApplyInterrupts();
> ...
> ...
> + if (rc & WL_LATCH_SET)
> + {
> + ResetLatch(MyLatch);
> + ProcessParallelApplyInterrupts();
> + }
> ...
> }
> 
> Why ProcessParallelApplyInterrupts() is called twice in
> LogicalParallelApplyLoop()?

I think the second call is unnecessary, so I removed it.

> 2.
> + * This scenario is similar to the first case but TX-1 and TX-2 are
> + executed by
> + * two parallel apply workers (PA-1 and PA-2 respectively). In this
> + scenario,
> + * PA-2 is waiting for PA-1 to complete its transaction while PA-1 is
> + waiting
> + * for subsequent input from LA. Also, LA is waiting for PA-2 to
> + complete its
> + * transaction in order to preserve the commit order. There is a
> + deadlock among
> + * three processes.
> + *
> ...
> ...
> + *
> + * LA (waiting to acquire the local transaction lock) -> PA-1 (waiting
> + to
> + * acquire the lock on the unique index) -> PA-2 (waiting to acquire
> + the lock
> + * on the remote transaction) -> LA
> + *
> 
> Isn't the order of PA-1 and PA-2 different in the second paragraph as compared
> to the first one.

Fixed.

> 3.
> + * Deadlock-detection
> + * ------------------
> 
> It may be better to keep the title of this section as Locking Considerations.
> 
> 4. In the section mentioned in Point 3, it would be better to separately explain
> why we need session-level locks instead of transaction level.

Added.
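
For reference, a session-level acquisition of the new lock looks roughly like the following sketch (the function name and the objid value are illustrative; the lock tag macro is the one from the patch):

```
static void
pa_lock_stream_sketch(TransactionId xid, LOCKMODE lockmode)
{
	LOCKTAG		tag;

	SET_LOCKTAG_APPLY_TRANSACTION(tag, MyDatabaseId, MySubscription->oid,
								  xid, 0);

	/*
	 * sessionLock = true: the lock is not released at transaction
	 * commit/abort inside the parallel apply worker, so a wait on it can
	 * span the worker's own transaction boundaries and still be seen by
	 * the deadlock detector.
	 */
	(void) LockAcquire(&tag, lockmode, true /* sessionLock */ ,
					   false /* dontWait */ );
}
```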

> 5. Add the below comments in the code:
> diff --git a/src/backend/replication/logical/applyparallelworker.c
> b/src/backend/replication/logical/applyparallelworker.c
> index 9385afb6d2..56f00defcf 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -431,6 +431,9 @@ pa_free_worker_info(ParallelApplyWorkerInfo *winfo)
>         if (winfo->dsm_seg != NULL)
>                 dsm_detach(winfo->dsm_seg);
> 
> +       /*
> +        * Ensure this worker information won't be reused during
> worker allocation.
> +        */
>         ,
> 
>                     winfo);
> 
> @@ -762,6 +765,10 @@
> HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo
> msg)
>                                  */
>                                 error_context_stack =
> apply_error_context_stack;
> 
> +                               /*
> +                                * The actual error must be already
> reported by parallel apply
> +                                * worker.
> +                                */
>                                 ereport(ERROR,
> 
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                                                  errmsg("parallel apply worker
> exited abnormally"),

Added.

Attached is the new version of the patch which addresses all comments so far.

Besides, I made the PA send a different message to the LA when it exits due to
a subscription information change. The LA will report a more meaningful message
and restart replication after catching the new message, which prevents the LA
from sending messages to an already-exited PA.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, November 25, 2022 10:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are some review comments for v51-0001.

Thanks for the comments!
> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 1. General - Error messages, get_worker_name()
> 
> I previously wrote a comment to ask if the get_worker_name() should be used
> in more places but the reply [1, #2b] was:
> 
> > 2b.
> > Consider if maybe all of these ought to be calling get_worker_name()
> > which is currently static in worker.c. Doing this means any future
> > changes to get_worker_name won't cause more inconsistencies.
> 
> Most of the error messages in applyparallelxx.c can only use "xx parallel worker",
> so I think it's fine not to call get_worker_name
> 
> ~
> 
> I thought the reply missed the point I was trying to make -- I meant if it was
> arranged now so *every* message would go via
> get_worker_name() then in future somebody wanted to change the names (e.g.
> from "logical replication parallel apply worker" to "LR PA
> worker") then it would only need to be changed in one central place instead of
> hunting down every hardwired error message.
> 

Thanks for the suggestion. I understand your point, but I feel that using
get_worker_name() at some places where the worker type is decided could make
developers think that any kind of worker can enter this code, which I am not
sure is better. So I didn't change this.

> 
> 2. HandleParallelApplyMessage
> 
> + case 'X': /* Terminate, indicating clean exit. */
> + shm_mq_detach(winfo->error_mq_handle);
> + winfo->error_mq_handle = NULL;
> + break;
> + default:
> + elog(ERROR, "unrecognized message type received from logical
> replication parallel apply worker: %c (message length %d bytes)",
> + msgtype, msg->len);
> 
> The case 'X' code indentation is too much.

Changed.

> ======
> 
> src/backend/replication/logical/origin.c
> 
> 3. replorigin_session_setup(RepOriginId node, int acquired_by)
> 
> @@ -1075,12 +1075,20 @@ ReplicationOriginExitCleanup(int code, Datum arg)
>   * array doesn't have to be searched when calling
>   * replorigin_session_advance().
>   *
> - * Obviously only one such cached origin can exist per process and the current
> + * Normally only one such cached origin can exist per process and the
> + current
>   * cached value can only be set again after the previous value is torn down
>   * with replorigin_session_reset().
> + *
> + * However, we do allow multiple processes to point to the same origin
> + slot if
> + * requested by the caller by passing PID of the process that has
> + already
> + * acquired it as acquired_by. This is to allow multiple parallel apply
> + * processes to use the same origin, provided they maintain commit
> + order, for
> + * example, by allowing only one process to commit at a time. For the
> + first
> + * process requesting this origin, the acquired_by parameter needs to
> + be set to
> + * 0.
>   */
>  void
> -replorigin_session_setup(RepOriginId node)
> +replorigin_session_setup(RepOriginId node, int acquired_by)
> 
> I think the meaning of the acquired_by=0 is not fully described here:
> "For the first process requesting this origin, the acquired_by parameter needs
> to be set to 0."
> IMO that seems to be describing it only from POV that you are always going to
> want to allow multiple processes. But really this is an optional feature so you
> might pass acquired_by=0, not just because this is the first of multiple, but also
> because you *never* want to allow multiple at all. The comment does not
> convey this meaning.
> 
> Maybe something worded like below is better?
> 
> SUGGESTION
> Normally only one such cached origin can exist per process so the cached value
> can only be set again after the previous value is torn down with
> replorigin_session_reset(). For this normal case pass
> acquired_by=0 (meaning the slot is not allowed to be already acquired by
> another process).
> 
> However, sometimes multiple processes can safely re-use the same origin slot
> (for example, multiple parallel apply processes can safely use the same origin,
> provided they maintain commit order by allowing only one process to commit
> at a time). For this case the first process must pass acquired_by=0, and then the
> other processes sharing that same origin can pass acquired_by=PID of the first
> process.

Changes as suggested.
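
To illustrate the intended call pattern (a sketch based only on the comment
above, not code from the patch; the variable names are made up):

    /* Leader apply worker: acquire the origin slot in the normal way. */
    replorigin_session_setup(originid, 0);

    /*
     * Parallel apply worker: re-use the slot already acquired by the leader,
     * identified here by the leader's PID (assumed to be available to the
     * parallel worker as leader_apply_worker_pid).
     */
    replorigin_session_setup(originid, leader_apply_worker_pid);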

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 4. GENERAL - get_worker_name()
> 
> If you decide it is OK to hardwire some error messages instead of
> unconditionally calling the get_worker_name() -- see my #1 review comment in
> this post -- then there are some other messages in this file that also seem like
> they can be also hardwired because the type of worker is already known.
> 
> Here are some examples:
> 
> 4a.
> 
> + else if (am_parallel_apply_worker())
> + {
> + if (rel->state != SUBREL_STATE_READY)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + /* translator: first %s is the name of logical replication worker */
> + errmsg("%s for subscription \"%s\" will stop", get_worker_name(),
> + MySubscription->name), errdetail("Cannot handle streamed replication
> + transactions using
> parallel apply workers until all tables have been synchronized.")));
> +
> + return true;
> + }
> 
> In the above code from should_apply_changes_for_rel we already know this is a
> parallel apply worker.
> 
> ~
> 
> 4b.
> 
> + if (am_parallel_apply_worker())
> + ereport(LOG,
> + /* translator: first %s is the name of logical replication worker */
> + (errmsg("%s for subscription \"%s\" will stop because of a parameter
> + change", get_worker_name(), MySubscription->name))); else
> 
> In the above code from maybe_reread_subscription we already know this is a
> parallel apply worker.
> 
> 4c.
> 
>   if (am_tablesync_worker())
>   ereport(LOG,
> - (errmsg("logical replication table synchronization worker for subscription
> \"%s\", table \"%s\" has started",
> - MySubscription->name, get_rel_name(MyLogicalRepWorker->relid))));
> + /* translator: first %s is the name of logical replication worker */
> + (errmsg("%s for subscription \"%s\", table \"%s\" has started",
> + get_worker_name(), MySubscription->name,
> + get_rel_name(MyLogicalRepWorker->relid))));
> 
> In the above code from ApplyWorkerMain we already know this is a tablesync
> worker

Thanks for checking these, changed.

> ~~~
> 
> 5. get_transaction_apply_action
> 
> +
> +/*
> + * Return the action to take for the given transaction. *winfo is
> +assigned to
> + * the destination parallel worker info (if the action is
> + * TRANS_LEADER_SEND_TO_PARALLEL, otherwise *winfo is assigned NULL.
> + */
> +static TransApplyAction
> +get_transaction_apply_action(TransactionId xid,
> ParallelApplyWorkerInfo **winfo)
> 
> There is no closing ')' in the function comment.

Added.

> ~~~
> 
> 6. apply_worker_clean_exit
> 
> + /* Notify the leader apply worker that we have exited cleanly. */ if
> + (am_parallel_apply_worker()) pq_putmessage('X', NULL, 0);
> 
> IMO the comment would be better inside the if block
> 
> SUGGESTION
> if (am_parallel_apply_worker())
> {
>     /* Notify the leader apply worker that we have exited cleanly. */
>     pq_putmessage('X', NULL, 0);
> }

Changed.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, November 22, 2022 9:53 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
> 
> Thanks for updating the patch!
> I tested the case whether the deadlock caused by foreign key constraint could
> be detected, and it worked well.
> 
> Followings are my review comments. They are basically related with 0001, but
> some contents may be not. It takes time to understand 0002 correctly...

Thanks for the comments!

> 01. typedefs.list
> 
> LeaderFileSetState should be added to typedefs.list.
> 
> 
> 02. 032_streaming_parallel_apply.pl
> 
> As I said in [1]: the test name may not match. Do you have reasons to
> revert the change?

The original parallel safety check has been removed, so I changed the name.
After rethinking this, I renamed it to stream_parallel_conflict.

> 
> 03. 032_streaming_parallel_apply.pl
> 
> The test does not cover the case that the backend process relates with the
> deadlock. IIUC this is another motivation to use a stream/transaction lock.
> I think it should be added.

The main deadlock cases that the stream/transaction lock can detect are 1) LA->PA
and 2) LA->PA->PA, as explained atop applyparallelworker.c. So I think the backend
process related one is a variant which has already been covered by the existing
tests in the patch.

> 04. log output
> 
> While spooled changes are being applied by PA, there are so many messages like
> "replayed %d changes from file..." and "applied %u changes...". They come from
> apply_handle_stream_stop() and apply_spooled_messages(). They have the same
> meaning, so I think one of them can be removed.

Changed.

> 05. system_views.sql
> 
> In the previous version you modified pg_stat_subscription system view. Why
> do you revert that?

I was not sure whether we should include that in the main patch set,
so I added a top-up patch that changes the view.

> 06. interrupt.c - SignalHandlerForShutdownRequest()
> 
> In the comment atop SignalHandlerForShutdownRequest(), the processes that
> assign this function to signals other than SIGTERM are listed. We may be able
> to add the parallel apply worker there.

Changed.


> 08. guc_tables.c - ConfigureNamesInt
> 
> ```
>                 &max_sync_workers_per_subscription,
> +               2, 0, MAX_PARALLEL_WORKER_LIMIT,
> +               NULL, NULL, NULL
> +       },
> ```
> 
> The upper limit for max_sync_workers_per_subscription seems to be wrong, it
> should be used for max_parallel_apply_workers_per_subscription.

That's my miss, sorry for that.

> 10. worker.c - maybe_reread_subscription()
> 
> 
> ```
> +               if (am_parallel_apply_worker())
> +                       ereport(LOG,
> +                       /* translator: first %s is the name of logical replication
> worker */
> +                                       (errmsg("%s for subscription \"%s\"
> will stop because of a parameter change",
> +
> + get_worker_name(), MySubscription->name)));
> ```
> 
> I was not sure get_worker_name() is needed. I think "logical replication apply
> worker"
> should be embedded.

Changed.

> 
> 11. worker.c - ApplyWorkerMain()
> 
> ```
> +                               (errmsg_internal("%s for subscription \"%s\"
> two_phase is %s",
> +
> + get_worker_name(),
> ```

Changed.


Best regards,
Hou zj

Here are some review comments for patch v51-0002

======

1.

GENERAL - terminology:  spool/serialize and data/changes/message

The terminology seems to be used at random. IMO it might be worthwhile
rechecking at least that terms are used consistently in all the
comments. e.g "serialize message data to disk" ... and later ...
"apply the spooled messages".

Also for places where it says "Write the message to file" maybe
consider using consistent terminology like "serialize the message to a
file".

Also, try to standardize the way things are described by using
consistent (if they really are the same) terminology for "writing
data" VS "writing changes" VS "writing messages" etc. It is confusing
trying to know if the different wording has some intended meaning or
is it just random.

======

Commit message

2.
When the leader apply worker times out while sending a message to the parallel
apply worker. Instead of erroring out, switch to partial serialize mode and let
the leader serialize all remaining changes to the file and notify the parallel
apply workers to read and apply them at the end of the transaction.

~

The first sentence seems incomplete

SUGGESTION.
In patch 0001 if the leader apply worker times out while attempting to
send a message to the parallel apply worker it results in an ERROR.

This patch (0002) modifies that behaviour, so instead of erroring it
will switch to "partial serialize" mode -  in this mode the leader
serializes all remaining changes to a file and notifies the parallel
apply workers to read and apply them at the end of the transaction.

~~~

3.

This patch 0002 is called “Serialize partial changes to disk if the
shm_mq buffer is full”, but the commit message is saying nothing about
the buffer filling up. I think the Commit message should be mentioning
something that makes the commit patch name more relevant. Otherwise
change the patch name.

======

.../replication/logical/applyparallelworker.c

4. File header comment

+ * timeout is exceeded, the LA will write to file and indicate PA-2 that it
+ * needs to read file for remaining messages. Then LA will start waiting for
+ * commit which will detect deadlock if any. (See pa_send_data() and typedef
+ * enum TransApplyAction)

"needs to read file for remaining messages" -> "needs to read that
file for the remaining messages"

~~~

5. pa_free_worker

+ /*
+ * Stop the worker if there are enough workers in the pool.
+ *
+ * XXX we also need to stop the worker if the leader apply worker
+ * serialized part of the transaction data to a file due to send timeout.
+ * This is because the message could be partially written to the queue due
+ * to send timeout and there is no way to clean the queue other than
+ * resending the message until it succeeds. To avoid complexity, we
+ * directly stop the worker in this case.
+ */
+ if (winfo->serialize_changes ||
+ napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

5a.

+ * XXX we also need to stop the worker if the leader apply worker
+ * serialized part of the transaction data to a file due to send timeout.

SUGGESTION
XXX The worker is also stopped if the leader apply worker needed to
serialize part of the transaction data due to a send timeout.

~

5b.

+ /* Unlink the files with serialized changes. */
+ if (winfo->serialize_changes)
+ stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);

A better comment might be

SUGGESTION
Unlink any files that were needed to serialize partial changes.

~~~

6. pa_spooled_messages

/*
 * Replay the spooled messages in the parallel apply worker if leader apply
 * worker has finished serializing changes to the file.
 */
static void
pa_spooled_messages(void)

6a.
IMO a better name for this function would be pa_apply_spooled_messages();

~

6b.
"if leader apply" -> "if the leader apply"

~

7.

+ /*
+ * Acquire the stream lock if the leader apply worker is serializing
+ * changes to the file, because the parallel apply worker will no longer
+ * have a chance to receive a STREAM_STOP and acquire the lock until the
+ * leader serialize all changes to the file.
+ */
+ if (fileset_state == LEADER_FILESET_BUSY)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }

SUGGESTION (rearranged comment - please check, I am not sure if I got
this right)

If the leader apply worker is still (busy) serializing partial changes
then the parallel apply worker acquires the stream lock now.
Otherwise, it would not have a chance to receive a STREAM_STOP (and
acquire the stream lock) until the leader had serialized all changes.

~~~

8. pa_send_data

+ *
+ * When sending data times out, data will be serialized to disk. And the
+ * current streaming transaction will enter PARTIAL_SERIALIZE mode, which means
+ * that subsequent data will also be serialized to disk.
  */
 void
 pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)

SUGGESTION (minor comment change)

If the attempt to send data via shared memory times out, then we will
switch to "PARTIAL_SERIALIZE mode" for the current transaction. This
means that the current data and any subsequent data for this
transaction will be serialized to disk.

~

9.

  Assert(!IsTransactionState());
+ Assert(!winfo->serialize_changes);

How about also asserting that this must be the LA worker?

~

10.

+ /*
+ * The parallel apply worker might be stuck for some reason, so
+ * stop sending data to parallel worker and start to serialize
+ * data to files.
+ */
+ winfo->serialize_changes = true;

SUGGESTION (minor reword)
The parallel apply worker might be stuck for some reason, so stop
sending data directly to it and start to serialize data to files
instead.

~

11.
+ /* Skip first byte and statistics fields. */
+ msg.cursor += SIZE_STATS_MESSAGE + 1;

IMO it would be better for the comment order and the code calculation
order to be the same.

SUGGESTION
/* Skip first byte and statistics fields. */
msg.cursor += 1 + SIZE_STATS_MESSAGE;

~

12. pa_stream_abort

+ /*
+ * If the parallel apply worker is applying the spooled
+ * messages, we save the current file position and close the
+ * file to prevent the file from being accidentally closed on
+ * rollback.
+ */
+ if (stream_fd)
+ {
+ BufFileTell(stream_fd, &fileno, &offset);
+ BufFileClose(stream_fd);
+ reopen_stream_fd = true;
+ }
+
  RollbackToSavepoint(spname);
  CommitTransactionCommand();
  subxactlist = list_truncate(subxactlist, i + 1);
+
+ /*
+ * Reopen the file and set the file position to the saved
+ * position.
+ */
+ if (reopen_stream_fd)

It seems a bit vague to just refer to "close the file" and "reopen the
file" in these comments. IMO it would be better to call this file by a
name like "the message spool file" or similar. Please check all other
similar comments.

~~~

13. pa_set_fileset_state

 /*
+ * Set the fileset_state flag for the given parallel apply worker. The
+ * stream_fileset of the leader apply worker will be written into the shared
+ * memory if the fileset_state is LEADER_FILESET_ACCESSIBLE.
+ */
+void
+pa_set_fileset_state(ParallelApplyWorkerShared *wshared,
+ LeaderFileSetState fileset_state)
+{

13a.

It is an enum -- not a "flag", so:

"fileset_state flag" -> "fileset state"

~~

13b.

It seemed strange to me that the comment/code says this state is only
written to shm when it is "ACCESSIBLE".... IIUC this same filestate
lingers around to be reused for other workers so I expected the state
should *always* be written whenever the LA changes it. (I mean even if
the PA is not needing to look at this member, I still think it should
have the current/correct value in it).

======

src/backend/replication/logical/worker.c

14. TRANS_LEADER_SEND_TO_PARALLEL

+ * TRANS_LEADER_PARTIAL_SERIALIZE:
+ * The action means that we are in the leader apply worker and have sent some
+ * changes to the parallel apply worker, but the remaining changes need to be
+ * serialized to disk due to timeout while sending data, and the parallel apply
+ * worker will apply these changes when the final commit arrives.
+ *
+ * One might think we can use LEADER_SERIALIZE directly. But in partial
+ * serialize mode, in addition to serializing changes to file, the leader
+ * worker needs to write the STREAM_XXX message to disk, and needs to wait for
+ * parallel apply worker to finish the transaction when processing the
+ * transaction finish command. So a new action was introduced to make the logic
+ * clearer.
+ *
  * TRANS_LEADER_SEND_TO_PARALLEL:


SUGGESTION (Minor wording changes)
The action means that we are in the leader apply worker and have sent
some changes directly to the parallel apply worker, due to timeout
while sending data the remaining changes need to be serialized to
disk. The parallel apply worker will apply these serialized changes
when the final commit arrives.

LEADER_SERIALIZE could not be used for this case because, in addition
to serializing changes, the leader worker also needs to write the
STREAM_XXX message to disk, and wait for the parallel apply worker to
finish the transaction when processing the transaction finish command.
So this new action was introduced to make the logic clearer.

~

15.
  /* Actions for streaming transactions. */
  TRANS_LEADER_SERIALIZE,
+ TRANS_LEADER_PARTIAL_SERIALIZE,
  TRANS_LEADER_SEND_TO_PARALLEL,
  TRANS_PARALLEL_APPLY

Although it makes no difference I felt it would be better to put
TRANS_LEADER_PARTIAL_SERIALIZE *after* TRANS_LEADER_SEND_TO_PARALLEL
because that would be the order that these mode changes occur in the
logic...

~~~

16.

@@ -375,7 +388,7 @@ typedef struct ApplySubXactData
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};

 static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
-static inline void changes_filename(char *path, Oid subid, TransactionId xid);
+inline void changes_filename(char *path, Oid subid, TransactionId xid);

IIUC (see [1]) when this function was made non-static the "inline"
should have been put into the header file.

~

17.
@@ -388,10 +401,9 @@ static inline void cleanup_subxact_info(void);
 /*
  * Serialize and deserialize changes for a toplevel transaction.
  */
-static void stream_cleanup_files(Oid subid, TransactionId xid);
 static void stream_open_file(Oid subid, TransactionId xid,
  bool first_segment);
-static void stream_write_change(char action, StringInfo s);
+static void stream_write_message(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);

17a.

I felt just saying "file/files" is too vague. All the references to
the file should be consistent, so IMO everything would be better named
like:

"stream_cleanup_files" -> "stream_msg_spoolfile_cleanup()"
"stream_open_file" ->  "stream_msg_spoolfile_open()"
"stream_close_file" -> "stream_msg_spoolfile_close()"
"stream_write_message" -> "stream_msg_spoolfile_write_msg()"

~

17b.
IMO there is not enough distinction here between function names
stream_write_message and stream_write_change. e.g. You cannot really
tell from their names what might be the difference.

~~~

18.

@@ -586,6 +595,7 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)
  TransactionId current_xid;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg;

  apply_action = get_transaction_apply_action(stream_xid, &winfo);

@@ -595,6 +605,8 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)

  Assert(TransactionIdIsValid(stream_xid));

+ original_msg = *s;
+
  /*
  * We should have received XID of the subxact as the first part of the
  * message, so extract it.
@@ -618,10 +630,14 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)
  stream_write_change(action, s);
  return true;

+ case TRANS_LEADER_PARTIAL_SERIALIZE:
  case TRANS_LEADER_SEND_TO_PARALLEL:
  Assert(winfo);

- pa_send_data(winfo, s->len, s->data);
+ if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+ pa_send_data(winfo, s->len, s->data);
+ else
+ stream_write_change(action, &original_msg);

The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE
case so I think it should only be declared/assigned in the scope of
that 'else'

~~

19. apply_handle_stream_prepare

@@ -1316,13 +1335,21 @@ apply_handle_stream_prepare(StringInfo s)
  pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);

  /* Send STREAM PREPARE message to the parallel apply worker. */
- pa_send_data(winfo, s->len, s->data);
+ if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+ pa_send_data(winfo, s->len, s->data);
+ else
+ stream_write_message(prepare_data.xid,
+ LOGICAL_REP_MSG_STREAM_PREPARE,
+ &original_msg);


The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE
case so I think it should only be declared/assigned in the scope of
that 'else'

~

20.

+ /*
+ * Close the file before committing if the parallel apply is
+ * applying spooled changes.
+ */
+ if (stream_fd)
+ BufFileClose(stream_fd);

I found this a bit confusing because there is already a
stream_close_file() wrapper function which does almost the same as
this. So either this code should be calling that function, or the
comment here should be explaining why this code is NOT calling that
function.

~~~

21. serialize_stream_start

+/*
+ * Initialize fileset (if not already done).
+ *
+ * Create a new file when first_segment is true, otherwise open the existing
+ * file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)

IMO this function should be called stream_msg_spoolfile_init() or
stream_msg_spoolfile_begin() to match the pattern for function names
of the message spool file that I previously suggested. (see review
comment #17a)

~

22.

+ /*
+ * Initialize the worker's stream_fileset if we haven't yet. This will be
+ * used for the entire duration of the worker so create it in a permanent
+ * context. We create this on the very first streaming message from any
+ * transaction and then use it for this and other streaming transactions.
+ * Now, we could create a fileset at the start of the worker as well but
+ * then we won't be sure that it will ever be used.
+ */
+ if (!MyLogicalRepWorker->stream_fileset)

I assumed this is a typo "Now," --> "Note," ?

~~~

23. apply_handle_stream_start

@@ -1404,6 +1478,7 @@ apply_handle_stream_start(StringInfo s)
  bool first_segment;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg = *s;

The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE
case so I think it should only be declared/assigned in the scope of
that 'else'

~

24.

  /*
- * Start a transaction on stream start, this transaction will be
- * committed on the stream stop unless it is a tablesync worker in
- * which case it will be committed after processing all the
- * messages. We need the transaction for handling the buffile,
- * used for serializing the streaming data and subxact info.
+ * serialize_stream_start will start a transaction, this
+ * transaction will be committed on the stream stop unless it is a
+ * tablesync worker in which case it will be committed after
+ * processing all the messages. We need the transaction for
+ * handling the buffile, used for serializing the streaming data
+ * and subxact info.
  */
- begin_replication_step();
+ serialize_stream_start(stream_xid, first_segment);
+ break;

Make the comment a bit more natural.

SUGGESTION

Function serialize_stream_start starts a transaction. This transaction
will be committed on the stream stop unless it is a tablesync worker
in which case it will be committed after processing all the messages.
We need this transaction for handling the BufFile, used for
serializing the streaming data and subxact info.

~

25.

+ case TRANS_LEADER_PARTIAL_SERIALIZE:
  /*
- * Initialize the worker's stream_fileset if we haven't yet. This
- * will be used for the entire duration of the worker so create it
- * in a permanent context. We create this on the very first
- * streaming message from any transaction and then use it for this
- * and other streaming transactions. Now, we could create a
- * fileset at the start of the worker as well but then we won't be
- * sure that it will ever be used.
+ * The file should have been created when entering
+ * PARTIAL_SERIALIZE mode so no need to create it again. The
+ * transaction started in serialize_stream_start will be committed
+ * on the stream stop.
  */
- if (!MyLogicalRepWorker->stream_fileset)

BEFORE
The file should have been created when entering PARTIAL_SERIALIZE mode
so no need to create it again.

SUGGESTION
The message spool file was already created when entering PARTIAL_SERIALIZE mode.

~~~

26. serialize_stream_stop

 /*
+ * Update the information about subxacts and close the file.
+ *
+ * This function should be called when the serialize_stream_start function has
+ * been called.
+ */
+void
+serialize_stream_stop(TransactionId xid)

Maybe 2nd part of that comment should be something more like

SUGGESTION
This function ends what was started by the function serialize_stream_start().

~

27.

+ /*
+ * Close the file with serialized changes, and serialize information about
+ * subxacts for the toplevel transaction.
+ */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
+ stream_close_file();

Should the comment and the code be in the same order?

SUGGESTION
Serialize information about subxacts for the toplevel transaction,
then close the stream messages spool file.

~~~

28. handle_stream_abort

+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ /*
+ * Parallel apply worker might have applied some changes, so write
+ * the STREAM_ABORT message so that the parallel apply worker can
+ * rollback the subtransaction if needed.
+ */
+ stream_write_message(xid, LOGICAL_REP_MSG_STREAM_ABORT,
+ &original_msg);
+

28a.
The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE
case so I think it should only be declared/assigned in the scope of
that case.

~

28b.
"so that the parallel apply worker can" -> "so that it can"


~~~

29. apply_spooled_messages

+void
+apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
+    XLogRecPtr lsn)
 {
  StringInfoData s2;
  int nchanges;
  char path[MAXPGPATH];
  char    *buffer = NULL;
  MemoryContext oldcxt;
- BufFile    *fd;

- maybe_start_skipping_changes(lsn);
+ if (!am_parallel_apply_worker())
+ maybe_start_skipping_changes(lsn);

  /* Make sure we have an open transaction */
  begin_replication_step();
@@ -1810,8 +1913,8 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
  changes_filename(path, MyLogicalRepWorker->subid, xid);
  elog(DEBUG1, "replaying changes from file \"%s\"", path);

- fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path, O_RDONLY,
- false);
+ stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);
+ stream_xid = xid;

IMO it seems strange that the fileset is passed as a parameter
but then the resulting fd is always assigned to a single global
variable (regardless of which fileset was passed).

~

30.

- BufFileClose(fd);
-
+ BufFileClose(stream_fd);
  pfree(buffer);
  pfree(s2.data);

+done:
+ stream_fd = NULL;
+ stream_xid = InvalidTransactionId;
+

This code fragment seems to be doing almost the same as what function
stream_close_file() is doing. Should you just call that instead?

~~~

31. apply_handle_stream_commit

+ if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+ pa_send_data(winfo, s->len, s->data);
+ else
+ stream_write_message(xid, LOGICAL_REP_MSG_STREAM_COMMIT,
+ &original_msg);

The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE
case so I think it should only be declared/assigned in the scope of
that 'else'

~

32.

  case TRANS_PARALLEL_APPLY:
+
+ /*
+ * Close the file before committing if the parallel apply is
+ * applying spooled changes.
+ */
+ if (stream_fd)
+ BufFileClose(stream_fd);

(Same as earlier review comment #20)

IMO this is confusing because there is already a stream_close_file()
wrapper function that does almost the same. So either this code should
be calling that function, or the comment here should explain why this
code is NOT calling that function.


======

src/include/replication/worker_internal.h

33. LeaderFileSetState

+/* State of fileset in leader apply worker. */
+typedef enum LeaderFileSetState
+{
+ LEADER_FILESET_UNKNOWN,
+ LEADER_FILESET_BUSY,
+ LEADER_FILESET_ACCESSIBLE
+} LeaderFileSetState;

33a.

Missing from typedefs.list?

~

33b.

I thought some more explanatory comments for the meaning of
BUSY/ACCESSIBLE should be here.

~

33c.

READY might be a better value than ACCESSIBLE

~

33d.
I'm not sure what usefulness the "LEADER_" and "Leader" prefixes
give here. Maybe a name like PartialFileSetState is more meaningful?

e.g. like this?

typedef enum PartialFileSetState
{
FS_UNKNOWN,
FS_BUSY,
FS_READY
} PartialFileSetState;

~~~

34. ParallelApplyWorkerShared

+ /*
+ * The leader apply worker will serialize changes to the file after
+ * entering PARTIAL_SERIALIZE mode and share the fileset with the parallel
+ * apply worker when processing the transaction finish command. And then
+ * the parallel apply worker will apply all the spooled messages.
+ *
+ * Don't use SharedFileSet here as we need the fileset to survive after
+ * releasing the shared memory so that the leader apply worker can re-use
+ * the fileset for next streaming transaction.
+ */
+ LeaderFileSetState fileset_state;
+ FileSet fileset;

Minor rewording of that comment

SUGGESTION
After entering PARTIAL_SERIALIZE mode, the leader apply worker will
serialize changes to the file, and share the fileset with the parallel
apply worker when processing the transaction finish command. Then the
parallel apply worker will apply all the spooled messages.

FileSet is used here instead of SharedFileSet because we need it to
survive after releasing the shared memory so that the leader apply
worker can re-use the same fileset for the next streaming transaction.

~~~

35. globals

  /*
+ * Indicates whether the leader apply worker needs to serialize the
+ * remaining changes to disk due to timeout when sending data to the
+ * parallel apply worker.
+ */
+ bool serialize_changes;

35a.
I wonder if the comment would be better to also mention "via shared memory".

SUGGESTION

Indicates whether the leader apply worker needs to serialize the
remaining changes to disk due to timeout when attempting to send data
to the parallel apply worker via shared memory.

~

35b.
I wonder if a more informative variable name might be
serialize_remaining_changes?

------
[1] https://stackoverflow.com/questions/17504316/what-happens-with-an-extern-inline-function

Kind Regards,
Peter Smith.
Fujitsu Australia



On Sun, Nov 27, 2022 at 9:43 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new version patch which addressed all comments so far.
>

Few comments on v52-0001*
========================
1.
pa_free_worker()
{
...
+ /* Free the worker information if the worker exited cleanly. */
+ if (!winfo->error_mq_handle)
+ {
+ pa_free_worker_info(winfo);
+
+ if (winfo->in_use &&
+ !hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");

pa_free_worker_info() pfrees the winfo, so how is it legal to access
winfo->in_use in the above check?

Also, why is this check (!winfo->error_mq_handle) required in the
first place in the patch? The worker exits cleanly only when the
leader apply worker sends a SIGINT signal and in that case, we already
detach from the error queue and clean up other worker information.

2.
+HandleParallelApplyMessages(void)
+{
...
...
+ foreach(lc, ParallelApplyWorkersList)
+ {
+ shm_mq_result res;
+ Size nbytes;
+ void    *data;
+ ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
+
+ if (!winfo->error_mq_handle)
+ continue;

Similar to the previous comment, it is not clear whether we need this
check. If required, can we add a comment to indicate the case where it
happens to be true?

Note, there is a similar check for winfo->error_mq_handle in
pa_wait_for_xact_state(). Please add some comments if that is
required.

3. Why is there apply_worker_clean_exit() at the end of
ParallelApplyWorkerMain()? Normally either the leader worker stops
parallel apply, or parallel apply gets stopped because of a parameter
change, or exits because of error, and in none of those cases it can
hit this code path unless I am missing something.

Additionally, I think in LogicalParallelApplyLoop, we will never
receive zero-length messages so that is also wrong and should be
converted to elog(ERROR,..).

4. I think in logicalrep_worker_detach(), we should detach from the
shm error queue so that the parallel apply worker won't try to send a
termination message back to the leader worker.

5.
pa_send_data()
{
...
+ if (startTime == 0)
+ startTime = GetCurrentTimestamp();
...

What is the use of getting the current timestamp before waitlatch
logic, if it is not used before that? It seems that is for the time
logic to look correct. We can probably reduce the 10s interval to 9s
for that.

In this function, we need to add some comments to indicate why the
current logic is used, and also probably we can refer to the comments
atop this file.

6. I think it will be better if we keep stream_apply_worker local to
applyparallelworker.c by exposing functions to cache/reset the
required info.

7. Apart from the above, I have made a few changes in the comments and
some miscellaneous cosmetic changes in the attached. Kindly include
these in the next version unless you see a problem with any change.

-- 
With Regards,
Amit Kapila.

Attachment
On Mon, Nov 28, 2022 at 12:49 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
...
>
> 17.
> @@ -388,10 +401,9 @@ static inline void cleanup_subxact_info(void);
>  /*
>   * Serialize and deserialize changes for a toplevel transaction.
>   */
> -static void stream_cleanup_files(Oid subid, TransactionId xid);
>  static void stream_open_file(Oid subid, TransactionId xid,
>   bool first_segment);
> -static void stream_write_change(char action, StringInfo s);
> +static void stream_write_message(TransactionId xid, char action, StringInfo s);
>  static void stream_close_file(void);
>
> 17a.
>
> I felt just saying "file/files" is too vague. All the references to
> the file should be consistent, so IMO everything would be better named
> like:
>
> "stream_cleanup_files" -> "stream_msg_spoolfile_cleanup()"
> "stream_open_file" ->  "stream_msg_spoolfile_open()"
> "stream_close_file" -> "stream_msg_spoolfile_close()"
> "stream_write_message" -> "stream_msg_spoolfile_write_msg()"
>
> ~
>
> 17b.
> IMO there is not enough distinction here between function names
> stream_write_message and stream_write_change. e.g. You cannot really
> tell from their names what might be the difference.
>
> ~~~
>

I think the only new function needed by this patch is
stream_write_message, so I don't see why we should change all the others
for that. I see two possibilities to make the name better: (a) name the
function stream_open_and_write_change, or (b) pass a new argument (boolean
open) to stream_write_change.
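
To make the two options concrete (declarations only, sketched from the
suggestion above; the boolean parameter name is illustrative):

    /* (a) a separate wrapper that opens the spool file and writes the change */
    static void stream_open_and_write_change(TransactionId xid, char action,
                                             StringInfo s);

    /* (b) extend the existing function with an 'open' flag */
    static void stream_write_change(char action, StringInfo s, bool open_file);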

...
>
> src/include/replication/worker_internal.h
>
> 33. LeaderFileSetState
>
> +/* State of fileset in leader apply worker. */
> +typedef enum LeaderFileSetState
> +{
> + LEADER_FILESET_UNKNOWN,
> + LEADER_FILESET_BUSY,
> + LEADER_FILESET_ACCESSIBLE
> +} LeaderFileSetState;
>
> 33a.
>
> Missing from typedefs.list?
>
> ~
>
> 33b.
>
> I thought some more explanatory comments for the meaning of
> BUSY/ACCESSIBLE should be here.
>
> ~
>
> 33c.
>
> READY might be a better value than ACCESSIBLE
>
> ~
>
> 33d.
> I'm not sure what usefulness the "LEADER_" and "Leader" prefixes
> give here. Maybe a name like PartialFileSetState is more meaningful?
>
> e.g. like this?
>
> typedef enum PartialFileSetState
> {
> FS_UNKNOWN,
> FS_BUSY,
> FS_READY
> } PartialFileSetState;
>
> ~
>

All your suggestions in this point look good to me.

>
> ~~~
>
>
> 35. globals
>
>   /*
> + * Indicates whether the leader apply worker needs to serialize the
> + * remaining changes to disk due to timeout when sending data to the
> + * parallel apply worker.
> + */
> + bool serialize_changes;
>
> 35a.
> I wonder if the comment would be better to also mention "via shared memory".
>
> SUGGESTION
>
> Indicates whether the leader apply worker needs to serialize the
> remaining changes to disk due to timeout when attempting to send data
> to the parallel apply worker via shared memory.
>
> ~
>

I think the comment should say " .. the leader apply worker serialized
remaining changes ..."

> 35b.
> I wonder if a more informative variable name might be
> serialize_remaining_changes?
>

I think this needlessly makes the variable name long.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Mon, November 28, 2022 20:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Nov 27, 2022 at 9:43 AM houzj.fnst@fujitsu.com 
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the new version patch which addressed all comments so far.
> >
> 
> Few comments on v52-0001*
> ========================
> 1.
> pa_free_worker()
> {
> ...
> + /* Free the worker information if the worker exited cleanly. */ if 
> + (!winfo->error_mq_handle) { pa_free_worker_info(winfo);
> +
> + if (winfo->in_use &&
> + !hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL)) 
> + elog(ERROR, "hash table corrupted");
> 
> pa_free_worker_info() pfrees the winfo, so how is it legal to access
> winfo->in_use in the above check?
> 
> Also, why is this check (!winfo->error_mq_handle) required in the 
> first place in the patch? The worker exits cleanly only when the 
> leader apply worker sends a SIGINT signal and in that case, we already 
> detach from the error queue and clean up other worker information.

It was intended for the case when a user sends a signal, but that does not seem
to be a standard way to do it. So, I removed this check (!winfo->error_mq_handle).

> 2.
> +HandleParallelApplyMessages(void)
> +{
> ...
> ...
> + foreach(lc, ParallelApplyWorkersList) { shm_mq_result res; Size 
> + nbytes;
> + void    *data;
> + ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) 
> + lfirst(lc);
> +
> + if (!winfo->error_mq_handle)
> + continue;
> 
> Similar to the previous comment, it is not clear whether we need this 
> check. If required, can we add a comment to indicate the case where it 
> happens to be true?
> Note, there is a similar check for winfo->error_mq_handle in 
> pa_wait_for_xact_state(). Please add some comments if that is 
> required.

Removed this check in these two functions.

> 3. Why is there apply_worker_clean_exit() at the end of 
> ParallelApplyWorkerMain()? Normally either the leader worker stops 
> parallel apply, or parallel apply gets stopped because of a parameter 
> change, or exits because of error, and in none of those cases it can 
> hit this code path unless I am missing something.
> 
> Additionally, I think in LogicalParallelApplyLoop, we will never 
> receive zero-length messages so that is also wrong and should be 
> converted to elog(ERROR,..).

Agreed and changed. 
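
For the zero-length case, the receive path can simply treat an empty message
as an error, along these lines (simplified sketch, not the exact patch code):

    void       *data;
    Size        len;
    shm_mq_result shmq_res;

    shmq_res = shm_mq_receive(mqh, &len, &data, false);

    if (shmq_res != SHM_MQ_SUCCESS)
        break;      /* the leader detached from the queue */

    /* The leader never sends an empty message, so treat it as an error. */
    if (len == 0)
        elog(ERROR, "invalid message length");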

> 4. I think in logicalrep_worker_detach(), we should detach from the 
> shm error queue so that the parallel apply worker won't try to send a 
> termination message back to the leader worker.

Agreed and changed.

> 5.
> pa_send_data()
> {
> ...
> + if (startTime == 0)
> + startTime = GetCurrentTimestamp();
> ...
> 
> What is the use of getting the current timestamp before waitlatch 
> logic, if it is not used before that? It seems that is for the time 
> logic to look correct. We can probably reduce the 10s interval to 9s 
> for that.

Changed.

> In this function, we need to add some comments to indicate why the 
> current logic is used, and also probably we can refer to the comments 
> atop this file.

Added some comments.
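
The overall send logic being discussed here is roughly the following (a
simplified sketch only; the constant names, wait-event name, and field names
are placeholders rather than the exact ones in the patch):

    TimestampTz startTime = 0;

    for (;;)
    {
        /* Try a non-blocking send into the parallel apply worker's queue. */
        result = shm_mq_send(winfo->mq_handle, nbytes, data,
                             true /* nowait */ , true /* force_flush */ );

        if (result == SHM_MQ_SUCCESS)
            return;
        else if (result == SHM_MQ_DETACHED)
            ereport(ERROR, ...);    /* the parallel apply worker has gone away */

        /* SHM_MQ_WOULD_BLOCK: wait a little and retry. */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       SHM_SEND_RETRY_INTERVAL_MS,
                       WAIT_EVENT_LOGICAL_APPLY_SEND_DATA);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }

        /* Start measuring only after the first send attempt has failed. */
        if (startTime == 0)
            startTime = GetCurrentTimestamp();
        else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
                                            SHM_SEND_TIMEOUT_MS))
        {
            /* Give up on the queue; switch to partial serialize mode. */
            winfo->serialize_changes = true;
            break;      /* the caller serializes this and later data to a file */
        }
    }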

> 6. I think it will be better if we keep stream_apply_worker local to 
> applyparallelworker.c by exposing functions to cache/reset the 
> required info.

Agreed. Added a new function to set the stream_apply_worker.

> 7. Apart from the above, I have made a few changes in the comments and 
> some miscellaneous cosmetic changes in the attached. Kindly include 
> these in the next version unless you see a problem with any change.

Thanks, I have checked and merged them.

Attach the new version patch which addressed all comments.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Mon, November 28, 2022 15:19 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for patch v51-0002

Thanks for your comments!

> ======
> 
> 1.
> 
> GENERAL - terminology:  spool/serialize and data/changes/message
> 
> The terminology seems to be used at random. IMO it might be worthwhile 
> rechecking at least that terms are used consistently in all the 
> comments. e.g "serialize message data to disk" ... and later ...
> "apply the spooled messages".
> 
> Also for places where it says "Write the message to file" maybe 
> consider using consistent terminology like "serialize the message to a 
> file".
> 
> Also, try to standardize the way things are described by using 
> consistent (if they really are the same) terminology for "writing 
> data" VS "writing changes" VS "writing messages" etc. It is confusing 
> trying to know if the different wording has some intended meaning or 
> is it just random.

I changed some of them, but I think there are some things left which I will recheck in the next version.
And I think we'd better not change comments that refer to existing comments, functions, or variables.
For example, it's fine for comments that refer to apply_spooled_messages to use "spool" and "message".


> ======
> 
> Commit message
> 
> 2.
> When the leader apply worker times out while sending a message to the 
> parallel apply worker. Instead of erroring out, switch to partial 
> serialize mode and let the leader serialize all remaining changes to 
> the file and notify the parallel apply workers to read and apply them at the end of the transaction.
> 
> ~
> 
> The first sentence seems incomplete
> 
> SUGGESTION.
> In patch 0001 if the leader apply worker times out while attempting to 
> send a message to the parallel apply worker it results in an ERROR.
> 
> This patch (0002) modifies that behaviour, so instead of erroring it 
> will switch to "partial serialize" mode -  in this mode the leader 
> serializes all remaining changes to a file and notifies the parallel 
> apply workers to read and apply them at the end of the transaction.
> 
> ~~~
> 
> 3.
> 
> This patch 0002 is called “Serialize partial changes to disk if the 
> shm_mq buffer is full”, but the commit message is saying nothing about 
> the buffer filling up. I think the Commit message should be mentioning 
> something that makes the commit patch name more relevant. Otherwise 
> change the patch name.

Changed.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 4. File header comment
> 
> + * timeout is exceeded, the LA will write to file and indicate PA-2 
> + that it
> + * needs to read file for remaining messages. Then LA will start 
> + waiting for
> + * commit which will detect deadlock if any. (See pa_send_data() and 
> + typedef
> + * enum TransApplyAction)
> 
> "needs to read file for remaining messages" -> "needs to read that 
> file for the remaining messages"

Changed.

> ~~~
> 
> 5. pa_free_worker
> 
> + /*
> + * Stop the worker if there are enough workers in the pool.
> + *
> + * XXX we also need to stop the worker if the leader apply worker
> + * serialized part of the transaction data to a file due to send timeout.
> + * This is because the message could be partially written to the 
> + queue due
> + * to send timeout and there is no way to clean the queue other than
> + * resending the message until it succeeds. To avoid complexity, we
> + * directly stop the worker in this case.
> + */
> + if (winfo->serialize_changes ||
> + napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> 
> 5a.
> 
> + * XXX we also need to stop the worker if the leader apply worker
> + * serialized part of the transaction data to a file due to send timeout.
> 
> SUGGESTION
> XXX The worker is also stopped if the leader apply worker needed to 
> serialize part of the transaction data due to a send timeout.
> 
> ~
> 
> 5b.
> 
> + /* Unlink the files with serialized changes. */ if
> + (winfo->serialize_changes)
> + stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);
> 
> A better comment might be
> 
> SUGGESTION
> Unlink any files that were needed to serialize partial changes.

Changed.

> ~~~
> 
> 6. pa_spooled_messages
> 
> /*
>  * Replay the spooled messages in the parallel apply worker if leader 
> apply
>  * worker has finished serializing changes to the file.
>  */
> static void
> pa_spooled_messages(void)
> 
> 6a.
> IMO a better name for this function would be 
> pa_apply_spooled_messages();

Not sure about this.

> ~
> 
> 6b.
> "if leader apply" -> "if the leader apply"

Changed.

> ~
> 
> 7.
> 
> + /*
> + * Acquire the stream lock if the leader apply worker is serializing
> + * changes to the file, because the parallel apply worker will no 
> + longer
> + * have a chance to receive a STREAM_STOP and acquire the lock until 
> + the
> + * leader serialize all changes to the file.
> + */
> + if (fileset_state == LEADER_FILESET_BUSY) { 
> + pa_lock_stream(MyParallelShared->xid, AccessShareLock); 
> + pa_unlock_stream(MyParallelShared->xid, AccessShareLock); }
> 
> SUGGESTION (rearranged comment - please check, I am not sure if I got 
> this right)
> 
> If the leader apply worker is still (busy) serializing partial changes 
> then the parallel apply worker acquires the stream lock now.
> Otherwise, it would not have a chance to receive a STREAM_STOP (and 
> acquire the stream lock) until the leader had serialized all changes.

Changed.

> ~~~
> 
> 8. pa_send_data
> 
> + *
> + * When sending data times out, data will be serialized to disk. And 
> + the
> + * current streaming transaction will enter PARTIAL_SERIALIZE mode, 
> + which
> means
> + * that subsequent data will also be serialized to disk.
>   */
>  void
>  pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void
> *data)
> 
> SUGGESTION (minor comment change)
> 
> If the attempt to send data via shared memory times out, then we will 
> switch to "PARTIAL_SERIALIZE mode" for the current transaction. This 
> means that the current data and any subsequent data for this 
> transaction will be serialized to disk.

Changed.

> ~
> 
> 9.
> 
>   Assert(!IsTransactionState());
> + Assert(!winfo->serialize_changes);
> 
> How about also asserting that this must be the LA worker?

Not sure about this as I think the parallel apply worker won't have a winfo.

> ~
> 
> 10.
> 
> + /*
> + * The parallel apply worker might be stuck for some reason, so
> + * stop sending data to parallel worker and start to serialize
> + * data to files.
> + */
> + winfo->serialize_changes = true;
> 
> SUGGESTION (minor reword)
> The parallel apply worker might be stuck for some reason, so stop 
> sending data directly to it and start to serialize data to files 
> instead.

Changed.

> ~
> 
> 11.
> + /* Skip first byte and statistics fields. */ msg.cursor += 
> + SIZE_STATS_MESSAGE + 1;
> 
> IMO it would be better for the comment order and the code calculation 
> order to be the same.
> 
> SUGGESTION
> /* Skip first byte and statistics fields. */ msg.cursor += 1 + 
> SIZE_STATS_MESSAGE;

Changed.

> ~
> 
> 12. pa_stream_abort
> 
> + /*
> + * If the parallel apply worker is applying the spooled
> + * messages, we save the current file position and close the
> + * file to prevent the file from being accidentally closed on
> + * rollback.
> + */
> + if (stream_fd)
> + {
> + BufFileTell(stream_fd, &fileno, &offset); BufFileClose(stream_fd); 
> + reopen_stream_fd = true; }
> +
>   RollbackToSavepoint(spname);
>   CommitTransactionCommand();
>   subxactlist = list_truncate(subxactlist, i + 1);
> +
> + /*
> + * Reopen the file and set the file position to the saved
> + * position.
> + */
> + if (reopen_stream_fd)
> 
> It seems a bit vague to just refer to "close the file" and "reopen the 
> file" in these comments. IMO it would be better to call this file by a 
> name like "the message spool file" or similar. Please check all other 
> similar comments.

Changed.

> ~~~
> 
> 13. pa_set_fileset_state
> 
>  /*
> + * Set the fileset_state flag for the given parallel apply worker. 
> +The
> + * stream_fileset of the leader apply worker will be written into the 
> +shared
> + * memory if the fileset_state is LEADER_FILESET_ACCESSIBLE.
> + */
> +void
> +pa_set_fileset_state(ParallelApplyWorkerShared *wshared, 
> +LeaderFileSetState fileset_state) {
> 
> 13a.
> 
> It is an enum -- not a "flag", so:
> 
> "fileset_state flag" -> "fileset state"

Changed.

> ~~
> 
> 13b.
> 
> It seemed strange to me that the comment/code says this state is only 
> written to shm when it is "ACCESSIBLE".... IIUC this same filestate 
> lingers around to be reused for other workers so I expected the state 
> should *always* be written whenever the LA changes it. (I mean even if 
> the PA is not needing to look at this member, I still think it should 
> have the current/correct value in it).

I think we will always change the state.
Or do you mean the fileset is only written (not the state) when it is ACCESSIBLE?
The fileset cannot be used before it's READY, so I didn't write the fileset into
shared memory before that.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 14. TRANS_LEADER_SEND_TO_PARALLEL
> 
> + * TRANS_LEADER_PARTIAL_SERIALIZE:
> + * The action means that we are in the leader apply worker and have 
> + sent
> some
> + * changes to the parallel apply worker, but the remaining changes 
> + need to be
> + * serialized to disk due to timeout while sending data, and the 
> + parallel apply
> + * worker will apply these changes when the final commit arrives.
> + *
> + * One might think we can use LEADER_SERIALIZE directly. But in 
> + partial
> + * serialize mode, in addition to serializing changes to file, the 
> + leader
> + * worker needs to write the STREAM_XXX message to disk, and needs to 
> + wait
> for
> + * parallel apply worker to finish the transaction when processing 
> + the
> + * transaction finish command. So a new action was introduced to make 
> + the
> logic
> + * clearer.
> + *
>   * TRANS_LEADER_SEND_TO_PARALLEL:
> 
> 
> SUGGESTION (Minor wording changes)
> The action means that we are in the leader apply worker and have sent 
> some changes directly to the parallel apply worker, due to timeout 
> while sending data the remaining changes need to be serialized to 
> disk. The parallel apply worker will apply these serialized changes 
> when the final commit arrives.
> 
> LEADER_SERIALIZE could not be used for this case because, in addition 
> to serializing changes, the leader worker also needs to write the 
> STREAM_XXX message to disk, and wait for the parallel apply worker to 
> finish the transaction when processing the transaction finish command.
> So this new action was introduced to make the logic clearer.

Changed.

> ~
> 
> 15.
>   /* Actions for streaming transactions. */
>   TRANS_LEADER_SERIALIZE,
> + TRANS_LEADER_PARTIAL_SERIALIZE,
>   TRANS_LEADER_SEND_TO_PARALLEL,
>   TRANS_PARALLEL_APPLY
> 
> Although it makes no difference I felt it would be better to put 
> TRANS_LEADER_PARTIAL_SERIALIZE *after* TRANS_LEADER_SEND_TO_PARALLEL 
> because that would be the order that these mode changes occur in the 
> logic...

I thought that it is fine as it follows LEADER_SERIALIZE which is similar to
LEADER_PARTIAL_SERIALIZE.

> ~~~
> 
> 16.
> 
> @@ -375,7 +388,7 @@ typedef struct ApplySubXactData  static 
> ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
> 
>  static inline void subxact_filename(char *path, Oid subid, 
> TransactionId xid); -static inline void changes_filename(char *path, 
> Oid subid, TransactionId xid);
> +inline void changes_filename(char *path, Oid subid, TransactionId 
> +xid);
> 
> IIUC (see [1]) when this function was made non-static the "inline"
> should have been put into the header file.

Changed this function from "inline void" to "void", as I am not sure whether it is
better to put this function's definition in the header file.
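
For reference, the two usual alternatives here would be (sketch only):

    /*
     * Keep it inline: then the definition itself has to move into the header
     * (worker_internal.h) so that callers in other files can actually inline it.
     */
    static inline void
    changes_filename(char *path, Oid subid, TransactionId xid)
    {
        /* existing body from worker.c */
    }

    /* Or, as done now: a plain function declared in the header. */
    extern void changes_filename(char *path, Oid subid, TransactionId xid);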

> ~
> 
> 17.
> @@ -388,10 +401,9 @@ static inline void cleanup_subxact_info(void);
>  /*
>   * Serialize and deserialize changes for a toplevel transaction.
>   */
> -static void stream_cleanup_files(Oid subid, TransactionId xid); 
> static void stream_open_file(Oid subid, TransactionId xid,
>   bool first_segment);
> -static void stream_write_change(char action, StringInfo s);
> +static void stream_write_message(TransactionId xid, char action, 
> +StringInfo s);
>  static void stream_close_file(void);
> 
> 17a.
> 
> I felt just saying "file/files" is too vague. All the references to 
> the file should be consistent, so IMO everything would be better named
> like:
> 
> "stream_cleanup_files" -> "stream_msg_spoolfile_cleanup()"
> "stream_open_file" ->  "stream_msg_spoolfile_open()"
> "stream_close_file" -> "stream_msg_spoolfile_close()"
> "stream_write_message" -> "stream_msg_spoolfile_write_msg()"

Renamed the function stream_write_message to stream_open_and_write_change.

> ~
> 
> 17b.
> IMO there is not enough distinction here between function names 
> stream_write_message and stream_write_change. e.g. You cannot really 
> tell from their names what might be the difference.

Changed the name.

> ~~~
> 
> 18.
> 
> @@ -586,6 +595,7 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
>   TransactionId current_xid;
>   ParallelApplyWorkerInfo *winfo;
>   TransApplyAction apply_action;
> + StringInfoData original_msg;
> 
>   apply_action = get_transaction_apply_action(stream_xid, &winfo);
> 
> @@ -595,6 +605,8 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
> 
>   Assert(TransactionIdIsValid(stream_xid));
> 
> + original_msg = *s;
> +
>   /*
>   * We should have received XID of the subxact as the first part of the
>   * message, so extract it.
> @@ -618,10 +630,14 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
>   stream_write_change(action, s);
>   return true;
> 
> + case TRANS_LEADER_PARTIAL_SERIALIZE:
>   case TRANS_LEADER_SEND_TO_PARALLEL:
>   Assert(winfo);
> 
> - pa_send_data(winfo, s->len, s->data);
> + if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
> + pa_send_data(winfo, s->len, s->data);
> + else
> + stream_write_change(action, &original_msg);
> 
> The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE 
> case so I think it should only be declared/assigned in the scope of 
> that 'else'

The 'cursor' member of 's' is changed by invoking pq_getmsgint, so 'original_msg'
needs to be assigned before calling pq_getmsgint.

> ~
> 
> 20.
> 
> + /*
> + * Close the file before committing if the parallel apply is
> + * applying spooled changes.
> + */
> + if (stream_fd)
> + BufFileClose(stream_fd);
> 
> I found this a bit confusing because there is already a
> stream_close_file() wrapper function which does almost the same as 
> this. So either this code should be calling that function, or the 
> comment here should be explaining why this code is NOT calling that 
> function.

Changed.

> ~~~
> 
> 21. serialize_stream_start
> 
> +/*
> + * Initialize fileset (if not already done).
> + *
> + * Create a new file when first_segment is true, otherwise open the existing
> + * file.
> + */
> +void
> +serialize_stream_start(TransactionId xid, bool first_segment)
> 
> IMO this function should be called stream_msg_spoolfile_init() or
> stream_msg_spoolfile_begin() to match the pattern for function names 
> of the message spool file that I previously suggested. (see review 
> comment #17a)

I am not sure the suggested name is better. I will think this over and adjust it in the next version.

> ~
> 
> 22.
> 
> + /*
> + * Initialize the worker's stream_fileset if we haven't yet. This will be
> + * used for the entire duration of the worker so create it in a permanent
> + * context. We create this on the very first streaming message from any
> + * transaction and then use it for this and other streaming transactions.
> + * Now, we could create a fileset at the start of the worker as well but
> + * then we won't be sure that it will ever be used.
> + */
> + if (!MyLogicalRepWorker->stream_fileset)
> 
> I assumed this is a typo "Now," --> "Note," ?

That comes from the existing comment; I am not sure whether it's a typo or not.

> ~
> 
> 24.
> 
>   /*
> - * Start a transaction on stream start, this transaction will be
> - * committed on the stream stop unless it is a tablesync worker in
> - * which case it will be committed after processing all the
> - * messages. We need the transaction for handling the buffile,
> - * used for serializing the streaming data and subxact info.
> + * serialize_stream_start will start a transaction, this
> + * transaction will be committed on the stream stop unless it is a
> + * tablesync worker in which case it will be committed after
> + * processing all the messages. We need the transaction for
> + * handling the buffile, used for serializing the streaming data
> + * and subxact info.
>   */
> - begin_replication_step();
> + serialize_stream_start(stream_xid, first_segment); break;
> 
> Make the comment a bit more natural.
> 
> SUGGESTION
> 
> Function serialize_stream_start starts a transaction. This transaction 
> will be committed on the stream stop unless it is a tablesync worker 
> in which case it will be committed after processing all the messages.
> We need this transaction for handling the BufFile, used for 
> serializing the streaming data and subxact info.

Changed.

> ~
> 
> 25.
> 
> + case TRANS_LEADER_PARTIAL_SERIALIZE:
>   /*
> - * Initialize the worker's stream_fileset if we haven't yet. This
> - * will be used for the entire duration of the worker so create it
> - * in a permanent context. We create this on the very first
> - * streaming message from any transaction and then use it for this
> - * and other streaming transactions. Now, we could create a
> - * fileset at the start of the worker as well but then we won't be
> - * sure that it will ever be used.
> + * The file should have been created when entering
> + * PARTIAL_SERIALIZE mode so no need to create it again. The
> + * transaction started in serialize_stream_start will be committed
> + * on the stream stop.
>   */
> - if (!MyLogicalRepWorker->stream_fileset)
> 
> BEFORE
> The file should have been created when entering PARTIAL_SERIALIZE mode 
> so no need to create it again.
> 
> SUGGESTION
> The message spool file was already created when entering 
> PARTIAL_SERIALIZE mode.

Changed.

> ~~~
> 
> 26. serialize_stream_stop
> 
>  /*
> + * Update the information about subxacts and close the file.
> + *
> + * This function should be called when the serialize_stream_start function
> + * has been called.
> + */
> +void
> +serialize_stream_stop(TransactionId xid)
> 
> Maybe 2nd part of that comment should be something more like
> 
> SUGGESTION
> This function ends what was started by the function serialize_stream_start().

I am thinking about a new function name and will adjust this in the next version.

> ~
> 
> 27.
> 
> + /*
> + * Close the file with serialized changes, and serialize information about
> + * subxacts for the toplevel transaction.
> + */
> + subxact_info_write(MyLogicalRepWorker->subid, xid); 
> + stream_close_file();
> 
> Should the comment and the code be in the same order?
> 
> SUGGESTION
> Serialize information about subxacts for the toplevel transaction, 
> then close the stream messages spool file.

Changed.

> ~~~
> 
> 28. handle_stream_abort
> 
> + case TRANS_LEADER_PARTIAL_SERIALIZE:
> + Assert(winfo);
> +
> + /*
> + * Parallel apply worker might have applied some changes, so write
> + * the STREAM_ABORT message so that the parallel apply worker can
> + * rollback the subtransaction if needed.
> + */
> + stream_write_message(xid, LOGICAL_REP_MSG_STREAM_ABORT, 
> + &original_msg);
> +
> 
> 28a.
> The original_msg is not used except for TRANS_LEADER_PARTIAL_SERIALIZE 
> case so I think it should only be declared/assigned in the scope of 
> that case.
> 
> ~
> 
> 28b.
> "so that the parallel apply worker can" -> "so that it can"

Changed.

> ~~~
> 
> 29. apply_spooled_messages
> 
> +void
> +apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
> +    XLogRecPtr lsn)
>  {
>   StringInfoData s2;
>   int nchanges;
>   char path[MAXPGPATH];
>   char    *buffer = NULL;
>   MemoryContext oldcxt;
> - BufFile    *fd;
> 
> - maybe_start_skipping_changes(lsn);
> + if (!am_parallel_apply_worker())
> + maybe_start_skipping_changes(lsn);
> 
>   /* Make sure we have an open transaction */
>   begin_replication_step();
> @@ -1810,8 +1913,8 @@ apply_spooled_messages(TransactionId xid, 
> XLogRecPtr lsn)
>   changes_filename(path, MyLogicalRepWorker->subid, xid);
>   elog(DEBUG1, "replaying changes from file \"%s\"", path);
> 
> - fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path, 
> O_RDONLY,
> - false);
> + stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);
> + stream_xid = xid;
> 
> IMO it seems strange to me that the fileset is passed as a parameter 
> but then the resulting fd is always assigned to a single global 
> variable (regardless of what the fileset was passed).

I am not sure about this as we already have similar code in stream_open_file().

> ~
> 
> 30.
> 
> - BufFileClose(fd);
> -
> + BufFileClose(stream_fd);
>   pfree(buffer);
>   pfree(s2.data);
> 
> +done:
> + stream_fd = NULL;
> + stream_xid = InvalidTransactionId;
> +
> 
> This code fragment seems to be doing almost the same as what function
> stream_close_file() is doing. Should you just call that instead?

Changed.

> ======
> 
> src/include/replication/worker_internal.h
> 
> 33. LeaderFileSetState
> 
> +/* State of fileset in leader apply worker. */
> +typedef enum LeaderFileSetState
> +{
> + LEADER_FILESET_UNKNOWN,
> + LEADER_FILESET_BUSY,
> + LEADER_FILESET_ACCESSIBLE
> +} LeaderFileSetState;
> 
> 33a.
> 
> Missing from typedefs.list?
> 
> ~
> 
> 33b.
> 
> I thought some more explanatory comments for the meaning of 
> BUSY/ACCESSIBLE should be here.
>
> ~
> 
> 33c.
> 
> READY might be a better value than ACCESSIBLE
> 
> ~
> 
> 33d.
> I'm not sure what usefulness does the "LEADER_" and "Leader" prefixes 
> give here. Maybe a name like PartialFileSetStat is more meaningful?
> 
> e.g. like this?
> 
> typedef enum PartialFileSetState
> {
> FS_UNKNOWN,
> FS_BUSY,
> FS_READY
> } PartialFileSetState;

Changed.

> ~~~
> 
> 34. ParallelApplyWorkerShared
> 
> + /*
> + * The leader apply worker will serialize changes to the file after
> + * entering PARTIAL_SERIALIZE mode and share the fileset with the parallel
> + * apply worker when processing the transaction finish command. And then
> + * the parallel apply worker will apply all the spooled messages.
> + *
> + * Don't use SharedFileSet here as we need the fileset to survive after
> + * releasing the shared memory so that the leader apply worker can re-use
> + * the fileset for next streaming transaction.
> + */
> + LeaderFileSetState fileset_state;
> + FileSet fileset;
> 
> Minor rewording of that comment
> 
> SUGGESTION
> After entering PARTIAL_SERIALIZE mode, the leader apply worker will 
> serialize changes to the file, and share the fileset with the parallel 
> apply worker when processing the transaction finish command. Then the 
> parallel apply worker will apply all the spooled messages.
> 
> FileSet is used here instead of SharedFileSet because we need it to 
> survive after releasing the shared memory so that the leader apply 
> worker can re-use the same fileset for the next streaming transaction.

Changed.

> ~~~
> 
> 35. globals
> 
>   /*
> + * Indicates whether the leader apply worker needs to serialize the
> + * remaining changes to disk due to timeout when sending data to the
> + * parallel apply worker.
> + */
> + bool serialize_changes;
> 
> 35a.
> I wonder if the comment would be better to also mention "via shared memory".
> 
> SUGGESTION
> 
> Indicates whether the leader apply worker needs to serialize the 
> remaining changes to disk due to timeout when attempting to send data 
> to the parallel apply worker via shared memory.

Changed.

Best regards,
Hou zj

On Tue, Nov 29, 2022 at 10:18 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new version patch which addressed all comments.
>

Review comments on v53-0001*
==========================
1.
 Subscription *MySubscription = NULL;
-static bool MySubscriptionValid = false;
+bool MySubscriptionValid = false;

It seems this variable is still only used in worker.c, so why is its scope changed?

2.
/* fields valid only when processing streamed transaction */
-static bool in_streamed_transaction = false;
+bool in_streamed_transaction = false;

Is it really required to change the scope of this variable? Can we
think of exposing a macro or inline function to check it in
applyparallelworker.c?

3.
should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 {
  if (am_tablesync_worker())
  return MyLogicalRepWorker->relid == rel->localreloid;
+ else if (am_parallel_apply_worker())
+ {
+ if (rel->state != SUBREL_STATE_READY)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical replication parallel apply worker for subscription
\"%s\" will stop",

Is this check sufficient? What if rel->state is SUBREL_STATE_UNKNOWN? I
think that is possible when the refresh publication has not yet been
performed after adding a new relation to the publication. If that is true,
then won't we need to simply ignore that change and continue instead of
erroring out? Can you please test and check this case?

4.
+
+ case TRANS_PARALLEL_APPLY:
+ list_free(subxactlist);
+ subxactlist = NIL;
+
+ apply_handle_commit_internal(&commit_data);

I don't think we need to retail pfree the subxactlist as it is allocated
in TopTransactionContext and will be freed at commit/prepare. Freeing it
this way looks a bit ad hoc to me, and you need to expose this list
outside applyparallelworker.c, which doesn't seem like a good idea to
me either.

5.
+ apply_handle_commit_internal(&commit_data);
+
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_FINISHED);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ elog(DEBUG1, "finished processing the transaction finish command");

I think in this and similar DEBUG logs, we can tell the exact command
instead of writing 'finish'.

6.
apply_handle_stream_commit()
{
...
+ /*
+ * After sending the data to the parallel apply worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ pa_wait_for_xact_finish(winfo);
+
+ pgstat_report_stat(false);
+ store_flush_position(commit_data.end_lsn);
+ stop_skipping_changes();
+
+ (void) pa_free_worker(winfo, xid);
...
}

apply_handle_stream_prepare(StringInfo s)
{
+
+ /*
+ * After sending the data to the parallel apply worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ pa_wait_for_xact_finish(winfo);
+ (void) pa_free_worker(winfo, prepare_data.xid);

- /* unlink the files with serialized changes and subxact info. */
- stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
+ in_remote_transaction = false;
+
+ store_flush_position(prepare_data.end_lsn);


In both of the above functions, we should be consistent in calling
pa_free_worker() function which I think should be immediately after
pa_wait_for_xact_finish(). Is there a reason for not being consistent
here?

7.
+ res = shm_mq_receive(winfo->error_mq_handle, &nbytes, &data, true);
+
+ /*
+ * The leader will detach from the error queue and set it to NULL
+ * before preparing to stop all parallel apply workers, so we don't
+ * need to handle error messages anymore.
+ */
+ if (!winfo->error_mq_handle)
+ continue;

This check must be done before calling shm_mq_receive. So, changed it
in the attached patch.
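
To illustrate the intended ordering (sketch only, using the variables from the quoted hunk):

```c
/* Skip this worker if its error queue was already detached and set to NULL. */
if (!winfo->error_mq_handle)
	continue;

res = shm_mq_receive(winfo->error_mq_handle, &nbytes, &data, true);
```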

8.
@@ -2675,6 +3156,10 @@ store_flush_position(XLogRecPtr remote_lsn)
 {
  FlushPosition *flushpos;

+ /* Skip for parallel apply workers. */
+ if (am_parallel_apply_worker())
+ return;

It is okay for the leader apply worker to always update the flush position,
but I think the leader won't have an updated value for XactLastCommitEnd
as the local transaction is committed by the parallel apply worker.

9.
@@ -3831,11 +4366,11 @@ ApplyWorkerMain(Datum main_arg)

  ereport(DEBUG1,
  (errmsg_internal("logical replication apply worker for subscription
\"%s\" two_phase is %s",
- MySubscription->name,
- MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED
? "DISABLED" :
- MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ?
"PENDING" :
- MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ?
"ENABLED" :
- "?")));
+ MySubscription->name,
+ MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED
? "DISABLED" :
+ MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ?
"PENDING" :
+ MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ?
"ENABLED" :
+ "?")));

Is this change related to this patch?

10. What is the reason to expose ApplyErrorCallbackArg via worker_internal.h?

11. The order to declare pa_set_stream_apply_worker() in
worker_internal.h and define in applyparallelworker.c is not the same.
Similarly, please check all other functions.

12. Apart from the above, I have made a few changes in the comments
and some other cosmetic changes in the attached patch.

-- 
With Regards,
Amit Kapila.

On Tue, Nov 29, 2022 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 12. Apart from the above, I have made a few changes in the comments
> and some other cosmetic changes in the attached patch.
>

I have made some additional changes in the comments at various places.
Kindly check the attached and let me know your thoughts.

-- 
With Regards,
Amit Kapila.

On Tue, Nov 29, 2022 at 10:18 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new version patch which addressed all comments.
>

Some comments on v53-0002*
========================
1. I think testing the scenario where the shm_mq buffer is full
between the leader and parallel apply worker would require a large
amount of data and then also there is no guarantee. How about having a
developer GUC [1] force_apply_serialize which allows us to serialize
the changes and only after commit the parallel apply worker would be
allowed to apply it?

I am not sure if we can reliably test the serialization of partial
changes (like some changes have been already sent to parallel apply
worker and then serialization happens) but at least we can test the
serialization of complete xacts and their execution via parallel apply
worker.

2.
+ /*
+ * The stream lock is released when processing changes in a
+ * streaming block, so the leader needs to acquire the lock here
+ * before entering PARTIAL_SERIALIZE mode to ensure that the
+ * parallel apply worker will wait for the leader to release the
+ * stream lock.
+ */
+ if (in_streamed_transaction &&
+ action != LOGICAL_REP_MSG_STREAM_STOP)
+ {
+ pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);

This comment is not completely correct because we can even acquire the
lock for the very streaming chunk. This check will work but doesn't
appear future-proof or at least not very easy to understand though I
don't have a better suggestion at this stage. Can we think of a better
check here?

3. I have modified a few comments in v53-0002* patch and the
incremental patch for the same is attached.

[1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html

-- 
With Regards,
Amit Kapila.


RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

> 1. I think testing the scenario where the shm_mq buffer is full
> between the leader and parallel apply worker would require a large
> amount of data and then also there is no guarantee. How about having a
> developer GUC [1] force_apply_serialize which allows us to serialize
> the changes and only after commit the parallel apply worker would be
> allowed to apply it?
> 
> I am not sure if we can reliably test the serialization of partial
> changes (like some changes have been already sent to parallel apply
> worker and then serialization happens) but at least we can test the
> serialization of complete xacts and their execution via parallel apply
> worker.

I agree with adding the developer option, because the part where the LA serializes
changes and the PAs read and apply them might be complex. I have reported some
bugs around this area.

One idea: a threshold (integer) could be introduced as the developer GUC.
The LA stops sending data to the PA via shm_mq_send() and jumps to the serialization
path once it has sent more than (threshold) messages. This may make it possible to
test the partial-serialization case.
Default (-1) means no-op, and 0 means all changes must be serialized.
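
To make the idea concrete, here is a rough sketch of the check on the LA side
(the GUC and counter names are placeholders, not from any patch):

```c
/*
 * Sketch only: "stream_serialize_threshold" and "nchanges_sent" are
 * illustrative names.  -1 disables forced serialization, 0 forces
 * serialization of every change, and N switches to serialization after
 * N changes have been sent to the PA.
 */
if (stream_serialize_threshold != -1 &&
	(stream_serialize_threshold == 0 ||
	 nchanges_sent >= stream_serialize_threshold))
{
	/* Behave as if shm_mq_send() timed out and fall back to spooling. */
	winfo->serialize_changes = true;
}
else
	nchanges_sent++;
```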

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, November 29, 2022 8:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Nov 29, 2022 at 10:18 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the new version patch which addressed all comments.
> >
> 
> Review comments on v53-0001*

Thanks for the comments!
> ==========================
> 1.
>  Subscription *MySubscription = NULL;
> -static bool MySubscriptionValid = false;
> +bool MySubscriptionValid = false;
> 
> It seems still this variable is used in worker.c, so why it's scope changed?

I think it's not needed. Removed.

> 2.
> /* fields valid only when processing streamed transaction */ -static bool
> in_streamed_transaction = false;
> +bool in_streamed_transaction = false;
> 
> Is it really required to change the scope of this variable? Can we think of
> exposing a macro or inline function to check it in applyparallelworker.c?

Introduced a new function.

> 3.
> should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)  {
>   if (am_tablesync_worker())
>   return MyLogicalRepWorker->relid == rel->localreloid;
> + else if (am_parallel_apply_worker())
> + {
> + if (rel->state != SUBREL_STATE_READY)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical replication parallel apply worker for subscription
> \"%s\" will stop",
> 
> Is this check sufficient? What if the rel->state is SUBREL_STATE_UNKNOWN? I
> think that will be possible when the refresh publication has not been yet
> performed after adding a new relation to the publication. If that is true then
> won't we need to simply ignore that change and continue instead of erroring
> out? Can you please once test and check this case?

You are right. Changed to not report an ERROR for SUBREL_STATE_UNKNOWN.

> 4.
> +
> + case TRANS_PARALLEL_APPLY:
> + list_free(subxactlist);
> + subxactlist = NIL;
> +
> + apply_handle_commit_internal(&commit_data);
> 
> I don't think we need to retail pfree subxactlist as this is allocated in
> TopTransactionContext and will be freed at commit/prepare. This way freeing
> looks a bit adhoc to me and you need to expose this list outside
> applyparallelworker.c which doesn't seem like a good idea to me either.

Removed the list_free.

> 5.
> + apply_handle_commit_internal(&commit_data);
> +
> + pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_FINISHED);
> + pa_unlock_transaction(xid, AccessShareLock);
> +
> + elog(DEBUG1, "finished processing the transaction finish command");
> 
> I think in this and similar DEBUG logs, we can tell the exact command instead of
> writing 'finish'.

Changed.

> 6.
> apply_handle_stream_commit()
> {
> ...
> + /*
> + * After sending the data to the parallel apply worker, wait for
> + * that worker to finish. This is necessary to maintain commit
> + * order which avoids failures due to transaction dependencies and
> + * deadlocks.
> + */
> + pa_wait_for_xact_finish(winfo);
> +
> + pgstat_report_stat(false);
> + store_flush_position(commit_data.end_lsn);
> + stop_skipping_changes();
> +
> + (void) pa_free_worker(winfo, xid);
> ...
> }

> apply_handle_stream_prepare(StringInfo s) {
> +
> + /*
> + * After sending the data to the parallel apply worker, wait for
> + * that worker to finish. This is necessary to maintain commit
> + * order which avoids failures due to transaction dependencies and
> + * deadlocks.
> + */
> + pa_wait_for_xact_finish(winfo);
> + (void) pa_free_worker(winfo, prepare_data.xid);
> 
> - /* unlink the files with serialized changes and subxact info. */
> - stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
> + in_remote_transaction = false;
> +
> + store_flush_position(prepare_data.end_lsn);
> 
> 
> In both of the above functions, we should be consistent in calling
> pa_free_worker() function which I think should be immediately after
> pa_wait_for_xact_finish(). Is there a reason for not being consistent here?

Changed the order to make them consistent.

> 7.
> + res = shm_mq_receive(winfo->error_mq_handle, &nbytes, &data, true);
> +
> + /*
> + * The leader will detach from the error queue and set it to NULL
> + * before preparing to stop all parallel apply workers, so we don't
> + * need to handle error messages anymore.
> + */
> + if (!winfo->error_mq_handle)
> + continue;
> 
> This check must be done before calling shm_mq_receive. So, changed it in the
> attached patch.

Thanks, merged.

> 8.
> @@ -2675,6 +3156,10 @@ store_flush_position(XLogRecPtr remote_lsn)  {
>   FlushPosition *flushpos;
> 
> + /* Skip for parallel apply workers. */ if (am_parallel_apply_worker())
> + return;
> 
> It is okay to always update the flush position by leader apply worker but I think
> the leader won't have updated value for XactLastCommitEnd as the local
> transaction is committed by parallel apply worker.

I added a field in shared memory so that the parallel apply worker can pass
XactLastCommitEnd to the leader, and then the leader will store it.
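
Roughly like the following (a simplified sketch; the field name last_commit_end is illustrative):

```c
/* In the parallel apply worker, after committing (sketch only): */
SpinLockAcquire(&MyParallelShared->mutex);
MyParallelShared->last_commit_end = XactLastCommitEnd;
SpinLockRelease(&MyParallelShared->mutex);

/* In the leader, when storing the flush position (sketch only): */
XLogRecPtr	local_end;

SpinLockAcquire(&winfo->shared->mutex);
local_end = winfo->shared->last_commit_end;
SpinLockRelease(&winfo->shared->mutex);
```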

> 9.
> @@ -3831,11 +4366,11 @@ ApplyWorkerMain(Datum main_arg)
> 
>   ereport(DEBUG1,
>   (errmsg_internal("logical replication apply worker for subscription \"%s\"
> two_phase is %s",
> - MySubscription->name,
> - MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_DISABLED
> ? "DISABLED" :
> - MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_PENDING ?
> "PENDING" :
> - MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_ENABLED ?
> "ENABLED" :
> - "?")));
> + MySubscription->name,
> + MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_DISABLED
> ? "DISABLED" :
> + MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_PENDING ?
> "PENDING" :
> + MySubscription->twophasestate ==
> LOGICALREP_TWOPHASE_STATE_ENABLED ?
> "ENABLED" :
> + "?")));
> 
> Is this change related to this patch?

I think it was accidentally changed due to pgindent. Reverted.

> 10. What is the reason to expose ApplyErrorCallbackArg via worker_internal.h?

The parallel apply worker needs to set the origin name in it. I introduced another function
to set this.

> 11. The order to declare pa_set_stream_apply_worker() in worker_internal.h and
> define in applyparallelworker.c is not the same.
> Similarly, please check all other functions.

Changed.

> 12. Apart from the above, I have made a few changes in the comments and
> some other cosmetic changes in the attached patch.

Thanks, I have checked and merged them.

Attach the new version patch set.

I haven't addressed comments #1 and #2 from [1]; I need to think about them and
will handle them soon. Besides, I haven't renamed serialize_stream_start/stop and
haven't finished the word-consistency check for the comments; I will handle them
soon as well.

[1] https://www.postgresql.org/message-id/CAA4eK1LGKYUDFZ_jFPrU497wQf2HNvt5a%2BtCTpqSeWSG6kfpSA%40mail.gmail.com

Best regards,
Hou zj



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, November 30, 2022 9:41 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > Review comments on v53-0001*
> 
> Attach the new version patch set.

Sorry, there were some mistakes in the previous patch set.
Here is the correct V54 patch set. I also ran pgindent for the patch set.

Best regards,
Hou zj


Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Nov 30, 2022 at 7:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 29, 2022 at 10:18 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the new version patch which addressed all comments.
> >
>
> Some comments on v53-0002*
> ========================
> 1. I think testing the scenario where the shm_mq buffer is full
> between the leader and parallel apply worker would require a large
> amount of data and then also there is no guarantee. How about having a
> developer GUC [1] force_apply_serialize which allows us to serialize
> the changes and only after commit the parallel apply worker would be
> allowed to apply it?

+1

The code coverage report shows that we don't cover the partial
serialization codes. This GUC would improve the code coverage.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Nov 30, 2022 at 10:51 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, November 30, 2022 9:41 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > > Review comments on v53-0001*
> >
> > Attach the new version patch set.
>
> Sorry, there were some mistakes in the previous patch set.
> Here is the correct V54 patch set. I also ran pgindent for the patch set.
>

Thank you for updating the patches. Here are random review comments
for 0001 and 0002 patches.

ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical replication parallel apply worker
exited abnormally"),
                 errcontext("%s", edata.context)));
and

ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical replication parallel apply worker
exited because of subscription information change")));

I'm not sure ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is appropriate
here. Given that parallel apply worker has already reported the error
message with the error code, I think we don't need to set the
errorcode for the logs from the leader process.

Also, I'm not sure the term "exited abnormally" is appropriate since
we use it when the server crashes for example. I think ERRORs reported
here don't mean that in general.

---
if (am_parallel_apply_worker() && on_subinfo_change)
{
    /*
     * If a parallel apply worker exits due to the subscription
     * information change, we notify the leader apply worker so that the
     * leader can report more meaningful message in time and restart the
     * logical replication.
     */
    pq_putmessage('X', NULL, 0);
}

and

ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical replication parallel apply worker
exited because of subscription information change")));

Do we really need an additional message in case of 'X'? When we call
apply_worker_clean_exit with on_subinfo_change = true, we have
reported the error message such as:

ereport(LOG,
        (errmsg("logical replication parallel apply worker for
subscription \"%s\" will stop because of a parameter change",
                MySubscription->name)));

I think that reporting a similar message from the leader might not be
meaningful for users.

---
-                if (options->proto.logical.streaming &&
-                        PQserverVersion(conn->streamConn) >= 140000)
-                        appendStringInfoString(&cmd, ", streaming 'on'");
+                if (options->proto.logical.streaming_str)
+                        appendStringInfo(&cmd, ", streaming '%s'",
+                                         options->proto.logical.streaming_str);

and

+        /*
+         * Assign the appropriate option value for streaming option according to
+         * the 'streaming' mode and the publisher's ability to support that mode.
+         */
+        if (server_version >= 160000 &&
+                MySubscription->stream == SUBSTREAM_PARALLEL)
+        {
+                options.proto.logical.streaming_str = pstrdup("parallel");
+                MyLogicalRepWorker->parallel_apply = true;
+        }
+        else if (server_version >= 140000 &&
+                         MySubscription->stream != SUBSTREAM_OFF)
+        {
+                options.proto.logical.streaming_str = pstrdup("on");
+                MyLogicalRepWorker->parallel_apply = false;
+        }
+        else
+        {
+                options.proto.logical.streaming_str = NULL;
+                MyLogicalRepWorker->parallel_apply = false;
+        }

This change moves the code of adjustment of the streaming option based
on the publisher server version from libpqwalreceiver.c to worker.c.
On the other hand, the similar logic for other parameters such as
"two_phase" and "origin" are still done in libpqwalreceiver.c. How
about passing MySubscription->stream via WalRcvStreamOptions and
constructing a streaming option string in libpqrcv_startstreaming()?
In ApplyWorkerMain(), we just need to set
MyLogicalRepWorker->parallel_apply = true if (server_version >= 160000
&& MySubscription->stream == SUBSTREAM_PARALLEL). We won't need
pstrdup for "parallel" and "on", and it's more consistent with other
parameters.
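
For example, something like this in libpqrcv_startstreaming() (an untested
sketch, assuming the stream mode is carried in WalRcvStreamOptions):

```c
/* Sketch only: the streaming field here would carry the SUBSTREAM_* value. */
if (options->proto.logical.streaming == SUBSTREAM_PARALLEL &&
	PQserverVersion(conn->streamConn) >= 160000)
	appendStringInfoString(&cmd, ", streaming 'parallel'");
else if (options->proto.logical.streaming != SUBSTREAM_OFF &&
		 PQserverVersion(conn->streamConn) >= 140000)
	appendStringInfoString(&cmd, ", streaming 'on'");
```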

---
+ * We maintain a worker pool to avoid restarting workers for each streaming
+ * transaction. We maintain each worker's information in the

Do we need to describe the pool in the doc?

---
+ * in AccessExclusive mode at transaction finish commands (STREAM_COMMIT and
+ * STREAM_PREAPRE) and release it immediately.

typo, s/STREAM_PREAPRE/STREAM_PREPARE/

---
+/* Parallel apply workers hash table (initialized on first use). */
+static HTAB *ParallelApplyWorkersHash = NULL;
+
+/*
+ * A list to maintain the active parallel apply workers. The information for
+ * the new worker is added to the list after successfully launching it. The
+ * list entry is removed if there are already enough workers in the worker
+ * pool either at the end of the transaction or while trying to find a free
+ * worker for applying the transaction. For more information about the worker
+ * pool, see comments atop this file.
+ */
+static List *ParallelApplyWorkersList = NIL;

The names ParallelApplyWorkersHash and ParallelApplyWorkersList are very
similar but the usages are completely different. Probably we can find
better names such as ParallelApplyTxnHash and ParallelApplyWorkerPool.
And probably we can add more comments for ParallelApplyWorkersHash.

---
if (winfo->serialize_changes ||
    napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
{
    int         slot_no;
    uint16      generation;

    SpinLockAcquire(&winfo->shared->mutex);
    generation = winfo->shared->logicalrep_worker_generation;
    slot_no = winfo->shared->logicalrep_worker_slot_no;
    SpinLockRelease(&winfo->shared->mutex);

    logicalrep_pa_worker_stop(slot_no, generation);

    pa_free_worker_info(winfo);

    return true;
}

/* Unlink any files that were needed to serialize partial changes. */
if (winfo->serialize_changes)
    stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);

If winfo->serialize_changes is true, we return true in the first if
statement. So stream_cleanup_files in the second if statement is never
executed.

---
+        /*
+         * First, try to get a parallel apply worker from the pool,
if available.
+         * Otherwise, try to start a new parallel apply worker.
+         */
+        winfo = pa_get_available_worker();
+        if (!winfo)
+        {
+                winfo = pa_init_and_launch_worker();
+                if (!winfo)
+                        return;
+        }

I think we don't necessarily need to separate two functions for
getting a worker from the pool and launching a new worker. It seems to
reduce the readability. Instead, I think that we can have one function
that returns winfo if there is a free worker in the worker pool or it
launches a worker. That way, we can simply do like:

winfo = pg_launch_parallel_worker()
if (!winfo)
    return;
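
where the new function (say, pa_launch_parallel_worker) could look something
like this, just as a sketch that keeps the two existing helpers as internal details:

```c
static ParallelApplyWorkerInfo *
pa_launch_parallel_worker(void)
{
	ParallelApplyWorkerInfo *winfo;

	/* Prefer a free worker from the pool, if there is one. */
	winfo = pa_get_available_worker();
	if (winfo)
		return winfo;

	/* Otherwise, try to start a new parallel apply worker (may return NULL). */
	return pa_init_and_launch_worker();
}
```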

---
+        /* Setup replication origin tracking. */
+        StartTransactionCommand();
+        ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
+                                           originname, sizeof(originname));
+        originid = replorigin_by_name(originname, true);
+        if (!OidIsValid(originid))
+                originid = replorigin_create(originname);

This code looks like it allows parallel workers to use different origins in
cases where the origin doesn't exist, but is that okay? Shouldn't we
pass missing_ok = false in this case?

---
cfbot seems to fail:

https://cirrus-ci.com/task/6264595342426112

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 1, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 30, 2022 at 7:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Nov 29, 2022 at 10:18 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new version patch which addressed all comments.
> > >
> >
> > Some comments on v53-0002*
> > ========================
> > 1. I think testing the scenario where the shm_mq buffer is full
> > between the leader and parallel apply worker would require a large
> > amount of data and then also there is no guarantee. How about having a
> > developer GUC [1] force_apply_serialize which allows us to serialize
> > the changes and only after commit the parallel apply worker would be
> > allowed to apply it?
>
> +1
>
> The code coverage report shows that we don't cover the partial
> serialization codes. This GUC would improve the code coverage.
>

Shall we keep it as a boolean or an integer? Keeping it as an integer
as suggested by Kuroda-San [1] would have an added advantage that we
can easily test the cases where serialization would be triggered after
sending some changes.

[1] -
https://www.postgresql.org/message-id/TYAPR01MB5866160DE81FA2D88B8F22DEF5159%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, December 1, 2022 3:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Wed, Nov 30, 2022 at 10:51 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, November 30, 2022 9:41 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > > > Review comments on v53-0001*
> > >
> > > Attach the new version patch set.
> >
> > Sorry, there were some mistakes in the previous patch set.
> > Here is the correct V54 patch set. I also ran pgindent for the patch set.
> >
> 
> Thank you for updating the patches. Here are random review comments for
> 0001 and 0002 patches.

Thanks for the comments!

> 
> ereport(ERROR,
>                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                  errmsg("logical replication parallel apply worker exited
> abnormally"),
>                  errcontext("%s", edata.context))); and
> 
> ereport(ERROR,
>                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                  errmsg("logical replication parallel apply worker exited
> because of subscription information change")));
> 
> I'm not sure ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is appropriate
> here. Given that parallel apply worker has already reported the error message
> with the error code, I think we don't need to set the errorcode for the logs
> from the leader process.
> 
> Also, I'm not sure the term "exited abnormally" is appropriate since we use it
> when the server crashes for example. I think ERRORs reported here don't mean
> that in general.

How about reporting "xxx worker exited due to error"?

> ---
> if (am_parallel_apply_worker() && on_subinfo_change) {
>     /*
>      * If a parallel apply worker exits due to the subscription
>      * information change, we notify the leader apply worker so that the
>      * leader can report more meaningful message in time and restart the
>      * logical replication.
>      */
>     pq_putmessage('X', NULL, 0);
> }
> 
> and
> 
> ereport(ERROR,
>                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                  errmsg("logical replication parallel apply worker exited
> because of subscription information change")));
> 
> Do we really need an additional message in case of 'X'? When we call
> apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> error message such as:
> 
> ereport(LOG,
>         (errmsg("logical replication parallel apply worker for subscription
> \"%s\" will stop because of a parameter change",
>                 MySubscription->name)));
> 
> I think that reporting a similar message from the leader might not be
> meaningful for users.

The intention is to let the leader report a more meaningful message if a worker
exited due to a subscription information change. Otherwise, the leader is likely
to report an error like "lost connection ... to parallel apply worker" when trying
to send data via shared memory after the worker has exited. What do you think?

> ---
> -                if (options->proto.logical.streaming &&
> -                        PQserverVersion(conn->streamConn) >= 140000)
> -                        appendStringInfoString(&cmd, ", streaming 'on'");
> +                if (options->proto.logical.streaming_str)
> +                        appendStringInfo(&cmd, ", streaming '%s'",
> +
> options->proto.logical.streaming_str);
> 
> and
> 
> +        /*
> +         * Assign the appropriate option value for streaming option
> according to
> +         * the 'streaming' mode and the publisher's ability to
> support that mode.
> +         */
> +        if (server_version >= 160000 &&
> +                MySubscription->stream == SUBSTREAM_PARALLEL)
> +        {
> +                options.proto.logical.streaming_str = pstrdup("parallel");
> +                MyLogicalRepWorker->parallel_apply = true;
> +        }
> +        else if (server_version >= 140000 &&
> +                         MySubscription->stream != SUBSTREAM_OFF)
> +        {
> +                options.proto.logical.streaming_str = pstrdup("on");
> +                MyLogicalRepWorker->parallel_apply = false;
> +        }
> +        else
> +        {
> +                options.proto.logical.streaming_str = NULL;
> +                MyLogicalRepWorker->parallel_apply = false;
> +        }
> 
> This change moves the code of adjustment of the streaming option based on
> the publisher server version from libpqwalreceiver.c to worker.c.
> On the other hand, the similar logic for other parameters such as "two_phase"
> and "origin" are still done in libpqwalreceiver.c. How about passing
> MySubscription->stream via WalRcvStreamOptions and constructing a
> streaming option string in libpqrcv_startstreaming()?
> In ApplyWorkerMain(), we just need to set
> MyLogicalRepWorker->parallel_apply = true if (server_version >= 160000
> && MySubscription->stream == SUBSTREAM_PARALLEL). We won't need
> pstrdup for "parallel" and "on", and it's more consistent with other parameters.

Thanks for the suggestion. I thought about the same idea before, but it seems
we would need to include "pg_subscription.h" in libpqwalreceiver.c.
libpqwalreceiver.c looks like a common place, so I am not sure it is better
to expose the details of the streaming option to it.

> ---
> + * We maintain a worker pool to avoid restarting workers for each
> + streaming
> + * transaction. We maintain each worker's information in the
> 
> Do we need to describe the pool in the doc?

I think the worker pool is kind of internal information.
Maybe we can add it later if we receive some feedback about this
after pushing the main patch.

> ---
> + * in AccessExclusive mode at transaction finish commands
> + (STREAM_COMMIT and
> + * STREAM_PREAPRE) and release it immediately.
> 
> typo, s/STREAM_PREAPRE/STREAM_PREPARE/

Will change.

> ---
> +/* Parallel apply workers hash table (initialized on first use). */
> +static HTAB *ParallelApplyWorkersHash = NULL;
> +
> +/*
> + * A list to maintain the active parallel apply workers. The
> +information for
> + * the new worker is added to the list after successfully launching it.
> +The
> + * list entry is removed if there are already enough workers in the
> +worker
> + * pool either at the end of the transaction or while trying to find a
> +free
> + * worker for applying the transaction. For more information about the
> +worker
> + * pool, see comments atop this file.
> + */
> +static List *ParallelApplyWorkersList = NIL;
> 
> The names ParallelApplyWorkersHash and ParallelWorkersList are very similar
> but the usages are completely different. Probably we can find better names
> such as ParallelApplyTxnHash and ParallelApplyWorkerPool.
> And probably we can add more comments for ParallelApplyWorkersHash.

Will change.

> ---
> if (winfo->serialize_changes ||
>     napplyworkers > (max_parallel_apply_workers_per_subscription / 2)) {
>     int         slot_no;
>     uint16      generation;
> 
>     SpinLockAcquire(&winfo->shared->mutex);
>     generation = winfo->shared->logicalrep_worker_generation;
>     slot_no = winfo->shared->logicalrep_worker_slot_no;
>     SpinLockRelease(&winfo->shared->mutex);
> 
>     logicalrep_pa_worker_stop(slot_no, generation);
> 
>     pa_free_worker_info(winfo);
> 
>     return true;
> }
> 
> /* Unlink any files that were needed to serialize partial changes. */
> if (winfo->serialize_changes)
>     stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);
> 
> If winfo->serialize_changes is true, we return true in the first if statement. So
> stream_cleanup_files in the second if statement is never executed.

pa_free_worker_info() will also clean up the fileset. But I think I can move that
stream_cleanup_files() call before the "... napplyworkers >
(max_parallel_apply_workers_per_subscription / 2)" check so that it is
clearer.

> ---
> +        /*
> +         * First, try to get a parallel apply worker from the pool,
> if available.
> +         * Otherwise, try to start a new parallel apply worker.
> +         */
> +        winfo = pa_get_available_worker();
> +        if (!winfo)
> +        {
> +                winfo = pa_init_and_launch_worker();
> +                if (!winfo)
> +                        return;
> +        }
> 
> I think we don't necessarily need to separate two functions for getting a worker
> from the pool and launching a new worker. It seems to reduce the readability.
> Instead, I think that we can have one function that returns winfo if there is a free
> worker in the worker pool or it launches a worker. That way, we can simply do
> like:
> 
> winfo = pg_launch_parallel_worker()
> if (!winfo)
>     return;

Will change

> ---
> +        /* Setup replication origin tracking. */
> +        StartTransactionCommand();
> +        ReplicationOriginNameForLogicalRep(MySubscription->oid,
> + InvalidOid,
> +
>      originname, sizeof(originname));
> +        originid = replorigin_by_name(originname, true);
> +        if (!OidIsValid(originid))
> +                originid = replorigin_create(originname);
> 
> This code looks to allow parallel workers to use different origins in cases where
> the origin doesn't exist, but is that okay? Shouldn't we pass missing_ok = false
> in this case?
>

Will change

> ---
> cfbot seems to fail:
> 
> https://cirrus-ci.com/task/6264595342426112

Thanks for reporting; it's due to a test case problem. I will fix that test soon.

Best regards,
Hou zj

On Wed, Nov 30, 2022 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 2.
> + /*
> + * The stream lock is released when processing changes in a
> + * streaming block, so the leader needs to acquire the lock here
> + * before entering PARTIAL_SERIALIZE mode to ensure that the
> + * parallel apply worker will wait for the leader to release the
> + * stream lock.
> + */
> + if (in_streamed_transaction &&
> + action != LOGICAL_REP_MSG_STREAM_STOP)
> + {
> + pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);
>
> This comment is not completely correct because we can even acquire the
> lock for the very streaming chunk. This check will work but doesn't
> appear future-proof or at least not very easy to understand though I
> don't have a better suggestion at this stage. Can we think of a better
> check here?
>

One idea is that we acquire this lock every time, and callers like
stream_commit are responsible for releasing it. Also, we can handle the
closing of the stream file in the respective callers. I think that will make
this part of the patch easier to follow.

Some other comments:
=====================
1. The handling of the buffile inside pa_stream_abort() looks a bit ugly to
me. I think you primarily required it because the buffile opened by the
parallel apply worker is in CurrentResourceOwner. Can we think of
having a new resource owner for applying spooled messages? I think that
would avoid the need for special-purpose code to handle buffiles
in the parallel apply worker.

2.
@@ -564,6 +571,7 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)
  TransactionId current_xid;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg;

  apply_action = get_transaction_apply_action(stream_xid, &winfo);

@@ -573,6 +581,8 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)

  Assert(TransactionIdIsValid(stream_xid));

+ original_msg = *s;
+
  /*
  * We should have received XID of the subxact as the first part of the
  * message, so extract it.
@@ -596,10 +606,14 @@ handle_streamed_transaction(LogicalRepMsgType
action, StringInfo s)
  stream_write_change(action, s);
  return true;

+ case TRANS_LEADER_PARTIAL_SERIALIZE:
  case TRANS_LEADER_SEND_TO_PARALLEL:
  Assert(winfo);

- pa_send_data(winfo, s->len, s->data);
+ if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+ pa_send_data(winfo, s->len, s->data);
+ else
+ stream_write_change(action, &original_msg);

Please add a comment to specify the reason for remembering the original string.

3.
@@ -1797,8 +1907,8 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
  changes_filename(path, MyLogicalRepWorker->subid, xid);
  elog(DEBUG1, "replaying changes from file \"%s\"", path);

- fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path, O_RDONLY,
- false);
+ stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);
+ stream_xid = xid;

Why do we need stream_xid here? I think we can avoid having a global
stream_fd if comment #1 is feasible.

4.
+ * TRANS_LEADER_APPLY:
+ * The action means that we

/The/This. Please make a similar change for other actions.

5. Apart from the above, please find a few changes to the comments for
0001 and 0002 patches in the attached patches.


-- 
With Regards,
Amit Kapila.

On Fri, Dec 2, 2022 at 2:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 3. pa_setup_dsm
>
> +/*
> + * Set up a dynamic shared memory segment.
> + *
> + * We set up a control region that contains a fixed-size worker info
> + * (ParallelApplyWorkerShared), a message queue, and an error queue.
> + *
> + * Returns true on success, false on failure.
> + */
> +static bool
> +pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
>
> IMO that's confusing to say "fixed-sized worker info" when it's
> referring to the ParallelApplyWorkerShared structure and not the other
> ParallelApplyWorkerInfo.
>
> Might be better to say:
>
> "a fixed-size worker info (ParallelApplyWorkerShared)" -> "a
> fixed-size struct (ParallelApplyWorkerShared)"
>
> ~~~
>

I find the existing wording better than what you are proposing. We can
remove the structure name if you think that is better, but IMO the current
wording is good.

>
> 6. pa_free_worker_info
>
> + /*
> + * Ensure this worker information won't be reused during worker
> + * allocation.
> + */
> + ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,
> +    winfo);
>
> SUGGESTION 1
> Removing from the worker pool ensures this information won't be reused
> during worker allocation.
>
> SUGGESTION 2 (more simply)
> Remove from the worker pool.
>

+1 for the second suggestion.

> ~~~
>
> 7. HandleParallelApplyMessage
>
> + /*
> + * The actual error must have been reported by the parallel
> + * apply worker.
> + */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical replication parallel apply worker exited abnormally"),
> + errcontext("%s", edata.context)));
>
> Maybe it's better to remove the comment, but replace it with an
> errhint that tells the user "For the cause of this error see the error
> logged by the logical replication parallel apply worker."
>

I am not sure such an errhint is a good idea; anyway, I think both
errors will be adjacent in the logs unless there is some other
error in that short span.
> ~~~
>
> 17. apply_handle_stream_stop
>
> + case TRANS_PARALLEL_APPLY:
> + elog(DEBUG1, "applied %u changes in the streaming chunk",
> + parallel_stream_nchanges);
> +
> + /*
> + * By the time parallel apply worker is processing the changes in
> + * the current streaming block, the leader apply worker may have
> + * sent multiple streaming blocks. This can lead to parallel apply
> + * worker start waiting even when there are more chunk of streams
> + * in the queue. So, try to lock only if there is no message left
> + * in the queue. See Locking Considerations atop
> + * applyparallelworker.c.
> + */
>
> SUGGESTION (minor rewording)
>
> By the time the parallel apply worker is processing the changes in the
> current streaming block, the leader apply worker may have sent
> multiple streaming blocks. To prevent the parallel apply worker from
> waiting unnecessarily, try to lock only if there is no message left in the
> queue. See Locking Considerations atop applyparallelworker.c.
>

I have proposed the additional line (This can lead to parallel apply
worker start waiting even when there are more chunk of streams in the
queue.) because it took me some time to understand this particular
scenario.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for making the patch. Followings are my comments for v54-0003 and 0004.

0003

pa_free_worker()

+       /* Unlink any files that were needed to serialize partial changes. */
+       if (winfo->serialize_changes)
+               stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);
+

I think this part is not needed, because the LA cannot reach here if winfo->serialize_changes is true. Moreover,
stream_cleanup_files() is done in pa_free_worker_info().
 

LogicalParallelApplyLoop()

The parallel apply worker wakes up every 0.1s even if we are in PARTIAL_SERIALIZE mode. Do you have an idea to reduce
that?

```
+                       pa_spooled_messages();
```

Comments are needed here, like "Changes may be serialize...".

pa_stream_abort()

```
+                               /*
+                                * Reopen the file and set the file position to the saved
+                                * position.
+                                */
+                               if (reopen_stream_fd)
+                               {
+                                       char            path[MAXPGPATH];
+
+                                       changes_filename(path, MyLogicalRepWorker->subid, xid);
+                                       stream_fd = BufFileOpenFileSet(&MyParallelShared->fileset,
+                                                                                                  path, O_RDONLY,
false);
+                                       BufFileSeek(stream_fd, fileno, offset, SEEK_SET);
+                               }
```

MyParallelShared->serialize_changes may be used instead of reopen_stream_fd.


worker.c

```
-#include "storage/buffile.h"
```

I think this include should not be removed.


handle_streamed_transaction()

```
+                       if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+                               pa_send_data(winfo, s->len, s->data);
+                       else
+                               stream_write_change(action, &original_msg);
```

Comments are needed here; 0001 had them but they were removed in 0002.
There are some similar lines.


```
+                       /*
+                        * It is possible that while sending this change to parallel apply
+                        * worker we need to switch to serialize mode.
+                        */
+                       if (winfo->serialize_changes)
+                               pa_set_fileset_state(winfo->shared, FS_READY);
```

There are three identical parts in the code; can we combine them into a common part?
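
For example, the three places could share a small helper like the following
(illustrative sketch only; the function name is made up):

```c
/* Sketch: publish the fileset if we had to switch to serializing changes. */
static void
pa_set_ready_if_serialized(ParallelApplyWorkerInfo *winfo)
{
	if (winfo->serialize_changes)
		pa_set_fileset_state(winfo->shared, FS_READY);
}
```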

apply_spooled_messages()

```
+               /*
+                * Break the loop if the parallel apply worker has finished applying
+                * the transaction. The parallel apply worker should have closed the
+                * file before committing.
+                */
+               if (am_parallel_apply_worker() &&
+                       MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
+                       goto done;
```

I think pfree(buffer) and pfree(s2.data) should not be skipped.
And this part should be below "nchanges++;".


0004

set_subscription_retry()

```
+       LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
+                                        AccessShareLock);
+
```

I think AccessExclusiveLock should be acquired instead of AccessShareLock.
In AlterSubscription(), LockSharedObject(AccessExclusiveLock) seems to be used.
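
That is, something like the following (just the quoted call with the lock
level changed):

```
LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
				 AccessExclusiveLock);
```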

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Fri, Dec 2, 2022 at 4:57 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> handle_streamed_transaction()
>
> ```
> +                       if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
> +                               pa_send_data(winfo, s->len, s->data);
> +                       else
> +                               stream_write_change(action, &original_msg);
> ```
>
> Comments are needed here; 0001 had them but they were removed in 0002.
> There are some similar lines.
>

I have suggested removing them because they were just saying what is
evident from the code and didn't seem to add any value. I would
say they were rather confusing.

-- 
With Regards,
Amit Kapila.



---------- Forwarded message ---------
From: Peter Smith <smithpb2250@gmail.com>
Date: Sat, Dec 3, 2022 at 8:03 AM
Subject: Re: Perform streaming logical transactions by background
workers and parallel apply
To: Amit Kapila <amit.kapila16@gmail.com>


On Fri, Dec 2, 2022 at 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 2, 2022 at 2:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > 3. pa_setup_dsm
> >
> > +/*
> > + * Set up a dynamic shared memory segment.
> > + *
> > + * We set up a control region that contains a fixed-size worker info
> > + * (ParallelApplyWorkerShared), a message queue, and an error queue.
> > + *
> > + * Returns true on success, false on failure.
> > + */
> > +static bool
> > +pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
> >
> > IMO that's confusing to say "fixed-sized worker info" when it's
> > referring to the ParallelApplyWorkerShared structure and not the other
> > ParallelApplyWorkerInfo.
> >
> > Might be better to say:
> >
> > "a fixed-size worker info (ParallelApplyWorkerShared)" -> "a
> > fixed-size struct (ParallelApplyWorkerShared)"
> >
> > ~~~
> >
>
> I find the existing wording better than what you are proposing. We can
> remove the structure name if you think that is better but IMO, current
> wording is good.
>

Including the structure name was helpful, but "worker info" made me
wrongly think it was talking about ParallelApplyWorkerInfo (e.g.
"worker info" was too much like WorkerInfo). So any different way to
say "worker info" might avoid that confusion.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



(Resending this because somehow my previous post did not appear in the
mail archives)

---------- Forwarded message ---------
From: Peter Smith <smithpb2250@gmail.com>
Date: Fri, Dec 2, 2022 at 7:59 PM
Subject: Re: Perform streaming logical transactions by background
workers and parallel apply
To: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Masahiko Sawada
<sawada.mshk@gmail.com>, wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com>, Dilip Kumar <dilipbalaut@gmail.com>,
shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com>, PostgreSQL Hackers
<pgsql-hackers@lists.postgresql.org>


Here are my review comments for patch v54-0001.

======

FILE: .../replication/logical/applyparallelworker.c

1. File header comment

1a.

+ * This file contains the code to launch, set up, and teardown parallel apply
+ * worker which receives the changes from the leader worker and invokes
+ * routines to apply those on the subscriber database.

"parallel apply worker" -> "a parallel apply worker"

~

1b.

+ *
+ * This file contains routines that are intended to support setting up, using
+ * and tearing down a ParallelApplyWorkerInfo which is required to communicate
+ * among leader and parallel apply workers.

"that are intended to support" -> "for"

"required to communicate among leader and parallel apply workers." ->
"required so the leader worker and parallel apply workers can
communicate with each other."

~

1c.

+ *
+ * The parallel apply workers are assigned (if available) as soon as xact's
+ * first stream is received for subscriptions that have set their 'streaming'
+ * option as parallel. The leader apply worker will send changes to this new
+ * worker via shared memory. We keep this worker assigned till the transaction
+ * commit is received and also wait for the worker to finish at commit. This
+ * preserves commit ordering and avoid file I/O in most cases, although we
+ * still need to spill to a file if there is no worker available. See comments
+ * atop logical/worker to know more about streamed xacts whose changes are
+ * spilled to disk. It is important to maintain commit order to avoid failures
+ * due to (a) transaction dependencies, say if we insert a row in the first
+ * transaction and update it in the second transaction on publisher then
+ * allowing the subscriber to apply both in parallel can lead to failure in the
+ * update. (b) deadlocks, allowing transactions that update the same set of
+ * rows/tables in the opposite order to be applied in parallel can lead to
+ * deadlocks.

"due to (a)" -> "due to: "

"(a) transaction dependencies, " -> "(a) transaction dependencies - "

". (b) deadlocks, " => "; (b) deadlocks - "

~

1d.

+ *
+ * We maintain a worker pool to avoid restarting workers for each streaming
+ * transaction. We maintain each worker's information in the
+ * ParallelApplyWorkersList. After successfully launching a new worker, its
+ * information is added to the ParallelApplyWorkersList. Once the worker
+ * finishes applying the transaction, we mark it available for re-use. Now,
+ * before starting a new worker to apply the streaming transaction, we check
+ * the list for any available worker. Note that we maintain a maximum of half
+ * the max_parallel_apply_workers_per_subscription workers in the pool and
+ * after that, we simply exit the worker after applying the transaction.
+ *

"We maintain a worker pool" -> "A worker pool is used"

"We maintain each worker's information" -> "We maintain each worker's
information (ParallelApplyWorkerInfo)"

"we mark it available for re-use" -> "it is marked as available for re-use"

"Note that we maintain a maximum of half" -> "Note that we retain a
maximum of half"

~

1e.

+ * XXX This worker pool threshold is a bit arbitrary and we can provide a GUC
+ * variable for this in the future if required.

"a bit arbitrary" -> "arbitrary"

~

1f.

+ *
+ * The leader apply worker will create a separate dynamic shared memory segment
+ * when each parallel apply worker starts. The reason for this design is that
+ * we cannot count how many workers will be started. It may be possible to
+ * allocate enough shared memory in one segment based on the maximum number of
+ * parallel apply workers (max_parallel_apply_workers_per_subscription), but
+ * this would waste memory if no process is actually started.
+ *

"we cannot count how many workers will be started." -> "we cannot
predict how many workers will be needed."

~

1g.

+ * The dynamic shared memory segment will contain (a) a shm_mq that is used to
+ * send changes in the transaction from leader apply worker to parallel apply
+ * worker (b) another shm_mq that is used to send errors (and other messages
+ * reported via elog/ereport) from the parallel apply worker to leader apply
+ * worker (c) necessary information to be shared among parallel apply workers
+ * and leader apply worker (i.e. members of ParallelApplyWorkerShared).

"will contain (a)" => "contains: (a)"

"worker (b)" -> "worker; (b)

"worker (c)" -> "worker; (c)"

"and leader apply worker" -> "and the leader apply worker"

~

1h.

+ *
+ * Locking Considerations
+ * ----------------------
+ * Since the database structure (schema of subscription tables, constraints,
+ * etc.) of the publisher and subscriber could be different, applying
+ * transactions in parallel mode on the subscriber side can cause some
+ * deadlocks that do not occur on the publisher side which is expected and can
+ * happen even without parallel mode. In order to detect the deadlocks among
+ * leader and parallel apply workers, we need to ensure that we wait using lmgr
+ * locks, otherwise, such deadlocks won't be detected. The other approach was
+ * to not allow parallelism when the schema of tables is different between the
+ * publisher and subscriber but that would be too restrictive and would require
+ * the publisher to send much more information than it is currently sending.
+ *

"side which is expected and can happen even without parallel mode." =>
"side. This can happen even without parallel mode."

", otherwise, such deadlocks won't be detected." -> remove this
because the beginning of the sentence says the same thing.

"The other approach was to not allow" -> "An alternative approach
could be to not allow"

~

1i.

+ *
+ * 4) Lock types
+ *
+ * Both the stream lock and the transaction lock mentioned above are
+ * session-level locks because both locks could be acquired outside the
+ * transaction, and the stream lock in the leader need to persist across
+ * transaction boundaries i.e. until the end of the streaming transaction.
+ *-------------------------------------------------------------------------
+ */

"need to persist" -> "needs to persist"

~~~

2. ParallelApplyWorkersList

+/*
+ * A list to maintain the active parallel apply workers. The information for
+ * the new worker is added to the list after successfully launching it. The
+ * list entry is removed if there are already enough workers in the worker
+ * pool either at the end of the transaction or while trying to find a free
+ * worker for applying the transaction. For more information about the worker
+ * pool, see comments atop this file.
+ */
+static List *ParallelApplyWorkersList = NIL;

"A list to maintain the active parallel apply workers." -> "A list
(pool) of active parallel apply workers."

~~~

3. pa_setup_dsm

+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a fixed-size worker info
+ * (ParallelApplyWorkerShared), a message queue, and an error queue.
+ *
+ * Returns true on success, false on failure.
+ */
+static bool
+pa_setup_dsm(ParallelApplyWorkerInfo *winfo)

IMO that's confusing to say "fixed-sized worker info" when it's
referring to the ParallelApplyWorkerShared structure and not the other
ParallelApplyWorkerInfo.

Might be better to say:

"a fixed-size worker info (ParallelApplyWorkerShared)" -> "a
fixed-size struct (ParallelApplyWorkerShared)"

~~~

4. pa_init_and_launch_worker

+ /*
+ * The worker info can be used for the entire duration of the worker so
+ * create it in a permanent context.
+ */
+ oldcontext = MemoryContextSwitchTo(ApplyContext);

SUGGESTION
The worker info can be used for the lifetime of the worker process, so
create it in a permanent context.

~~~

5. pa_allocate_worker

+ /*
+ * First, try to get a parallel apply worker from the pool, if available.
+ * Otherwise, try to start a new parallel apply worker.
+ */
+ winfo = pa_get_available_worker();
+ if (!winfo)
+ {
+ winfo = pa_init_and_launch_worker();
+ if (!winfo)
+ return;
+ }

SUGGESTION
Try to get a parallel apply worker from the pool. If none is available
then start a new one.

~~~

6. pa_free_worker_info

+ /*
+ * Ensure this worker information won't be reused during worker
+ * allocation.
+ */
+ ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,
+    winfo);

SUGGESTION 1
Removing from the worker pool ensures this information won't be reused
during worker allocation.

SUGGESTION 2 (more simply)
Remove from the worker pool.

~~~

7. HandleParallelApplyMessage

+ /*
+ * The actual error must have been reported by the parallel
+ * apply worker.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical replication parallel apply worker exited abnormally"),
+ errcontext("%s", edata.context)));

Maybe it's better to remove the comment, but replace it with an
errhint that tells the user "For the cause of this error see the error
logged by the logical replication parallel apply worker."
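
i.e. something like this (just rendering the suggestion concretely; the
exact wording could of course differ):

```
ereport(ERROR,
		(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
		 errmsg("logical replication parallel apply worker exited abnormally"),
		 errhint("For the cause of this error, see the error logged by the logical replication parallel apply worker."),
		 errcontext("%s", edata.context)));
```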

~

8.

+ case 'X':
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical replication parallel apply worker exited because of
subscription information change")));
+ break; /* Silence compiler warning. */
+ default:

Add a blank line before the default:

~

9.

+ /*
+ * Don't need to do anything about NoticeResponse and
+ * NotifyResponse as the logical replication worker doesn't need
+ * to send messages to the client.
+ */
+ case 'N':
+ case 'A':
+ break;
+
+ /*
+ * Restart replication if a parallel apply worker exited because
+ * of subscription information change.
+ */
+ case 'X':


IMO the comments describing the logic to take for each case should be
*inside* the case. The comment above (if any) should only say what the
message type means.

SUGGESTION

/* Notification, NotifyResponse. */
case 'N':
case 'A':
/*
* Don't need to do anything about these message types as the logical replication
* worker doesn't need to send messages to the client.
*/
break;

/* Parallel apply worker exited because of subscription information change. */
case 'X':
/* Restart replication */

~~~

10. pa_send_data

+ /*
+ * If the attempt to send data via shared memory times out, we restart
+ * the logical replication to prevent possible deadlocks with another
+ * parallel apply worker. Refer to the comments atop
+ * applyparallelworker.c for details.
+ */
+ if (startTime == 0)
+ startTime = GetCurrentTimestamp();

Sometimes (like here) you say "Refer to the comments atop
applyparallelworker.c". In other places, the comments say "Refer to
the comments atop this file.". IMO the wording should be consistent
everywhere.

~~~

11. pa_set_stream_apply_worker

+/*
+ * Set the worker that required for applying the current streaming transaction.
+ */
+void
+pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
+{
+ stream_apply_worker = winfo;
+}

"the worker that required for" ?? English ??

~~~

12. pa_clean_subtrans

+/* Reset the list that maintains subtransactions. */
+void
+pa_clean_subtrans(void)
+{
+ subxactlist = NIL;
+}

Maybe a more informative function name would be pa_reset_subxactlist()?

~~~

13. pa_stream_abort

+ subxactlist = NIL;

Since you created a new function pa_clean_subtrans which does exactly
this same NIL assignment, I was not expecting to see this global being
explicitly set like this in other code -- it's confusing to have
multiple ways to do the same thing.

Please check the rest of the patch in case the same is done elsewhere.

======

FILE: src/backend/replication/logical/launcher.c

14. logicalrep_worker_detach

+ /*
+ * Detach from the error_mq_handle for all parallel apply workers
+ * before terminating them to prevent the leader apply worker from
+ * receiving the worker termination messages and sending it to logs
+ * when the same is already done by individual parallel worker.
+ */
+ pa_detach_all_error_mq();

"before terminating them to prevent" -> "before terminating them. This prevents"

"termination messages" -> "termination message"

"by individual" -> "by the"

======

FILE: src/backend/replication/logical/worker.c

15. File header comment

+ * 1) Write to temporary files and apply when the final commit arrives
+ *
+ * This approach is used when user has set subscription's streaming option as
+ * on.

"when user has set" -> "when the user has set the"

~

16.

+ * 2) Parallel apply workers.
+ *
+ * This approach is used when user has set subscription's streaming option as
+ * parallel. See logical/applyparallelworker.c for information about this
+ * approach.

"when user has set" -> "when the user has set the "


~~~

17. apply_handle_stream_stop

+ case TRANS_PARALLEL_APPLY:
+ elog(DEBUG1, "applied %u changes in the streaming chunk",
+ parallel_stream_nchanges);
+
+ /*
+ * By the time parallel apply worker is processing the changes in
+ * the current streaming block, the leader apply worker may have
+ * sent multiple streaming blocks. This can lead to parallel apply
+ * worker start waiting even when there are more chunk of streams
+ * in the queue. So, try to lock only if there is no message left
+ * in the queue. See Locking Considerations atop
+ * applyparallelworker.c.
+ */

SUGGESTION (minor rewording)

By the time the parallel apply worker is processing the changes in the
current streaming block, the leader apply worker may have sent
multiple streaming blocks. To prevent the parallel apply worker from waiting
unnecessarily, try to lock only if there is no message left in the
queue. See Locking Considerations atop applyparallelworker.c.

~~~

18. apply_handle_stream_abort

+ case TRANS_PARALLEL_APPLY:
+ pa_stream_abort(&abort_data);
+
+ /*
+ * We need to wait after processing rollback to savepoint for the
+ * next set of changes.
+ *
+ * By the time parallel apply worker is processing the changes in
+ * the current streaming block, the leader apply worker may have
+ * sent multiple streaming blocks. This can lead to parallel apply
+ * worker start waiting even when there are more chunk of streams
+ * in the queue. So, try to lock only if there is no message left
+ * in the queue. See Locking Considerations atop
+ * applyparallelworker.c.
+ */

Second paragraph ("By the time...") same review comment as the
previous one (#17)

~~~

19. store_flush_position

+ /*
+ * Skip for parallel apply workers. The leader apply worker will ensure to
+ * update it as the lsn_mapping is maintained by it.
+ */
+ if (am_parallel_apply_worker())
+ return;

SUGGESTION (the multiple "it"s in the comment were confusing)
Skip for parallel apply workers, because the lsn_mapping is maintained
by the leader apply worker.

~~~

20. set_apply_error_context_origin

+
+/* Set the origin name of apply error callback. */
+void
+set_apply_error_context_origin(char *originname)
+{
+ /*
+ * Allocate the origin name in long-lived context for error context
+ * message.
+ */
+ apply_error_callback_arg.origin_name = MemoryContextStrdup(ApplyContext,
+    originname);
+}

IMO that "Allocate ..." comment should just replace the function header comment.

~~~

21. apply_worker_clean_exit

I wasn't sure if calling this a 'clean' exit meant anything much.

How about:
- apply_worker_proc_exit, or
- apply_worker_exit

~

22.

+apply_worker_clean_exit(bool on_subinfo_change)
+{
+ if (am_parallel_apply_worker() && on_subinfo_change)
+ {
+ /*
+ * If a parallel apply worker exits due to the subscription
+ * information change, we notify the leader apply worker so that the
+ * leader can report more meaningful message in time and restart the
+ * logical replication.
+ */
+ pq_putmessage('X', NULL, 0);
+ }
+
+ proc_exit(0);
+}

SUGGESTION (for comment)
If this is a parallel apply worker exiting due to a subscription
information change, we notify the leader apply worker so that it can
report a more meaningful message before restarting the logical
replication.

======

FILE: src/include/commands/subscriptioncmds.h

23. externs

@@ -26,4 +26,6 @@ extern void DropSubscription(DropSubscriptionStmt
*stmt, bool isTopLevel);
 extern ObjectAddress AlterSubscriptionOwner(const char *name, Oid newOwnerId);
 extern void AlterSubscriptionOwner_oid(Oid subid, Oid newOwnerId);

+extern char defGetStreamingMode(DefElem *def);

The extern is not in the same order as the functions of subscriptioncmds.c

======

FILE: src/include/replication/worker_internal.h

24. externs

24a.

+extern void apply_dispatch(StringInfo s);
+
+extern void InitializeApplyWorker(void);
+
+extern void maybe_reread_subscription(void);

The above externs are not in the same order as the functions of worker.c

~

24b.

+extern void pa_lock_stream(TransactionId xid, LOCKMODE lockmode);
+extern void pa_lock_transaction(TransactionId xid, LOCKMODE lockmode);
+
+extern void pa_unlock_stream(TransactionId xid, LOCKMODE lockmode);
+extern void pa_unlock_transaction(TransactionId xid, LOCKMODE lockmode);

The above externs are not in the same order as the functions of
applyparallelworker.c

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, December 1, 2022 8:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Nov 30, 2022 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > 2.
> > + /*
> > + * The stream lock is released when processing changes in a
> > + * streaming block, so the leader needs to acquire the lock here
> > + * before entering PARTIAL_SERIALIZE mode to ensure that the
> > + * parallel apply worker will wait for the leader to release the
> > + * stream lock.
> > + */
> > + if (in_streamed_transaction &&
> > + action != LOGICAL_REP_MSG_STREAM_STOP) {
> > + pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);
> >
> > This comment is not completely correct because we can even acquire the
> > lock for the very streaming chunk. This check will work but doesn't
> > appear future-proof or at least not very easy to understand though I
> > don't have a better suggestion at this stage. Can we think of a better
> > check here?
> >
> 
> One idea is that we acquire this lock every time and callers like stream_commit
> are responsible to release it. Also, we can handle the close of stream file in the
> respective callers. I think that will make this part of the patch easier to follow.

Changed.

> Some other comments:
> =====================
> 1. The handling of buffile inside pa_stream_abort() looks bit ugly to me. I think
> you primarily required it because the buffile opened by parallel apply worker is
> in CurrentResourceOwner. 

Changed to use the toplevel transaction's resource owner.

> Can we think of having a new resource owner to
> apply spooled messages? I think that will avoid the need to have a special
> purpose code to handle buffiles in parallel apply worker.

I am thinking about this and will address it in the next version.

> 2.
> @@ -564,6 +571,7 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
>   TransactionId current_xid;
>   ParallelApplyWorkerInfo *winfo;
>   TransApplyAction apply_action;
> + StringInfoData original_msg;
> 
>   apply_action = get_transaction_apply_action(stream_xid, &winfo);
> 
> @@ -573,6 +581,8 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
> 
>   Assert(TransactionIdIsValid(stream_xid));
> 
> + original_msg = *s;
> +
>   /*
>   * We should have received XID of the subxact as the first part of the
>   * message, so extract it.
> @@ -596,10 +606,14 @@ handle_streamed_transaction(LogicalRepMsgType
> action, StringInfo s)
>   stream_write_change(action, s);
>   return true;
> 
> + case TRANS_LEADER_PARTIAL_SERIALIZE:
>   case TRANS_LEADER_SEND_TO_PARALLEL:
>   Assert(winfo);
> 
> - pa_send_data(winfo, s->len, s->data);
> + if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
> + 	pa_send_data(winfo, s->len, s->data);
> + else
> + 	stream_write_change(action, &original_msg);
> 
> Please add the comment to specify the reason to remember the original string.

Added.

> 3.
> @@ -1797,8 +1907,8 @@ apply_spooled_messages(TransactionId xid,
> XLogRecPtr lsn)
>   changes_filename(path, MyLogicalRepWorker->subid, xid);
>   elog(DEBUG1, "replaying changes from file \"%s\"", path);
> 
> - fd = BufFileOpenFileSet(MyLogicalRepWorker->stream_fileset, path,
> O_RDONLY,
> - false);
> + stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);
> + stream_xid = xid;
> 
> Why do we need stream_xid here? I think we can avoid having global stream_fd
> if the comment #1 is feasible.

I think we don't need it anymore, I have removed it.

> 4.
> + * TRANS_LEADER_APPLY:
> + * The action means that we
> 
> /The/This. Please make a similar change for other actions.
> 
> 5. Apart from the above, please find a few changes to the comments for
> 0001 and 0002 patches in the attached patches.

Merged.

Attach the new version patch set which addressed most of the comments received so
far except some comments being discussed[1].

[1]
https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hou zj


Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, December 2, 2022 7:27 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> Thanks for making the patch. The following are my comments for v54-0003 and
> 0004.

Thanks for the comments!

> 
> 0003
> 
> pa_free_worker()
> 
> +       /* Unlink any files that were needed to serialize partial changes. */
> +       if (winfo->serialize_changes)
> +               stream_cleanup_files(MyLogicalRepWorker->subid,
> winfo->shared->xid);
> +
> 
> I think this part is not needed, because the LA cannot reach here if
> winfo->serialize_changes is true. Moreover stream_cleanup_files() is done in
> pa_free_worker_info().

Removed.

> LogicalParallelApplyLoop()
> 
> The parallel apply worker wakes up every 0.1s even if we are in the
> PARTIAL_SERIALIZE mode. Do you have an idea to reduce that?

The parallel apply worker usually will wait on the stream lock after entering
PARTIAL_SERIALIZE mode.

> ```
> +                       pa_spooled_messages();
> ```
> 
> Comments are needed here, like "Changes may be serialized...".

Added.

> pa_stream_abort()
> 
> ```
> +                               /*
> +                                * Reopen the file and set the file position to
> the saved
> +                                * position.
> +                                */
> +                               if (reopen_stream_fd)
> +                               {
> +                                       char            path[MAXPGPATH];
> +
> +                                       changes_filename(path, MyLogicalRepWorker->subid, xid);
> +                                       stream_fd = BufFileOpenFileSet(&MyParallelShared->fileset,
> +                                                                      path, O_RDONLY, false);
> +                                       BufFileSeek(stream_fd, fileno, offset, SEEK_SET);
> +                               }
> ```
> 
> MyParallelShared->serialize_changes may be used instead of reopen_stream_fd.

These codes have been removed.

> 
> ```
> +                       /*
> +                        * It is possible that while sending this change to
> parallel apply
> +                        * worker we need to switch to serialize mode.
> +                        */
> +                       if (winfo->serialize_changes)
> +                               pa_set_fileset_state(winfo->shared,
> FS_READY);
> ```
> 
> There are three same parts in the code, can we combine them to common part?

These codes have been slightly refactored.

> apply_spooled_messages()
> 
> ```
> +               /*
> +                * Break the loop if the parallel apply worker has finished
> applying
> +                * the transaction. The parallel apply worker should have closed
> the
> +                * file before committing.
> +                */
> +               if (am_parallel_apply_worker() &&
> +                       MyParallelShared->xact_state ==
> PARALLEL_TRANS_FINISHED)
> +                       goto done;
> ```
> 
> I think pfree(buffer) and pfree(s2.data) should not be skipped.
> And this part should be below "nchanges++;".

buffer and s2.data were allocated in the toplevel transaction's context and
will be freed automatically soon when handling STREAM COMMIT.

> 
> 0004
> 
> set_subscription_retry()
> 
> ```
> +       LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
> +                                        AccessShareLock);
> +
> ```
> 
> I think AccessExclusiveLock should be acquired instead of AccessShareLock.
> In AlterSubscription(), LockSharedObject(AccessExclusiveLock) seems to be
> used.

Changed.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, December 2, 2022 4:59 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for patch v54-0001.

Thanks for the comments!

> ======
> 
> FILE: .../replication/logical/applyparallelworker.c
> 
> 1b.
> 
> + *
> + * This file contains routines that are intended to support setting up,
> + using
> + * and tearing down a ParallelApplyWorkerInfo which is required to
> + communicate
> + * among leader and parallel apply workers.
> 
> "that are intended to support" -> "for"

I find the current wording consistent with the comments atop vacuumparallel.c and
execParallel.c, so I didn't change this one.

> 3. pa_setup_dsm
> 
> +/*
> + * Set up a dynamic shared memory segment.
> + *
> + * We set up a control region that contains a fixed-size worker info
> + * (ParallelApplyWorkerShared), a message queue, and an error queue.
> + *
> + * Returns true on success, false on failure.
> + */
> +static bool
> +pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
> 
> IMO that's confusing to say "fixed-sized worker info" when it's referring to the
> ParallelApplyWorkerShared structure and not the other
> ParallelApplyWorkerInfo.
> 
> Might be better to say:
> 
> "a fixed-size worker info (ParallelApplyWorkerShared)" -> "a fixed-size struct
> (ParallelApplyWorkerShared)"

ParallelApplyWorkerShared is also a kind of information that is shared
between workers. So, I am fine with the current wording. Or maybe just "fixed-size info"?

> ~~~
> 
> 12. pa_clean_subtrans
> 
> +/* Reset the list that maintains subtransactions. */ void
> +pa_clean_subtrans(void)
> +{
> + subxactlist = NIL;
> +}
> 
> Maybe a more informative function name would be pa_reset_subxactlist()?

I thought the current name is more consistent with pa_start_subtrans.

> ~~~
> 
> 17. apply_handle_stream_stop
> 
> + case TRANS_PARALLEL_APPLY:
> + elog(DEBUG1, "applied %u changes in the streaming chunk",
> + parallel_stream_nchanges);
> +
> + /*
> + * By the time parallel apply worker is processing the changes in
> + * the current streaming block, the leader apply worker may have
> + * sent multiple streaming blocks. This can lead to parallel apply
> + * worker start waiting even when there are more chunk of streams
> + * in the queue. So, try to lock only if there is no message left
> + * in the queue. See Locking Considerations atop
> + * applyparallelworker.c.
> + */
> 
> SUGGESTION (minor rewording)
> 
> By the time the parallel apply worker is processing the changes in the current
> streaming block, the leader apply worker may have sent multiple streaming
> blocks. To prevent the parallel apply worker from waiting unnecessarily, try to lock only if there
> is no message left in the queue. See Locking Considerations atop
> applyparallelworker.c.
> 

Didn't change this one according to Amit's comment.

> 
> 21. apply_worker_clean_exit
> 
> I wasn't sure if calling this a 'clean' exit meant anything much.
> 
> How about:
> - apply_worker_proc_exit, or
> - apply_worker_exit

I thought the "clean" means the exit code is 0 (proc_exit(0)) and the exit is
not due to any ERROR. I am not sure if proc_exit or exit is better.

I have addressed other comments in the new version patch.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Sunday, December 4, 2022 7:17 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> Thursday, December 1, 2022 8:40 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > Some other comments:
> ...
> Attach the new version patch set which addressed most of the comments
> received so far except some comments being discussed[1].
> [1]
https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Attach a new version patch set which fixed a testcase failure on CFbot.

Best regards,
Hou zj

Attachment
On Sun, Dec 4, 2022 at 4:48 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, December 2, 2022 4:59 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> > ~~~
> >
> > 12. pa_clean_subtrans
> >
> > +/* Reset the list that maintains subtransactions. */ void
> > +pa_clean_subtrans(void)
> > +{
> > + subxactlist = NIL;
> > +}
> >
> > Maybe a more informative function name would be pa_reset_subxactlist()?
>
> I thought the current name is more consistent with pa_start_subtrans.
>

Then how about changing the name to pa_reset_subtrans()?

>
> >
> > 21. apply_worker_clean_exit
> >
> > I wasn't sure if calling this a 'clean' exit meant anything much.
> >
> > How about:
> > - apply_worker_proc_exit, or
> > - apply_worker_exit
>
> I thought the clean means the exit number is 0(proc_exit(0)) and is
> not due to any ERROR, I am not sure If proc_exit or exit is better.
>
> I have addressed other comments in the new version patch.
>

+1 for apply_worker_exit.

One minor suggestion for a recent change in v56-0001*:
 /*
- * A hash table used to cache streaming transactions being applied and the
- * parallel application workers required to apply transactions.
+ * A hash table used to cache the state of streaming transactions being
+ * applied by the parallel apply workers.
  */
 static HTAB *ParallelApplyTxnHash = NULL;

-- 
With Regards,
Amit Kapila.



Here are my review comments for patch v55-0002

======

.../replication/logical/applyparallelworker.c

1. pa_can_start

@@ -276,9 +278,9 @@ pa_can_start(TransactionId xid)
  /*
  * Don't start a new parallel worker if user has set skiplsn as it's
  * possible that user want to skip the streaming transaction. For
- * streaming transaction, we need to spill the transaction to disk so that
- * we can get the last LSN of the transaction to judge whether to skip
- * before starting to apply the change.
+ * streaming transaction, we need to serialize the transaction to a file
+ * so that we can get the last LSN of the transaction to judge whether to
+ * skip before starting to apply the change.
  */
  if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
  return false;

I think the wording change may belong in patch 0001 because it has
nothing to do with partial serializing.

~~~

2. pa_free_worker

+ /*
+ * Stop the worker if there are enough workers in the pool.
+ *
+ * XXX The worker is also stopped if the leader apply worker needed to
+ * serialize part of the transaction data due to a send timeout. This is
+ * because the message could be partially written to the queue due to send
+ * timeout and there is no way to clean the queue other than resending the
+ * message until it succeeds. To avoid complexity, we directly stop the
+ * worker in this case.
+ */
+ if (winfo->serialize_changes ||
+ napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

Don't need to say "due to send timeout" 2 times in 2 sentences.

SUGGESTION
XXX The worker is also stopped if the leader apply worker needed to
serialize part of the transaction data due to a send timeout. This is
because the message could be partially written to the queue but there
is no way to clean the queue other than resending the message until it
succeeds. Directly stopping the worker avoids needing this complexity.

~~~

3. pa_spooled_messages

Previously I suggested this function name should be changed but that
was rejected (see [1] #6a)

> 6a.
> IMO a better name for this function would be
> pa_apply_spooled_messages();
Not sure about this.

~

FYI, the reason for the previous suggestion is that there is no verb
in the current function name, so the reader is left wondering
pa_spooled_messages "what"?

It means the caller has to have extra comments like:
/* Check if changes have been serialized to a file. */
pa_spooled_messages();

OTOH, if the function was called something better -- e.g.
pa_check_for_spooled_messages() or similar -- then it would be
self-explanatory.

~

4.

 /*
+ * Replay the spooled messages in the parallel apply worker if the leader apply
+ * worker has finished serializing changes to the file.
+ */
+static void
+pa_spooled_messages(void)

I'm not 100% sure of the logic, so IMO maybe the comment should say a
bit more about how this works:

Specifically, let's say there was some timeout and the LA needed to
write the spool file, then let's say the PA timed out and found itself
inside this function. Now, let's say the LA is still busy writing the
file -- so what happens next?

Does this function simply return, then the main PA loop waits again,
then it times out again, then PA finds itself back inside this
function again... and that keeps happening over and over until
eventually the spool file is found FS_READY? Some explanatory comments
might help.

~

5.

+ /*
+ * Check if changes have been serialized to a file. if so, read and apply
+ * them.
+ */
+ SpinLockAcquire(&MyParallelShared->mutex);
+ fileset_state = MyParallelShared->fileset_state;
+ SpinLockRelease(&MyParallelShared->mutex);

"if so" -> "If so"

~~~


6. pa_send_data

+ *
+ * If the attempt to send data via shared memory times out, then we will switch
+ * to "PARTIAL_SERIALIZE mode" for the current transaction to prevent possible
+ * deadlocks with another parallel apply worker (refer to the comments atop
+ * applyparallelworker.c for details). This means that the current data and any
+ * subsequent data for this transaction will be serialized to a file.
  */
 void
 pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)

SUGGESTION (minor comment rearranging)

If the attempt to send data via shared memory times out, then we will
switch to "PARTIAL_SERIALIZE mode" for the current transaction -- this
means that the current data and any subsequent data for this
transaction will be serialized to a file. This is done to prevent
possible deadlocks with another parallel apply worker (refer to the
comments atop applyparallelworker.c for details).
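
FYI, the flow that comment describes is roughly as follows (the retry helper
and the timeout constant here are illustrative only, not the actual patch
code):

```
/*
 * Illustrative sketch: retry sending over the shared memory queue for a
 * bounded time; once the timeout elapses, flip the transaction into
 * PARTIAL_SERIALIZE mode so this change and all later ones go to a file.
 */
TimestampTz start = GetCurrentTimestamp();

while (!pa_try_send(winfo, nbytes, data))	/* hypothetical non-blocking send */
{
	if (TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
								   SHM_SEND_TIMEOUT_MS))	/* hypothetical constant */
	{
		winfo->serialize_changes = true;	/* enter PARTIAL_SERIALIZE mode */
		break;
	}
}
```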

~

7.

+ /*
+ * Take the stream lock to make sure that the parallel apply worker
+ * will wait for the leader to release the stream lock until the
+ * end of the transaction.
+ */
+ pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);

The comment doesn't sound right.

"until the end" -> "at the end" (??)

~~~

8. pa_stream_abort

@@ -1374,6 +1470,7 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
  RollbackToSavepoint(spname);
  CommitTransactionCommand();
  subxactlist = list_truncate(subxactlist, i + 1);
+
  break;
  }
  }
Spurious whitespace unrelated to this patch?

======

src/backend/replication/logical/worker.c

9. handle_streamed_transaction

  /*
+ * The parallel apply worker needs the xid in this message to decide
+ * whether to define a savepoint, so save the original message that has not
+ * moved the cursor after the xid. We will serailize this message to a file
+ * in PARTIAL_SERIALIZE mode.
+ */
+ original_msg = *s;

"serailize" -> "serialize"

~~~

10. apply_handle_stream_prepare

@@ -1245,6 +1265,7 @@ apply_handle_stream_prepare(StringInfo s)
  LogicalRepPreparedTxnData prepare_data;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg = *s;

Should this include a longer explanation of why this copy is needed
(same as was done in handle_streamed_transaction)?

~

11.

  case TRANS_PARALLEL_APPLY:
+
+ /*
+ * Close the file before committing if the parallel apply worker
+ * is applying spooled messages.
+ */
+ if (stream_fd)
+ stream_close_file();

11a.

This comment seems worded backwards.

SUGGESTION
If the parallel apply worker is applying spooled messages then close
the file before committing.

~

11b.

I'm confused - isn't there code doing exactly this (close file before
commit) already in the apply_handle_stream_commit
TRANS_PARALLEL_APPLY?

~~~

12. apply_handle_stream_start

@@ -1383,6 +1493,7 @@ apply_handle_stream_start(StringInfo s)
  bool first_segment;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg = *s;

Should this include a longer explanation of why this copy is needed
(same as was done in handle_streamed_transaction)?

~

13.

+ serialize_stream_start(stream_xid, false);
+ stream_write_change(LOGICAL_REP_MSG_STREAM_START, &original_msg);

- end_replication_step();
  break;

A spurious blank line is left before the break;

~~~

14. serialize_stream_stop

+ /* We must be in a valid transaction state */
+ Assert(IsTransactionState());

The comment seems redundant. The code says the same.

~~~

15. apply_handle_stream_abort

@@ -1676,6 +1794,7 @@ apply_handle_stream_abort(StringInfo s)
  LogicalRepStreamAbortData abort_data;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg = *s;
  bool toplevel_xact;

Should this include a longer explanation of why this copy is needed
(same as was done in handle_streamed_transaction)?

~~~

16. apply_spooled_messages

+ stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);

Something still seems a bit odd about this to me (previously also
mentioned in review [1] #29) but I cannot quite put my finger on it...

AFAIK the 'stream_fd' is the global the LA is using to remember the
single stream spool file; It corresponds to the LogicalRepWorker's
'stream_fileset'. So using that same global on the PA side somehow
seemed strange to me. The fileset at PA comes from a different place
(MyParallelShared->fileset).

Basically, I felt that whenever we are using 'stream_fd' and
'stream_fileset' etc. then it should be safe to assume you are looking
at the worker.c code from the leader apply worker's POV. Otherwise, IMO it
should just use some fd/fs passed around as parameters. Sure, there
might be a few places like stream_close_file (etc) which need some
small refactoring to pass as a parameter instead of always using
'stream_fd' but IMO the end result will be tidier.

~

17.

+ /*
+ * No need to output the DEBUG message here in the parallel apply
+ * worker as similar messages will be output when handling STREAM_STOP
+ * message.
+ */
+ if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
  elog(DEBUG1, "replayed %d changes from file \"%s\"",
  nchanges, path);

Instead of saying what you are not doing ("No need to output... in the
parallel apply worker"), wouldn't it make more sense to reverse it and
say what you are doing ("Only log DEBUG messages for the leader apply
worker because ...")? Then the condition also becomes positive:

if (am_leader_apply_worker())
{
...
}

~

18.

+ if (am_parallel_apply_worker() &&
+ MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
+ goto done;
+
+ /*
+ * No need to output the DEBUG message here in the parallel apply
+ * worker as similar messages will be output when handling STREAM_STOP
+ * message.
+ */
+ if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
  elog(DEBUG1, "replayed %d changes from file \"%s\"",
  nchanges, path);
  }

- BufFileClose(fd);
-
+ stream_close_file();
  pfree(buffer);
  pfree(s2.data);

+done:
  elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
  nchanges, path);

Shouldn't that "done:" label be *above* the pfree's. Otherwise, those
are going to be skipped over by the "goto done;".

~~~

19. apply_handle_stream_commit

@@ -1898,6 +2072,7 @@ apply_handle_stream_commit(StringInfo s)
  LogicalRepCommitData commit_data;
  ParallelApplyWorkerInfo *winfo;
  TransApplyAction apply_action;
+ StringInfoData original_msg = *s;

Should this include a longer explanation of why this copy is needed
(same as was done in handle_streamed_transaction)?

~

20.

+ /*
+ * Close the file before committing if the parallel apply worker
+ * is applying spooled messages.
+ */
+ if (stream_fd)
+ stream_close_file();

(same as previous review comment - see #11)

This comment seems worded backwards.

SUGGESTION
If the parallel apply worker is applying spooled messages then close
the file before committing.

======

src/include/replication/worker_internal.h

21. PartialFileSetState


+ * State of fileset in leader apply worker.
+ *
+ * FS_BUSY means that the leader is serializing changes to the file. FS_READY
+ * means that the leader has serialized all changes to the file and the file is
+ * ready to be read by a parallel apply worker.
+ */
+typedef enum PartialFileSetState

"ready to be read" sounded a bit strange.

SUGGESTION
... to the file so it is now OK for a parallel apply worker to read it.
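
For reference, the two states described above amount to something like this
(the actual patch may name or order the states differently):

```
typedef enum PartialFileSetState
{
	FS_BUSY,	/* leader is still serializing changes to the file */
	FS_READY	/* all changes are in the file; a parallel apply worker may read it */
} PartialFileSetState;
```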


------
[1] Houz reply to my review v51-0002 --
https://www.postgresql.org/message-id/OS0PR01MB5716350729D8C67AA8CE333194129%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Dec 6, 2022 at 5:27 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for patch v55-0002
>
...
>
> 3. pa_spooled_messages
>
> Previously I suggested this function name should be changed but that
> was rejected (see [1] #6a)
>
> > 6a.
> > IMO a better name for this function would be
> > pa_apply_spooled_messages();
> Not sure about this.
>
> ~
>
> FYI the reason for the previous suggestion is because there is no verb
> in the current function name, so the reader is left thinking
> pa_spooled_messages "what"?
>
> It means the caller has to have extra comments like:
> /* Check if changes have been serialized to a file. */
> pa_spooled_messages();
>
> OTOH, if the function was called something better -- e.g.
> pa_check_for_spooled_messages() or similar -- then it would be
> self-explanatory.
>

I think pa_check_for_spooled_messages() could be misleading because we
do apply the changes in that function, so probably a comment as
suggested by you is a better option.

> ~
>
> 4.
>
>  /*
> + * Replay the spooled messages in the parallel apply worker if the leader apply
> + * worker has finished serializing changes to the file.
> + */
> +static void
> +pa_spooled_messages(void)
>
> I'm not 100% sure of the logic, so IMO maybe the comment should say a
> bit more about how this works:
>
> Specifically, let's say there was some timeout and the LA needed to
> write the spool file, then let's say the PA timed out and found itself
> inside this function. Now, let's say the LA is still busy writing the
> file -- so what happens next?
>
> Does this function simply return, then the main PA loop waits again,
> then the times out again, then PA finds itself back inside this
> function again... and that keeps happening over and over until
> eventually the spool file is found FS_READY? Some explanatory comments
> might help.
>

No, PA will simply wait for LA to finish. See the code handling for
FS_BUSY state. We might want to slightly improve part of the current
comment to: "If the leader apply worker is busy serializing the
partial changes then acquire the stream lock now and wait for the
leader worker to finish serializing the changes".
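
In other words, the handling is roughly the following (simplified; the
argument list and other details are illustrative, not the exact patch code):

```
/* Simplified sketch of the pa_spooled_messages() handling. */
PartialFileSetState fileset_state;

SpinLockAcquire(&MyParallelShared->mutex);
fileset_state = MyParallelShared->fileset_state;
SpinLockRelease(&MyParallelShared->mutex);

if (fileset_state == FS_BUSY)
{
	/*
	 * The leader apply worker is still serializing the partial changes;
	 * block on the stream lock until it finishes and releases the lock.
	 */
	pa_lock_stream(MyParallelShared->xid, AccessShareLock);
	pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
}
else if (fileset_state == FS_READY)
{
	/* The leader has finished writing; replay the spooled changes now. */
	apply_spooled_messages(&MyParallelShared->fileset,
						   MyParallelShared->xid,
						   InvalidXLogRecPtr);	/* arguments illustrative */
}
```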

>
> 16. apply_spooled_messages
>
> + stream_fd = BufFileOpenFileSet(stream_fileset, path, O_RDONLY, false);
>
> Something still seems a bit odd about this to me (previously also
> mentioned in review [1] #29) but I cannot quite put my finger on it...
>
> AFAIK the 'stream_fd' is the global the LA is using to remember the
> single stream spool file; It corresponds to the LogicalRepWorker's
> 'stream_fileset'. So using that same global on the PA side somehow
> seemed strange to me. The fileset at PA comes from a different place
> (MyParallelShared->fileset).
>

I think 'stream_fd' is specific to the apply module, which can be used by
the apply, tablesync, or parallel apply worker. Unfortunately, the code in
worker.c is currently a mix of worker and apply module code. At some point,
we should separate the apply logic into a separate file.

-- 
With Regards,
Amit Kapila.



On Mon, Dec 5, 2022 at 9:59 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach a new version patch set which fixed a testcase failure on CFbot.
>

Few comments:
============
1.
+ /*
+ * Break the loop if the parallel apply worker has finished applying
+ * the transaction. The parallel apply worker should have closed the
+ * file before committing.
+ */
+ if (am_parallel_apply_worker() &&
+ MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
+ goto done;

This looks hackish to me because ideally, this API should exit after
reading and applying all the messages in the spool file. This check is
primarily based on the knowledge that once we reach some state, the
file won't have more data. I think it would be better to explicitly
ensure the same.

2.
+ /*
+ * No need to output the DEBUG message here in the parallel apply
+ * worker as similar messages will be output when handling STREAM_STOP
+ * message.
+ */
+ if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
  elog(DEBUG1, "replayed %d changes from file \"%s\"",
  nchanges, path);
  }

This check appears a bit ugly to me. I think it is okay to
get a similar DEBUG message at another place (on stream_stop) because
(a) this is logged every 1000 messages whereas stream_stop can be
after many more messages, so there doesn't appear to be a direct
correlation; (b) due to this, we can identify whether it is due to
spooled messages or due to direct apply; ideally we could use another
DEBUG message to differentiate, but this doesn't appear bad to me.

3. The function names for serialize_stream_start(),
serialize_stream_stop(), and serialize_stream_abort() don't seem to
match the functionality they provide because none of these
write/serialize changes to the file. Can we rename these? Some
possible options could be stream_start_internal or stream_start_guts.

-- 
With Regards,
Amit Kapila.



On Tue, Dec 6, 2022 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 6, 2022 at 5:27 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for patch v55-0002
> >
> ...

> > 4.
> >
> >  /*
> > + * Replay the spooled messages in the parallel apply worker if the leader apply
> > + * worker has finished serializing changes to the file.
> > + */
> > +static void
> > +pa_spooled_messages(void)
> >
> > I'm not 100% sure of the logic, so IMO maybe the comment should say a
> > bit more about how this works:
> >
> > Specifically, let's say there was some timeout and the LA needed to
> > write the spool file, then let's say the PA timed out and found itself
> > inside this function. Now, let's say the LA is still busy writing the
> > file -- so what happens next?
> >
> > Does this function simply return, then the main PA loop waits again,
> > then it times out again, then PA finds itself back inside this
> > function again... and that keeps happening over and over until
> > eventually the spool file is found FS_READY? Some explanatory comments
> > might help.
> >
>
> No, PA will simply wait for LA to finish. See the code handling for
> FS_BUSY state. We might want to slightly improve part of the current
> comment to: "If the leader apply worker is busy serializing the
> partial changes then acquire the stream lock now and wait for the
> leader worker to finish serializing the changes".
>

Sure, "PA will simply wait for LA to finish".

Except I think it's not quite that simple because IIUC when LA  *does*
finish, the PA (this function) will continue and just drop out the
bottom -- it cannot apply those spooled messages yet until it cycles
all the way back around the main loop and times out again and gets
back into pa_spooled_messages function again to get to the FS_READY
block of code where it can finally call the
'apply_spooled_messages'...

If my understanding is correct, then It's that extra looping that I
thought maybe warrants some mention in a comment here.
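
To illustrate what I mean by "cycles all the way back around", the main
loop shape I have in mind is roughly this (simplified and illustrative only;
the wait event, queue handle 'mqh', and exact calls may differ in the patch):

```
/* Illustrative shape of the parallel apply worker's main loop. */
for (;;)
{
	shm_mq_result res;
	Size		len;
	void	   *data;

	/* Try to get the next message from the leader without blocking. */
	res = shm_mq_receive(mqh, &len, &data, true);
	if (res == SHM_MQ_WOULD_BLOCK)
	{
		/* Changes may have been serialized to a file; check and apply. */
		pa_spooled_messages();

		/* Sleep up to ~0.1s before looping around again. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 100L, WAIT_EVENT_LOGICAL_APPLY_MAIN);
		ResetLatch(MyLatch);
		continue;
	}

	/* ... dispatch the received change ... */
}
```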

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, December 6, 2022 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Dec 5, 2022 at 9:59 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach a new version patch set which fixed a testcase failure on CFbot.
> >
> 
> Few comments:
> ============
> 1.
> + /*
> + * Break the loop if the parallel apply worker has finished applying
> + * the transaction. The parallel apply worker should have closed the
> + * file before committing.
> + */
> + if (am_parallel_apply_worker() &&
> + MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
> + goto done;
> 
> This looks hackish to me because ideally, this API should exit after reading and
> applying all the messages in the spool file. This check is primarily based on the
> knowledge that once we reach some state, the file won't have more data. I
> think it would be better to explicitly ensure the same.

I added a function to ensure that there is no message left after committing
the transaction.
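
Something along these lines (an illustrative sketch only; the actual
function name and placement in the patch may differ):

```
/*
 * Illustrative sketch: after the parallel apply worker has marked the
 * transaction as finished, verify that nothing is left unread in the
 * changes file by attempting one more read and expecting end-of-file.
 */
static void
ensure_no_pending_messages(BufFile *fd, const char *path)
{
	char		dummy;

	if (BufFileRead(fd, &dummy, 1) != 0)
		elog(ERROR,
			 "unexpected message left in streaming transaction's changes file \"%s\"",
			 path);
}
```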


> 2.
> + /*
> + * No need to output the DEBUG message here in the parallel apply
> + * worker as similar messages will be output when handling STREAM_STOP
> + * message.
> + */
> + if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
>   elog(DEBUG1, "replayed %d changes from file \"%s\"",
>   nchanges, path);
>   }
> 
> I think this check appeared a bit ugly to me. I think it is okay to get a similar
> DEBUG message at another place (on stream_stop) because
> (a) this is logged every 1000 messages whereas stream_stop can be after many
> more messages, so there doesn't appear to be a direct correlation; (b) due to
> this, we can identify whether it is due to spooled messages or due to direct
> apply; ideally we can use another DEBUG message to differentiate but this
> doesn't appear bad to me.

OK, I removed this check.

> 3. The function names for serialize_stream_start(), serialize_stream_stop(), and
> serialize_stream_abort() don't seem to match the functionality they provide
> because none of these write/serialize changes to the file. Can we rename
> these? Some possible options could be stream_start_internal or
> stream_start_guts.

Renamed to stream_start_internal().

Attach the new version patch set which addressed the above comments.
I also attach a new patch to force streaming changes (provided by Shi-san) and
another one that introduces a GUC stream_serialize_threshold (provided by
Kuroda-san and Shi-san), which can help with testing the patch set.

Besides, I fixed a bug where there could still be messages left in the memory
queue while the PA had already started to apply the spooled messages.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tue, Dec 6, 2022 7:57 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are my review comments for patch v55-0002

Thanks for your comments.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 1. pa_can_start
> 
> @@ -276,9 +278,9 @@ pa_can_start(TransactionId xid)
>   /*
>   * Don't start a new parallel worker if user has set skiplsn as it's
>   * possible that user want to skip the streaming transaction. For
> - * streaming transaction, we need to spill the transaction to disk so 
> that
> - * we can get the last LSN of the transaction to judge whether to 
> skip
> - * before starting to apply the change.
> + * streaming transaction, we need to serialize the transaction to a 
> + file
> + * so that we can get the last LSN of the transaction to judge 
> + whether to
> + * skip before starting to apply the change.
>   */
>   if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
>   return false;
> 
> I think the wording change may belong in patch 0001 because it has 
> nothing to do with partial serializing.

Changed.

> ~~~
> 
> 2. pa_free_worker
> 
> + /*
> + * Stop the worker if there are enough workers in the pool.
> + *
> + * XXX The worker is also stopped if the leader apply worker needed 
> + to
> + * serialize part of the transaction data due to a send timeout. This 
> + is
> + * because the message could be partially written to the queue due to 
> + send
> + * timeout and there is no way to clean the queue other than 
> + resending the
> + * message until it succeeds. To avoid complexity, we directly stop 
> + the
> + * worker in this case.
> + */
> + if (winfo->serialize_changes ||
> + napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> 
> Don't need to say "due to send timeout" 2 times in 2 sentences.
> 
> SUGGESTION
> XXX The worker is also stopped if the leader apply worker needed to 
> serialize part of the transaction data due to a send timeout. This is 
> because the message could be partially written to the queue but there 
> is no way to clean the queue other than resending the message until it 
> succeeds. Directly stopping the worker avoids needing this complexity.

Changed.

> 4.
> 
>  /*
> + * Replay the spooled messages in the parallel apply worker if the 
> +leader apply
> + * worker has finished serializing changes to the file.
> + */
> +static void
> +pa_spooled_messages(void)
> 
> I'm not 100% sure of the logic, so IMO maybe the comment should say a 
> bit more about how this works:
> 
> Specifically, let's say there was some timeout and the LA needed to 
> write the spool file, then let's say the PA timed out and found itself 
> inside this function. Now, let's say the LA is still busy writing the 
> file -- so what happens next?
> 
> Does this function simply return, then the main PA loop waits again, 
> then the times out again, then PA finds itself back inside this 
> function again... and that keeps happening over and over until 
> eventually the spool file is found FS_READY? Some explanatory comments 
> might help.

Slightly changed the logic and comment here.
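To make the intended behaviour a bit clearer, the check now looks roughly like
the following (only a sketch, reusing the names from the hunks quoted above;
the exact code in the patch may differ slightly):

```
	/*
	 * If the leader is still busy serializing, simply return; the parallel
	 * apply worker's main loop will wait on its latch and end up here again
	 * in the next cycle, so there is no busy-waiting.
	 */
	PartialFileSetState fileset_state;

	SpinLockAcquire(&MyParallelShared->mutex);
	fileset_state = MyParallelShared->fileset_state;
	SpinLockRelease(&MyParallelShared->mutex);

	if (fileset_state != FS_READY)
		return;				/* leader not finished yet; retry next cycle */

	/* The file is complete, so it is now safe to replay the spooled messages. */
```

That is, when the leader has not finished writing the file yet, the PA just
goes back to its main loop and retries later.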

> ~
> 
> 5.
> 
> + /*
> + * Check if changes have been serialized to a file. if so, read and 
> + apply
> + * them.
> + */
> + SpinLockAcquire(&MyParallelShared->mutex);
> + fileset_state = MyParallelShared->fileset_state; 
> + SpinLockRelease(&MyParallelShared->mutex);
> 
> "if so" -> "If so"

Changed.

> ~~~
> 
> 
> 6. pa_send_data
> 
> + *
> + * If the attempt to send data via shared memory times out, then we 
> + will
> switch
> + * to "PARTIAL_SERIALIZE mode" for the current transaction to prevent
> possible
> + * deadlocks with another parallel apply worker (refer to the 
> + comments atop
> + * applyparallelworker.c for details). This means that the current 
> + data and any
> + * subsequent data for this transaction will be serialized to a file.
>   */
>  void
>  pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void 
> *data)
> 
> SUGGESTION (minor comment rearranging)
> 
> If the attempt to send data via shared memory times out, then we will 
> switch to "PARTIAL_SERIALIZE mode" for the current transaction -- this 
> means that the current data and any subsequent data for this 
> transaction will be serialized to a file. This is done to prevent 
> possible deadlocks with another parallel apply worker (refer to the 
> comments atop applyparallelworker.c for details).

Changed.

> ~
> 
> 7.
> 
> + /*
> + * Take the stream lock to make sure that the parallel apply worker
> + * will wait for the leader to release the stream lock until the
> + * end of the transaction.
> + */
> + pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);
> 
> The comment doesn't sound right.
> 
> "until the end" -> "at the end" (??)

I think it means "PA wait ... until the end of transaction".

> ~~~
> 
> 8. pa_stream_abort
> 
> @@ -1374,6 +1470,7 @@ pa_stream_abort(LogicalRepStreamAbortData
> *abort_data)
>   RollbackToSavepoint(spname);
>   CommitTransactionCommand();
>   subxactlist = list_truncate(subxactlist, i + 1);
> +
>   break;
>   }
>   }
> Spurious whitespace unrelated to this patch?

Changed.

> ======
> 
> src/backend/replication/logical/worker.c
> 
> 9. handle_streamed_transaction
> 
>   /*
> + * The parallel apply worker needs the xid in this message to decide
> + * whether to define a savepoint, so save the original message that 
> + has not
> + * moved the cursor after the xid. We will serailize this message to 
> + a file
> + * in PARTIAL_SERIALIZE mode.
> + */
> + original_msg = *s;
> 
> "serailize" -> "serialize"

Changed.

> ~~~
> 
> 10. apply_handle_stream_prepare
> 
> @@ -1245,6 +1265,7 @@ apply_handle_stream_prepare(StringInfo s)
>   LogicalRepPreparedTxnData prepare_data;
>   ParallelApplyWorkerInfo *winfo;
>   TransApplyAction apply_action;
> + StringInfoData original_msg = *s;
> 
> Should this include a longer explanation of why this copy is needed 
> (same as was done in handle_streamed_transaction)?

Added the below comment atop this variable.
```
Save the message before it is consumed.
```

> ~
> 
> 11.
> 
>   case TRANS_PARALLEL_APPLY:
> +
> + /*
> + * Close the file before committing if the parallel apply worker
> + * is applying spooled messages.
> + */
> + if (stream_fd)
> + stream_close_file();
> 
> 11a.
> 
> This comment seems worded backwards.
> 
> SUGGESTION
> If the parallel apply worker is applying spooled messages then close 
> the file before committing.

Changed.

> ~
> 
> 11b.
> 
> I'm confused - isn't there code doing exactly this (close file before
> commit) already in the apply_handle_stream_commit 
> TRANS_PARALLEL_APPLY?

I think there is a typo here.
Changed the action in the comment (committing -> preparing).

> ~
> 
> 13.
> 
> + serialize_stream_start(stream_xid, false); 
> + stream_write_change(LOGICAL_REP_MSG_STREAM_START, &original_msg);
> 
> - end_replication_step();
>   break;
> 
> A spurious blank line is left before the break;

Changed.

> ~~~
> 
> 14. serialize_stream_stop
> 
> + /* We must be in a valid transaction state */ 
> + Assert(IsTransactionState());
> 
> The comment seems redundant. The code says the same.

Changed.

> ~
> 
> 17.
> 
> + /*
> + * No need to output the DEBUG message here in the parallel apply
> + * worker as similar messages will be output when handling 
> + STREAM_STOP
> + * message.
> + */
> + if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
>   elog(DEBUG1, "replayed %d changes from file \"%s\"",
>   nchanges, path);
> 
> Instead of saying what you are not doing  ("No need to... in output 
> apply worker") wouldn't it make more sense to reverse it and say what 
> you are doing ("Only log DEBUG messages for the leader apply worker 
> because ...") and then the condition also becomes positive:
> 
> if (am_leader_apply_worker())
> {
> ...
> }

Removed this condition according to Amit's comment.

> ~
> 
> 18.
> 
> + if (am_parallel_apply_worker() &&
> + MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
> + goto done;
> +
> + /*
> + * No need to output the DEBUG message here in the parallel apply
> + * worker as similar messages will be output when handling 
> + STREAM_STOP
> + * message.
> + */
> + if (!am_parallel_apply_worker() && nchanges % 1000 == 0)
>   elog(DEBUG1, "replayed %d changes from file \"%s\"",
>   nchanges, path);
>   }
> 
> - BufFileClose(fd);
> -
> + stream_close_file();
>   pfree(buffer);
>   pfree(s2.data);
> 
> +done:
>   elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
>   nchanges, path);
> 
> Shouldn't that "done:" label be *above* the pfree's. Otherwise, those 
> are going to be skipped over by the "goto done;".

After reconsidering, I think there is no need to pfree these two variables here,
because they are allocated in the toplevel transaction's memory context and will be freed very soon.
So, I just removed these pfree() calls.

> ======
> 
> src/include/replication/worker_internal.h
> 
> 21. PartialFileSetState
> 
> 
> + * State of fileset in leader apply worker.
> + *
> + * FS_BUSY means that the leader is serializing changes to the file. 
> +FS_READY
> + * means that the leader has serialized all changes to the file and 
> +the file is
> + * ready to be read by a parallel apply worker.
> + */
> +typedef enum PartialFileSetState
> 
> "ready to be read" sounded a bit strange.
> 
> SUGGESTION
> ... to the file so it is now OK for a parallel apply worker to read it.

Changed.

Best regards,
Hou zj

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Dec 1, 2022 at 7:17 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, December 1, 2022 3:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Nov 30, 2022 at 10:51 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wednesday, November 30, 2022 9:41 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > > > > Review comments on v53-0001*
> > > >
> > > > Attach the new version patch set.
> > >
> > > Sorry, there were some mistakes in the previous patch set.
> > > Here is the correct V54 patch set. I also ran pgindent for the patch set.
> > >
> >
> > Thank you for updating the patches. Here are random review comments for
> > 0001 and 0002 patches.
>
> Thanks for the comments!
>
> >
> > ereport(ERROR,
> >                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                  errmsg("logical replication parallel apply worker exited
> > abnormally"),
> >                  errcontext("%s", edata.context))); and
> >
> > ereport(ERROR,
> >                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > I'm not sure ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is appropriate
> > here. Given that parallel apply worker has already reported the error message
> > with the error code, I think we don't need to set the errorcode for the logs
> > from the leader process.
> >
> > Also, I'm not sure the term "exited abnormally" is appropriate since we use it
> > when the server crashes for example. I think ERRORs reported here don't mean
> > that in general.
>
> How about reporting "xxx worker exited due to error" ?

Sounds better to me.
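For instance, something like this (a sketch that just mirrors the snippet
quoted above, minus the errcode):

```
	ereport(ERROR,
			(errmsg("logical replication parallel apply worker exited due to error"),
			 errcontext("%s", edata.context)));
```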

>
> > ---
> > if (am_parallel_apply_worker() && on_subinfo_change) {
> >     /*
> >      * If a parallel apply worker exits due to the subscription
> >      * information change, we notify the leader apply worker so that the
> >      * leader can report more meaningful message in time and restart the
> >      * logical replication.
> >      */
> >     pq_putmessage('X', NULL, 0);
> > }
> >
> > and
> >
> > ereport(ERROR,
> >                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > Do we really need an additional message in case of 'X'? When we call
> > apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> > error message such as:
> >
> > ereport(LOG,
> >         (errmsg("logical replication parallel apply worker for subscription
> > \"%s\" will stop because of a parameter change",
> >                 MySubscription->name)));
> >
> > I think that reporting a similar message from the leader might not be
> > meaningful for users.
>
> The intention is to let leader report more meaningful message if a worker
> exited due to subinfo change. Otherwise, the leader is likely to report an
> error like " lost connection ... to parallel apply worker" when trying to send
> data via shared memory if the worker exited. What do you think ?

Agreed. But do we need to have the leader exit with an error in spite
of the fact that the worker cleanly exits? If the leader exits with an
error, the subscription will be disabled if disable_on_error is true,
right?

And what do you think about the error code?

>
> > ---
> > -                if (options->proto.logical.streaming &&
> > -                        PQserverVersion(conn->streamConn) >= 140000)
> > -                        appendStringInfoString(&cmd, ", streaming 'on'");
> > +                if (options->proto.logical.streaming_str)
> > +                        appendStringInfo(&cmd, ", streaming '%s'",
> > +
> > options->proto.logical.streaming_str);
> >
> > and
> >
> > +        /*
> > +         * Assign the appropriate option value for streaming option
> > according to
> > +         * the 'streaming' mode and the publisher's ability to
> > support that mode.
> > +         */
> > +        if (server_version >= 160000 &&
> > +                MySubscription->stream == SUBSTREAM_PARALLEL)
> > +        {
> > +                options.proto.logical.streaming_str = pstrdup("parallel");
> > +                MyLogicalRepWorker->parallel_apply = true;
> > +        }
> > +        else if (server_version >= 140000 &&
> > +                         MySubscription->stream != SUBSTREAM_OFF)
> > +        {
> > +                options.proto.logical.streaming_str = pstrdup("on");
> > +                MyLogicalRepWorker->parallel_apply = false;
> > +        }
> > +        else
> > +        {
> > +                options.proto.logical.streaming_str = NULL;
> > +                MyLogicalRepWorker->parallel_apply = false;
> > +        }
> >
> > This change moves the code of adjustment of the streaming option based on
> > the publisher server version from libpqwalreceiver.c to worker.c.
> > On the other hand, the similar logic for other parameters such as "two_phase"
> > and "origin" are still done in libpqwalreceiver.c. How about passing
> > MySubscription->stream via WalRcvStreamOptions and constructing a
> > streaming option string in libpqrcv_startstreaming()?
> > In ApplyWorkerMain(), we just need to set
> > MyLogicalRepWorker->parallel_apply = true if (server_version >= 160000
> > && MySubscription->stream == SUBSTREAM_PARALLEL). We won't need
> > pstrdup for "parallel" and "on", and it's more consistent with other parameters.
>
> Thanks for the suggestion. I thought about the same idea before, but it seems
> we would need to introduce "pg_subscription.h" into libpqwalreceiver.c.
> libpqwalreceiver.c looks like a common place, so I am not sure it would look
> better to expose the details of the streaming option to it.

Right. It means that all enum parameters of WalRcvStreamOptions need
to be handled in the caller (e.g., worker.c) whereas other
parameters are handled in libpqwalreceiver.c. It's not elegant but I
have no better idea for that.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Dec 7, 2022 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 7:17 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > > ---
> > > if (am_parallel_apply_worker() && on_subinfo_change) {
> > >     /*
> > >      * If a parallel apply worker exits due to the subscription
> > >      * information change, we notify the leader apply worker so that the
> > >      * leader can report more meaningful message in time and restart the
> > >      * logical replication.
> > >      */
> > >     pq_putmessage('X', NULL, 0);
> > > }
> > >
> > > and
> > >
> > > ereport(ERROR,
> > >                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > >                  errmsg("logical replication parallel apply worker exited
> > > because of subscription information change")));
> > >
> > > Do we really need an additional message in case of 'X'? When we call
> > > apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> > > error message such as:
> > >
> > > ereport(LOG,
> > >         (errmsg("logical replication parallel apply worker for subscription
> > > \"%s\" will stop because of a parameter change",
> > >                 MySubscription->name)));
> > >
> > > I think that reporting a similar message from the leader might not be
> > > meaningful for users.
> >
> > The intention is to let leader report more meaningful message if a worker
> > exited due to subinfo change. Otherwise, the leader is likely to report an
> > error like " lost connection ... to parallel apply worker" when trying to send
> > data via shared memory if the worker exited. What do you think ?
>
> Agreed. But do we need to have the leader exit with an error in spite
> of the fact that the worker cleanly exits? If the leader exits with an
> error, the subscription will be disabled if disable_on_error is true,
> right?
>

Right, but the leader will anyway exit at some point either due to an
ERROR like "lost connection ... to parallel worker" or with a LOG
like: "... will restart because of a parameter change" but I see your
point. So, will it be better if we have a LOG message here and then
proc_exit()? Do you have something else in mind for this?
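That is, on the leader side, something along these lines instead of the ERROR
(only a sketch; the exact wording can be adjusted):

```
	ereport(LOG,
			(errmsg("logical replication parallel apply worker exited because of subscription information change")));
	proc_exit(0);
```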

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 7, 2022 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 1, 2022 at 7:17 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > > ---
> > > > if (am_parallel_apply_worker() && on_subinfo_change) {
> > > >     /*
> > > >      * If a parallel apply worker exits due to the subscription
> > > >      * information change, we notify the leader apply worker so that the
> > > >      * leader can report more meaningful message in time and restart the
> > > >      * logical replication.
> > > >      */
> > > >     pq_putmessage('X', NULL, 0);
> > > > }
> > > >
> > > > and
> > > >
> > > > ereport(ERROR,
> > > >                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > >                  errmsg("logical replication parallel apply worker exited
> > > > because of subscription information change")));
> > > >
> > > > Do we really need an additional message in case of 'X'? When we call
> > > > apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> > > > error message such as:
> > > >
> > > > ereport(LOG,
> > > >         (errmsg("logical replication parallel apply worker for subscription
> > > > \"%s\" will stop because of a parameter change",
> > > >                 MySubscription->name)));
> > > >
> > > > I think that reporting a similar message from the leader might not be
> > > > meaningful for users.
> > >
> > > The intention is to let leader report more meaningful message if a worker
> > > exited due to subinfo change. Otherwise, the leader is likely to report an
> > > error like " lost connection ... to parallel apply worker" when trying to send
> > > data via shared memory if the worker exited. What do you think ?
> >
> > Agreed. But do we need to have the leader exit with an error in spite
> > of the fact that the worker cleanly exits? If the leader exits with an
> > error, the subscription will be disabled if disable_on_error is true,
> > right?
> >
>
> Right, but the leader will anyway exit at some point either due to an
> ERROR like "lost connection ... to parallel worker" or with a LOG
> like: "... will restart because of a parameter change" but I see your
> point. So, will it be better if we have a LOG message here and then
> proc_exit()? Do you have something else in mind for this?

No, I was thinking that too. It's better to write a LOG message and do
proc_exit().

Regarding the error "lost connection ... to parallel worker", it could
still happen depending on the timing even if the parallel worker
cleanly exits due to parameter changes, right? If so, I'm concerned
that it could lead to disable the subscription unexpectedly if
disable_on_error is enabled.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Dec 7, 2022 at 10:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Right, but the leader will anyway exit at some point either due to an
> > ERROR like "lost connection ... to parallel worker" or with a LOG
> > like: "... will restart because of a parameter change" but I see your
> > point. So, will it be better if we have a LOG message here and then
> > proc_exit()? Do you have something else in mind for this?
>
> No, I was thinking that too. It's better to write a LOG message and do
> proc_exit().
>
> Regarding the error "lost connection ... to parallel worker", it could
> still happen depending on the timing even if the parallel worker
> cleanly exits due to parameter changes, right? If so, I'm concerned
> that it could lead to disable the subscription unexpectedly if
> disable_on_error is enabled.
>

If we want to avoid this then I think we have the following options:
(a) the parallel apply worker skips checking for parameter changes;
(b) the parallel worker won't exit on a parameter change but will
silently absorb the parameter and continue its processing; anyway, the
leader will detect it and stop the worker for the parameter change.

Among these, the advantage of (b) is that it will allow reflecting a
parameter change (that doesn't need a restart) in the parallel worker.
Do you have any better idea to deal with this?

-- 
With Regards,
Amit Kapila.



On Wed, Dec 7, 2022 at 8:28 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Besides, I fixed a bug where there could still be messages left in memory
> queue and the PA has started to apply spooled message.
>

Few comments on the recent changes in the patch:
========================================
1. It seems you need to set FS_SERIALIZE_DONE in
stream_prepare/commit/abort. They are still directly setting the state
as READY. Am, I missing something or you forgot to change it?

2.
  case TRANS_PARALLEL_APPLY:
  pa_stream_abort(&abort_data);

+ /*
+ * Reset the stream_fd after aborting the toplevel transaction in
+ * case the parallel apply worker is applying spooled messages
+ */
+ if (toplevel_xact)
+ stream_fd = NULL;

I think we can keep the handling of stream file the same in
abort/commit/prepare code path.

3. It is already pointed out by Peter that it is better to add some
comments in pa_spooled_messages() function that we won't be
immediately able to apply changes after the lock is released, it will
be done in the next cycle.

4. Shall we rename FS_SERIALIZE as FS_SERIALIZE_IN_PROGRESS? That will
appear consistent with FS_SERIALIZE_DONE.

5. Comment improvements:
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index b26d587ae4..921d973863 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1934,8 +1934,7 @@ apply_handle_stream_abort(StringInfo s)
 }

 /*
- * Check if the passed fileno and offset are the last fileno and position of
- * the fileset, and report an ERROR if not.
+ * Ensure that the passed location is fileset's end.
  */
 static void
 ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
@@ -2084,9 +2083,9 @@ apply_spooled_messages(FileSet *stream_fileset,
TransactionId xid,
                nchanges++;

                /*
-                * Break the loop if stream_fd is set to NULL which
means the parallel
-                * apply worker has finished applying the transaction.
The parallel
-                * apply worker should have closed the file before committing.
+                * It is possible the file has been closed because we
have processed
+                * some transaction end message like stream_commit in
which case that
+                * must be the last message.
                 */

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Dec 5, 2022 at 1:29 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Sunday, December 4, 2022 7:17 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> >
> > Thursday, December 1, 2022 8:40 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > Some other comments:
> > ...
> > Attach the new version patch set which addressed most of the comments
> > received so far except some comments being discussed[1].
> > [1]
https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com
>
> Attach a new version patch set which fixed a testcase failure on CFbot.

Here are some comments on v56 0001, 0002 patches. Please ignore
comments if you already incorporated them in v57.

+static void
+ProcessParallelApplyInterrupts(void)
+{
+        CHECK_FOR_INTERRUPTS();
+
+        if (ShutdownRequestPending)
+        {
+                ereport(LOG,
+                                (errmsg("logical replication parallel
apply worker for subscrip
tion \"%s\" has finished",
+                                                MySubscription->name)));
+
+                apply_worker_clean_exit(false);
+        }
+
+        if (ConfigReloadPending)
+        {
+                ConfigReloadPending = false;
+                ProcessConfigFile(PGC_SIGHUP);
+        }
+}

I personally think that we don't need to have a function to do only
these few things.

---
+/* Disallow streaming in-progress transactions. */
+#define SUBSTREAM_OFF 'f'
+
+/*
+ * Streaming in-progress transactions are written to a temporary file and
+ * applied only after the transaction is committed on upstream.
+ */
+#define SUBSTREAM_ON 't'
+
+/*
+ * Streaming in-progress transactions are applied immediately via a parallel
+ * apply worker.
+ */
+#define SUBSTREAM_PARALLEL 'p'
+

While these names look good to me, we already have the following
existing values:

*/
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
* The subscription will request the publisher to
* have any origin.
*/
#define LOGICALREP_ORIGIN_NONE "none"

/*
* The subscription will request the publisher to
* of their origin.
*/
#define LOGICALREP_ORIGIN_ANY "any"

Should we change the names to something like LOGICALREP_STREAM_PARALLEL?
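For example (just a sketch; the values stay the same, only the names change):

```
/* Disallow streaming of in-progress transactions. */
#define LOGICALREP_STREAM_OFF        'f'

/* Spill streamed transactions to files and apply them at commit time. */
#define LOGICALREP_STREAM_ON         't'

/* Apply streamed transactions immediately via a parallel apply worker. */
#define LOGICALREP_STREAM_PARALLEL   'p'
```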

---
+ * The lock graph for the above example will look as follows:
+ * LA (waiting to acquire the lock on the unique index) -> PA (waiting to
+ * acquire the lock on the remote transaction) -> LA

and

+ * The lock graph for the above example will look as follows:
+ * LA (waiting to acquire the transaction lock) -> PA-2 (waiting to acquire the
+ * lock due to unique index constraint) -> PA-1 (waiting to acquire the stream
+ * lock) -> LA

"(waiting to acquire the lock on the remote transaction)" in the first
example and "(waiting to acquire the stream lock)" in the second
example is the same meaning, right? If so, I think we should use
either term for consistency.

---
+        bool           write_abort_info = (data->streaming ==
SUBSTREAM_PARALLEL);

I think that instead of setting write_abort_info every time
pgoutput_stream_abort() is called, we can set it once, probably in
PGOutputData, at startup.

---
server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
options.proto.logical.proto_version =
+                server_version >= 160000 ?
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
        server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
        server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
        LOGICALREP_PROTO_VERSION_NUM;

Instead of always using the new protocol version, I think we can use
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
'parallel'. That way, we don't need to change the protocol version check
logic in pgoutput.c and don't need to expose defGetStreamingMode().
What do you think?

---
When max_parallel_apply_workers_per_subscription is changed to a value
lower than the number of parallel workers running at that time, do we
need to stop the extra workers?

---
If the value of max_parallel_apply_workers_per_subscription is not
sufficient, we get the LOG "out of parallel apply workers" every time
the apply worker doesn't launch a worker. But do we really need
this log? It doesn't seem consistent with the
max_sync_workers_per_subscription behavior. I think we can check if
the number of running parallel workers is less than
max_parallel_apply_workers_per_subscription before calling
logicalrep_worker_launch(). What do you think?

---
+        if (server_version >= 160000 &&
+                MySubscription->stream == SUBSTREAM_PARALLEL)
+        {
+                options.proto.logical.streaming_str = pstrdup("parallel");
+                MyLogicalRepWorker->parallel_apply = true;
+        }
+        else if (server_version >= 140000 &&
+                         MySubscription->stream != SUBSTREAM_OFF)
+        {
+                options.proto.logical.streaming_str = pstrdup("on");
+                MyLogicalRepWorker->parallel_apply = false;
+        }

I think we don't need to use pstrdup().

---
-       BeginTransactionBlock();
-       CommitTransactionCommand(); /* Completes the preceding Begin command. */
+       if (!IsTransactionBlock())
+       {
+               BeginTransactionBlock();
+               CommitTransactionCommand(); /* Completes the preceding
Begin command. */
+       }

Do we need this change? In my environment, 'make check-world' passes
without this change.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Dec 7, 2022 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 7, 2022 at 10:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Right, but the leader will anyway exit at some point either due to an
> > > ERROR like "lost connection ... to parallel worker" or with a LOG
> > > like: "... will restart because of a parameter change" but I see your
> > > point. So, will it be better if we have a LOG message here and then
> > > proc_exit()? Do you have something else in mind for this?
> >
> > No, I was thinking that too. It's better to write a LOG message and do
> > proc_exit().
> >
> > Regarding the error "lost connection ... to parallel worker", it could
> > still happen depending on the timing even if the parallel worker
> > cleanly exits due to parameter changes, right? If so, I'm concerned
> > that it could lead to disable the subscription unexpectedly if
> > disable_on_error is enabled.
> >
>
> If we want to avoid this then I think we have the following options
> (a) parallel apply skips checking parameter change (b) parallel worker
> won't exit on parameter change but will silently absorb the parameter
> and continue its processing; anyway, the leader will detect it and
> stop the worker for the parameter change
>
> Among these, the advantage of (b) is that it will allow reflecting the
> parameter change (that doesn't need restart) in the parallel worker.
> Do you have any better idea to deal with this?

I think (b) is better. We need to reflect the synchronous_commit
parameter also in parallel workers in the worker pool.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Dec 5, 2022 at 1:29 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Sunday, December 4, 2022 7:17 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com>
> > >
> > > Thursday, December 1, 2022 8:40 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > > wrote:
> > > > Some other comments:
> > > ...
> > > Attach the new version patch set which addressed most of the comments
> > > received so far except some comments being discussed[1].
> > > [1]
> https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C
> 8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> >
> > Attach a new version patch set which fixed a testcase failure on CFbot.
> 
> Here are some comments on v56 0001, 0002 patches. Please ignore
> comments if you already incorporated them in v57.

Thanks for the comments!

> +static void
> +ProcessParallelApplyInterrupts(void)
> +{
> +        CHECK_FOR_INTERRUPTS();
> +
> +        if (ShutdownRequestPending)
> +        {
> +                ereport(LOG,
> +                                (errmsg("logical replication parallel
> apply worker for subscrip
> tion \"%s\" has finished",
> +                                                MySubscription->name)));
> +
> +                apply_worker_clean_exit(false);
> +        }
> +
> +        if (ConfigReloadPending)
> +        {
> +                ConfigReloadPending = false;
> +                ProcessConfigFile(PGC_SIGHUP);
> +        }
> +}
> 
> I personally think that we don't need to have a function to do only
> these few things.

I thought that introducing a new function makes the handling of worker-specific
interrupts similar to other existing ones, like
ProcessWalRcvInterrupts() in walreceiver.c and HandlePgArchInterrupts() in
pgarch.c ...

> 
> Should we change the names to something like
> LOGICALREP_STREAM_PARALLEL?

Agreed, will change.

> ---
> + * The lock graph for the above example will look as follows:
> + * LA (waiting to acquire the lock on the unique index) -> PA (waiting to
> + * acquire the lock on the remote transaction) -> LA
> 
> and
> 
> + * The lock graph for the above example will look as follows:
> + * LA (waiting to acquire the transaction lock) -> PA-2 (waiting to acquire the
> + * lock due to unique index constraint) -> PA-1 (waiting to acquire the stream
> + * lock) -> LA
> 
> "(waiting to acquire the lock on the remote transaction)" in the first
> example and "(waiting to acquire the stream lock)" in the second
> example is the same meaning, right? If so, I think we should use
> either term for consistency.

Will change.

> ---
> +        bool           write_abort_info = (data->streaming ==
> SUBSTREAM_PARALLEL);
> 
> I think that instead of setting write_abort_info every time when
> pgoutput_stream_abort() is called, we can set it once, probably in
> PGOutputData, at startup.

I thought that since we already have a "stream" flag in PGOutputData, I am not
sure if it would be better to introduce another flag for the same option.


> ---
> server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> options.proto.logical.proto_version =
> +                server_version >= 160000 ?
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
>         server_version >= 150000 ?
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
>         server_version >= 140000 ?
> LOGICALREP_PROTO_STREAM_VERSION_NUM :
>         LOGICALREP_PROTO_VERSION_NUM;
> 
> Instead of always using the new protocol version, I think we can use
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> 'parallel'. That way, we don't need to change protocl version check
> logic in pgoutput.c and don't need to expose defGetStreamingMode().
> What do you think?

I think that some users can also use the new version number when trying to get
changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
leaving the check for the new version number seems fine.

Besides, I feel that even if we don't use the new version number, we still need to use
defGetStreamingMode to check if parallel mode is in use, as we need to send
abort_lsn when parallel is in use. I might be missing something, sorry for
that. Can you please explain the idea a bit?

> ---
> When max_parallel_apply_workers_per_subscription is changed to a value
> lower than the number of parallel worker running at that time, do we
> need to stop extra workers?

I think we can do this, e.g., by adding a check in the main loop of the leader worker and
checking every time after reloading the conf. OTOH, we will also stop the worker after
finishing a transaction, so I am slightly unsure whether we need to add another check here.
But I am fine with adding it if you think it would be better.


> ---
> If a value of max_parallel_apply_workers_per_subscription is not
> sufficient, we get the LOG "out of parallel apply workers" every time
> when the apply worker doesn't launch a worker. But do we really need
> this log? It seems not consistent with
> max_sync_workers_per_subscription behavior. I think we can check if
> the number of running parallel workers is less than
> max_parallel_apply_workers_per_subscription before calling
> logicalrep_worker_launch(). What do you think?
> 
> ---
> +        if (server_version >= 160000 &&
> +                MySubscription->stream == SUBSTREAM_PARALLEL)
> +        {
> +                options.proto.logical.streaming_str = pstrdup("parallel");
> +                MyLogicalRepWorker->parallel_apply = true;
> +        }
> +        else if (server_version >= 140000 &&
> +                         MySubscription->stream != SUBSTREAM_OFF)
> +        {
> +                options.proto.logical.streaming_str = pstrdup("on");
> +                MyLogicalRepWorker->parallel_apply = false;
> +        }
> 
> I think we don't need to use pstrdup().

Will remove.

> ---
> -       BeginTransactionBlock();
> -       CommitTransactionCommand(); /* Completes the preceding Begin
> command. */
> +       if (!IsTransactionBlock())
> +       {
> +               BeginTransactionBlock();
> +               CommitTransactionCommand(); /* Completes the preceding
> Begin command. */
> +       }
> 
> Do we need this change? In my environment, 'make check-world' passes
> without this change.

We will start a transaction block when defining the savepoint and we will get
a warning[1] if we enter this function later. I think there would be some WARNINGs in
the log of the "022_twophase_cascade" test if we remove this check.

[1] WARNING: there is already a transaction in progress

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Dec 5, 2022 at 1:29 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Sunday, December 4, 2022 7:17 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com>
> > >
> > > Thursday, December 1, 2022 8:40 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > > wrote:
> > > > Some other comments:
> > > ...
> > > Attach the new version patch set which addressed most of the comments
> > > received so far except some comments being discussed[1].
> > > [1]
> https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C
> 8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> >
> > Attach a new version patch set which fixed a testcase failure on CFbot.
> 
> ---
> If a value of max_parallel_apply_workers_per_subscription is not
> sufficient, we get the LOG "out of parallel apply workers" every time
> when the apply worker doesn't launch a worker. But do we really need
> this log? It seems not consistent with
> max_sync_workers_per_subscription behavior. I think we can check if
> the number of running parallel workers is less than
> max_parallel_apply_workers_per_subscription before calling
> logicalrep_worker_launch(). What do you think?

(Sorry, I missed this comment in the last email.)

I personally feel giving a hint might help the user to realize that
max_parallel_apply_workers_per_subscription is not enough for the current
workload and then they can adjust the parameter. Otherwise, the user might not
have an easy way to check whether more workers are needed.
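For example, the message could carry an explicit hint, something like (a
sketch only; the exact wording is illustrative):

```
	ereport(LOG,
			(errmsg("out of parallel apply workers"),
			 errhint("You might need to increase max_parallel_apply_workers_per_subscription.")));
```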

Best regards,
Hou zj

On Wed, Dec 7, 2022 at 6:33 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
>
> > ---
> > When max_parallel_apply_workers_per_subscription is changed to a value
> > lower than the number of parallel worker running at that time, do we
> > need to stop extra workers?
>
> I think we can do this, like adding a check in the main loop of leader worker, and
> check every time after reloading the conf. OTOH, we will also stop the worker after
> finishing a transaction, so I am slightly not sure do we need to add another check logic here.
> But I am fine to add it if you think it would be better.
>

I think this is tricky because it is possible that all active workers
are busy with long-running transactions, so I think stopping them
doesn't make sense. As long as we are freeing them after use,
it seems okay to me. OTOH, each time after finishing a transaction,
we can stop the workers if the workers in the free pool exceed
'max_parallel_apply_workers_per_subscription'. I don't know if it is
worth it.
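In code it would roughly mean changing the condition in pa_free_worker() along
these lines (a sketch only, reusing the names from the hunk quoted earlier in
this thread; how the worker is actually stopped is whatever the patch already
does on that path):

```
	/*
	 * Sketch: at transaction end, stop the worker instead of returning it to
	 * the pool once the pool already holds
	 * max_parallel_apply_workers_per_subscription workers.
	 */
	if (winfo->serialize_changes ||
		napplyworkers > max_parallel_apply_workers_per_subscription)
	{
		/* stop this worker instead of keeping it in the pool */
	}
```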

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, December 7, 2022 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Dec 7, 2022 at 8:28 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Besides, I fixed a bug where there could still be messages left in
> > memory queue and the PA has started to apply spooled message.
> >
> 
> Few comments on the recent changes in the patch:
> ========================================
> 1. It seems you need to set FS_SERIALIZE_DONE in
> stream_prepare/commit/abort. They are still directly setting the state as
> READY. Am, I missing something or you forgot to change it?

It's my miss, changed.

> 2.
>   case TRANS_PARALLEL_APPLY:
>   pa_stream_abort(&abort_data);
> 
> + /*
> + * Reset the stream_fd after aborting the toplevel transaction in
> + * case the parallel apply worker is applying spooled messages */ if
> + (toplevel_xact) stream_fd = NULL;
> 
> I think we can keep the handling of stream file the same in
> abort/commit/prepare code path.

Changed.

> 3. It is already pointed out by Peter that it is better to add some comments in
> pa_spooled_messages() function that we won't be immediately able to apply
> changes after the lock is released, it will be done in the next cycle.

Added.

> 4. Shall we rename FS_SERIALIZE as FS_SERIALIZE_IN_PROGRESS? That will
> appear consistent with FS_SERIALIZE_DONE.

Agreed, changed.

> 5. Comment improvements:
> diff --git a/src/backend/replication/logical/worker.c
> b/src/backend/replication/logical/worker.c
> index b26d587ae4..921d973863 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -1934,8 +1934,7 @@ apply_handle_stream_abort(StringInfo s)  }
> 
>  /*
> - * Check if the passed fileno and offset are the last fileno and position of
> - * the fileset, and report an ERROR if not.
> + * Ensure that the passed location is fileset's end.
>   */
>  static void
>  ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno, @@
> -2084,9 +2083,9 @@ apply_spooled_messages(FileSet *stream_fileset,
> TransactionId xid,
>                 nchanges++;
> 
>                 /*
> -                * Break the loop if stream_fd is set to NULL which
> means the parallel
> -                * apply worker has finished applying the transaction.
> The parallel
> -                * apply worker should have closed the file before committing.
> +                * It is possible the file has been closed because we
> have processed
> +                * some transaction end message like stream_commit in
> which case that
> +                * must be the last message.
>                  */

Merged, thanks.

Attached is the new version patch which addresses all of the above comments and part of
the comments from [1], except some comments that are still being discussed.

Apart from the above, according to the comments from Amit and Sawada-san[2], the new
version patch won't stop the parallel worker due to a subscription parameter
change; it will absorb the change instead, and the leader will anyway detect
the parameter change and stop all workers later.

Based on this, I also removed the maybe_reread_subscription() call in the parallel
apply worker's main loop, because we need to make sure we won't update the local
subscription parameters in the middle of a transaction. We will call
maybe_reread_subscription() before starting a transaction in the parallel apply
worker anyway, so removing that check is fine and saves some code.

[1] https://www.postgresql.org/message-id/CAD21AoCZ3i9w1Rz-81Lv1QB%2BJGP60Ypiom4%2BwM9eP3aQTx0STQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD21AoAzYstJVM0nMVnXZoeYamqD2j92DkWVH%3DYbGtA4yzy19A%40mail.gmail.com
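So, in the new version, the parallel apply worker's interrupt handling only
absorbs GUC changes and no longer re-reads the subscription; the relevant part
is simply (simplified from ProcessParallelApplyInterrupts() in the patch):

```
	if (ConfigReloadPending)
	{
		ConfigReloadPending = false;
		ProcessConfigFile(PGC_SIGHUP);
	}
```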

Best regards,
Hou zj


Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Dec 8, 2022 at 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 7, 2022 at 6:33 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> >
> > > ---
> > > When max_parallel_apply_workers_per_subscription is changed to a value
> > > lower than the number of parallel worker running at that time, do we
> > > need to stop extra workers?
> >
> > I think we can do this, like adding a check in the main loop of leader worker, and
> > check every time after reloading the conf. OTOH, we will also stop the worker after
> > finishing a transaction, so I am slightly not sure do we need to add another check logic here.
> > But I am fine to add it if you think it would be better.
> >
>
> I think this is tricky because it is possible that all active workers
> are busy with long-running transactions, so, I think stopping them
> doesn't make sense.

Right, we should not stop running parallel workers.

> I think as long as we are freeing them after use
> it seems okay to me. OTOH, each time after finishing the transaction,
> we can stop the workers, if the workers in the free pool exceed
> 'max_parallel_apply_workers_per_subscription'.

Or the apply leader worker can check that after reloading the config file.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 5, 2022 at 1:29 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Sunday, December 4, 2022 7:17 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com>
> > > >
> > > > Thursday, December 1, 2022 8:40 PM Amit Kapila
> > <amit.kapila16@gmail.com>
> > > > wrote:
> > > > > Some other comments:
> > > > ...
> > > > Attach the new version patch set which addressed most of the comments
> > > > received so far except some comments being discussed[1].
> > > > [1]
> > https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C
> > 8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > >
> > > Attach a new version patch set which fixed a testcase failure on CFbot.
> >
> > Here are some comments on v56 0001, 0002 patches. Please ignore
> > comments if you already incorporated them in v57.
>
> Thanks for the comments!
>
> > +static void
> > +ProcessParallelApplyInterrupts(void)
> > +{
> > +        CHECK_FOR_INTERRUPTS();
> > +
> > +        if (ShutdownRequestPending)
> > +        {
> > +                ereport(LOG,
> > +                                (errmsg("logical replication parallel
> > apply worker for subscrip
> > tion \"%s\" has finished",
> > +                                                MySubscription->name)));
> > +
> > +                apply_worker_clean_exit(false);
> > +        }
> > +
> > +        if (ConfigReloadPending)
> > +        {
> > +                ConfigReloadPending = false;
> > +                ProcessConfigFile(PGC_SIGHUP);
> > +        }
> > +}
> >
> > I personally think that we don't need to have a function to do only
> > these few things.
>
> I thought that introduce a new function make the handling of worker specific
> Interrupts logic similar to other existing ones. Like:
> ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> pgarch.c ...

I think the difference from them is that there is only one place to
call ProcessParallelApplyInterrupts().

>
> >
> > Should we change the names to something like
> > LOGICALREP_STREAM_PARALLEL?
>
> Agreed, will change.
>
> > ---
> > + * The lock graph for the above example will look as follows:
> > + * LA (waiting to acquire the lock on the unique index) -> PA (waiting to
> > + * acquire the lock on the remote transaction) -> LA
> >
> > and
> >
> > + * The lock graph for the above example will look as follows:
> > + * LA (waiting to acquire the transaction lock) -> PA-2 (waiting to acquire the
> > + * lock due to unique index constraint) -> PA-1 (waiting to acquire the stream
> > + * lock) -> LA
> >
> > "(waiting to acquire the lock on the remote transaction)" in the first
> > example and "(waiting to acquire the stream lock)" in the second
> > example is the same meaning, right? If so, I think we should use
> > either term for consistency.
>
> Will change.
>
> > ---
> > +        bool           write_abort_info = (data->streaming ==
> > SUBSTREAM_PARALLEL);
> >
> > I think that instead of setting write_abort_info every time when
> > pgoutput_stream_abort() is called, we can set it once, probably in
> > PGOutputData, at startup.
>
> I thought that since we already have a "stream" flag in PGOutputData, I am not
> sure if it would be better to introduce another flag for the same option.

I see your point. Another way is to have it as a static variable like
publish_no_origin. But since it's a trivial change, I'm also fine with
the current code.

>
> > ---
> > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > options.proto.logical.proto_version =
> > +                server_version >= 160000 ?
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> >         server_version >= 150000 ?
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> >         server_version >= 140000 ?
> > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> >         LOGICALREP_PROTO_VERSION_NUM;
> >
> > Instead of always using the new protocol version, I think we can use
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > 'parallel'. That way, we don't need to change protocl version check
> > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > What do you think?
>
> I think that some user can also use the new version number when trying to get
> changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> leave the check for new version number seems fine.
>
> Besides, I feel even if we don't use new version number, we still need to use
> defGetStreamingMode to check if parallel mode in used as we need to send
> abort_lsn when parallel is in used. I might be missing something, sorry for
> that. Can you please explain the idea a bit ?

My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
(server_version >= 160000 && MySubscription->stream ==
SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
to send "streaming = parallel" to the publisher since the publisher
can decide whether or not to send abort_lsn based on the protocol
version (still needs to send "streaming = on" though). I might be
missing something.

My question came from the fact that the difference between
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM and
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM is just whether or not to
send abort_lsn and there are two knobs to control that. IIUC even if
we use the new protocol version, the data actually sent during logical
replication is the same as with the previous protocol version if streaming
is not 'parallel'. So I thought that we should either not send 'parallel'
to the publisher (i.e., send abort_lsn based on the protocol version)
or not introduce a new protocol version (i.e., send abort_lsn based on
the streaming option).
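To be concrete, under this idea the version selection in worker.c would look
roughly like the following (a sketch only):

```
	options.proto.logical.proto_version =
		(server_version >= 160000 &&
		 MySubscription->stream == SUBSTREAM_PARALLEL) ?
			LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
		LOGICALREP_PROTO_VERSION_NUM;
```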

>
> > ---
> > When max_parallel_apply_workers_per_subscription is changed to a value
> > lower than the number of parallel worker running at that time, do we
> > need to stop extra workers?
>
> I think we can do this, like adding a check in the main loop of leader worker, and
> check every time after reloading the conf. OTOH, we will also stop the worker after
> finishing a transaction, so I am slightly not sure do we need to add another check logic here.
> But I am fine to add it if you think it would be better.
>
>
> > ---
> > If a value of max_parallel_apply_workers_per_subscription is not
> > sufficient, we get the LOG "out of parallel apply workers" every time
> > when the apply worker doesn't launch a worker. But do we really need
> > this log? It seems not consistent with
> > max_sync_workers_per_subscription behavior. I think we can check if
> > the number of running parallel workers is less than
> > max_parallel_apply_workers_per_subscription before calling
> > logicalrep_worker_launch(). What do you think?
> >
> > ---
> > +        if (server_version >= 160000 &&
> > +                MySubscription->stream == SUBSTREAM_PARALLEL)
> > +        {
> > +                options.proto.logical.streaming_str = pstrdup("parallel");
> > +                MyLogicalRepWorker->parallel_apply = true;
> > +        }
> > +        else if (server_version >= 140000 &&
> > +                         MySubscription->stream != SUBSTREAM_OFF)
> > +        {
> > +                options.proto.logical.streaming_str = pstrdup("on");
> > +                MyLogicalRepWorker->parallel_apply = false;
> > +        }
> >
> > I think we don't need to use pstrdup().
>
> Will remove.
>
> > ---
> > -       BeginTransactionBlock();
> > -       CommitTransactionCommand(); /* Completes the preceding Begin
> > command. */
> > +       if (!IsTransactionBlock())
> > +       {
> > +               BeginTransactionBlock();
> > +               CommitTransactionCommand(); /* Completes the preceding
> > Begin command. */
> > +       }
> >
> > Do we need this change? In my environment, 'make check-world' passes
> > without this change.
>
> We will start a transaction block when defining the savepoint and we will get
> a warning[1] if enter this function later. I think there would be some WARNs in
> the log of " 022_twophase_cascade" test if we remove this check.

Thanks, I understood.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > > +static void
> > > +ProcessParallelApplyInterrupts(void)
> > > +{
> > > +        CHECK_FOR_INTERRUPTS();
> > > +
> > > +        if (ShutdownRequestPending)
> > > +        {
> > > +                ereport(LOG,
> > > +                                (errmsg("logical replication parallel
> > > apply worker for subscrip
> > > tion \"%s\" has finished",
> > > +                                                MySubscription->name)));
> > > +
> > > +                apply_worker_clean_exit(false);
> > > +        }
> > > +
> > > +        if (ConfigReloadPending)
> > > +        {
> > > +                ConfigReloadPending = false;
> > > +                ProcessConfigFile(PGC_SIGHUP);
> > > +        }
> > > +}
> > >
> > > I personally think that we don't need to have a function to do only
> > > these few things.
> >
> > I thought that introduce a new function make the handling of worker specific
> > Interrupts logic similar to other existing ones. Like:
> > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> > pgarch.c ...
>
> I think the difference from them is that there is only one place to
> call ProcessParallelApplyInterrupts().
>

But I feel it is better to isolate this code in a separate function.
What if we decide to extend it further by having some logic to stop
workers after reloading of config?

> >
> > > ---
> > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > options.proto.logical.proto_version =
> > > +                server_version >= 160000 ?
> > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > >         server_version >= 150000 ?
> > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > >         server_version >= 140000 ?
> > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > >         LOGICALREP_PROTO_VERSION_NUM;
> > >
> > > Instead of always using the new protocol version, I think we can use
> > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > 'parallel'. That way, we don't need to change protocl version check
> > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > What do you think?
> >
> > I think that some user can also use the new version number when trying to get
> > changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> > leave the check for new version number seems fine.
> >
> > Besides, I feel even if we don't use new version number, we still need to use
> > defGetStreamingMode to check if parallel mode in used as we need to send
> > abort_lsn when parallel is in used. I might be missing something, sorry for
> > that. Can you please explain the idea a bit ?
>
> My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> (server_version >= 160000 && MySubscription->stream ==
> SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> 160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
> version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> to send "streaming = parallel" to the publisher since the publisher
> can decide whether or not to send abort_lsn based on the protocol
> version (still needs to send "streaming = on" though). I might be
> missing something.
>

What if we decide to send some additional information as part of
another patch, like we are discussing in the thread [1]? Then we won't
be able to decide the version number based on just the streaming
option. Also, in such a case, even for
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
idea to send the additional abort information unless the user has used
the streaming=parallel option.

[1] - https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > > +static void
> > > > +ProcessParallelApplyInterrupts(void)
> > > > +{
> > > > +        CHECK_FOR_INTERRUPTS();
> > > > +
> > > > +        if (ShutdownRequestPending)
> > > > +        {
> > > > +                ereport(LOG,
> > > > +                                (errmsg("logical replication parallel
> > > > apply worker for subscrip
> > > > tion \"%s\" has finished",
> > > > +                                                MySubscription->name)));
> > > > +
> > > > +                apply_worker_clean_exit(false);
> > > > +        }
> > > > +
> > > > +        if (ConfigReloadPending)
> > > > +        {
> > > > +                ConfigReloadPending = false;
> > > > +                ProcessConfigFile(PGC_SIGHUP);
> > > > +        }
> > > > +}
> > > >
> > > > I personally think that we don't need to have a function to do only
> > > > these few things.
> > >
> > > I thought that introduce a new function make the handling of worker specific
> > > Interrupts logic similar to other existing ones. Like:
> > > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> > > pgarch.c ...
> >
> > I think the difference from them is that there is only one place to
> > call ProcessParallelApplyInterrupts().
> >
>
> But I feel it is better to isolate this code in a separate function.
> What if we decide to extend it further by having some logic to stop
> workers after reloading of config?

I think we can separate the function out at that time. But let's keep
the current code since you and Hou both agree with it; I'm not going
to insist on that.

>
> > >
> > > > ---
> > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > options.proto.logical.proto_version =
> > > > +                server_version >= 160000 ?
> > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > >         server_version >= 150000 ?
> > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > >         server_version >= 140000 ?
> > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > >         LOGICALREP_PROTO_VERSION_NUM;
> > > >
> > > > Instead of always using the new protocol version, I think we can use
> > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > 'parallel'. That way, we don't need to change protocl version check
> > > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > > What do you think?
> > >
> > > I think that some user can also use the new version number when trying to get
> > > changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> > > leave the check for new version number seems fine.
> > >
> > > Besides, I feel even if we don't use new version number, we still need to use
> > > defGetStreamingMode to check if parallel mode in used as we need to send
> > > abort_lsn when parallel is in used. I might be missing something, sorry for
> > > that. Can you please explain the idea a bit ?
> >
> > My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> > (server_version >= 160000 && MySubscription->stream ==
> > SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> > 160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
> > version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> > to send "streaming = parallel" to the publisher since the publisher
> > can decide whether or not to send abort_lsn based on the protocol
> > version (still needs to send "streaming = on" though). I might be
> > missing something.
> >
>
> What if we decide to send some more additional information as part of
> another patch like we are discussing in the thread [1]? Now, we won't
> be able to decide the version number based on just the streaming
> option. Also, in such a case, even for
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
> idea to send additional abort information unless the user has used the
> streaming=parallel option.

If we're going to send the additional information, it makes sense to
send streaming=parallel. But the next question that comes to mind is:
why do we need to increase the protocol version for the parallel apply
feature at all? If sending the additional information is also
controlled by an option like "streaming", we can decide what to send
based on these options, no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 8, 2022 at 7:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > >
> > > > > +static void
> > > > > +ProcessParallelApplyInterrupts(void)
> > > > > +{
> > > > > +        CHECK_FOR_INTERRUPTS();
> > > > > +
> > > > > +        if (ShutdownRequestPending)
> > > > > +        {
> > > > > +                ereport(LOG,
> > > > > +                                (errmsg("logical replication parallel
> > > > > apply worker for subscrip
> > > > > tion \"%s\" has finished",
> > > > > +                                                MySubscription->name)));
> > > > > +
> > > > > +                apply_worker_clean_exit(false);
> > > > > +        }
> > > > > +
> > > > > +        if (ConfigReloadPending)
> > > > > +        {
> > > > > +                ConfigReloadPending = false;
> > > > > +                ProcessConfigFile(PGC_SIGHUP);
> > > > > +        }
> > > > > +}
> > > > >
> > > > > I personally think that we don't need to have a function to do only
> > > > > these few things.
> > > >
> > > > I thought that introduce a new function make the handling of worker specific
> > > > Interrupts logic similar to other existing ones. Like:
> > > > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> > > > pgarch.c ...
> > >
> > > I think the difference from them is that there is only one place to
> > > call ProcessParallelApplyInterrupts().
> > >
> >
> > But I feel it is better to isolate this code in a separate function.
> > What if we decide to extend it further by having some logic to stop
> > workers after reloading of config?
>
> I think we can separate the function at that time. But let's keep the
> current code as you and Hou agree with the current code. I'm not going
> to insist on that.
>
> >
> > > >
> > > > > ---
> > > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > > options.proto.logical.proto_version =
> > > > > +                server_version >= 160000 ?
> > > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > > >         server_version >= 150000 ?
> > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > > >         server_version >= 140000 ?
> > > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > > >         LOGICALREP_PROTO_VERSION_NUM;
> > > > >
> > > > > Instead of always using the new protocol version, I think we can use
> > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > > 'parallel'. That way, we don't need to change protocl version check
> > > > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > > > What do you think?
> > > >
> > > > I think that some user can also use the new version number when trying to get
> > > > changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> > > > leave the check for new version number seems fine.
> > > >
> > > > Besides, I feel even if we don't use new version number, we still need to use
> > > > defGetStreamingMode to check if parallel mode in used as we need to send
> > > > abort_lsn when parallel is in used. I might be missing something, sorry for
> > > > that. Can you please explain the idea a bit ?
> > >
> > > My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> > > (server_version >= 160000 && MySubscription->stream ==
> > > SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> > > 160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
> > > version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> > > to send "streaming = parallel" to the publisher since the publisher
> > > can decide whether or not to send abort_lsn based on the protocol
> > > version (still needs to send "streaming = on" though). I might be
> > > missing something.
> > >
> >
> > What if we decide to send some more additional information as part of
> > another patch like we are discussing in the thread [1]? Now, we won't
> > be able to decide the version number based on just the streaming
> > option. Also, in such a case, even for
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
> > idea to send additional abort information unless the user has used the
> > streaming=parallel option.
>
> If we're going to send the additional information, it makes sense to
> send streaming=parallel. But the next question came to me is why do we
> need to increase the protocol version for parallel apply feature? If
> sending the additional information is also controlled by an option
> like "streaming", we can decide what we send based on these options,
> no?
>

AFAIK the protocol version defines what protocol message bytes are
transmitted on the wire. So I thought the protocol version should
*always* be updated whenever the message format changes. In other
words, I don't think we ought to be transmitting different protocol
message formats unless it is a different protocol version.

Whether the pub/sub implementation actually needs to check that
protocol version, or whether we happen to have some alternative knob we
can check, doesn't change what the protocol version is supposed to
mean. And the PGDOCS [1] and [2] currently have clear field notes
about when those fields are present (e.g. "This field is available
since protocol version XXX"), but if, hypothetically, you don't change
the protocol version for some new fields, then the message format
becomes tied to the built-in implementation of pub/sub -- what field
note would you write instead to explain that?

------
[1] https://www.postgresql.org/docs/current/protocol-logical-replication.html
[2] https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html

Kind Regards,
Peter Smith.
Fujitsu Australia.



On Fri, Dec 9, 2022 at 7:45 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Dec 8, 2022 at 7:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > >
> > > > > > +static void
> > > > > > +ProcessParallelApplyInterrupts(void)
> > > > > > +{
> > > > > > +        CHECK_FOR_INTERRUPTS();
> > > > > > +
> > > > > > +        if (ShutdownRequestPending)
> > > > > > +        {
> > > > > > +                ereport(LOG,
> > > > > > +                                (errmsg("logical replication parallel
> > > > > > apply worker for subscrip
> > > > > > tion \"%s\" has finished",
> > > > > > +                                                MySubscription->name)));
> > > > > > +
> > > > > > +                apply_worker_clean_exit(false);
> > > > > > +        }
> > > > > > +
> > > > > > +        if (ConfigReloadPending)
> > > > > > +        {
> > > > > > +                ConfigReloadPending = false;
> > > > > > +                ProcessConfigFile(PGC_SIGHUP);
> > > > > > +        }
> > > > > > +}
> > > > > >
> > > > > > I personally think that we don't need to have a function to do only
> > > > > > these few things.
> > > > >
> > > > > I thought that introduce a new function make the handling of worker specific
> > > > > Interrupts logic similar to other existing ones. Like:
> > > > > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> > > > > pgarch.c ...
> > > >
> > > > I think the difference from them is that there is only one place to
> > > > call ProcessParallelApplyInterrupts().
> > > >
> > >
> > > But I feel it is better to isolate this code in a separate function.
> > > What if we decide to extend it further by having some logic to stop
> > > workers after reloading of config?
> >
> > I think we can separate the function at that time. But let's keep the
> > current code as you and Hou agree with the current code. I'm not going
> > to insist on that.
> >
> > >
> > > > >
> > > > > > ---
> > > > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > > > options.proto.logical.proto_version =
> > > > > > +                server_version >= 160000 ?
> > > > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > > > >         server_version >= 150000 ?
> > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > > > >         server_version >= 140000 ?
> > > > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > > > >         LOGICALREP_PROTO_VERSION_NUM;
> > > > > >
> > > > > > Instead of always using the new protocol version, I think we can use
> > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > > > 'parallel'. That way, we don't need to change protocl version check
> > > > > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > > > > What do you think?
> > > > >
> > > > > I think that some user can also use the new version number when trying to get
> > > > > changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> > > > > leave the check for new version number seems fine.
> > > > >
> > > > > Besides, I feel even if we don't use new version number, we still need to use
> > > > > defGetStreamingMode to check if parallel mode in used as we need to send
> > > > > abort_lsn when parallel is in used. I might be missing something, sorry for
> > > > > that. Can you please explain the idea a bit ?
> > > >
> > > > My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> > > > (server_version >= 160000 && MySubscription->stream ==
> > > > SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> > > > 160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
> > > > version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> > > > to send "streaming = parallel" to the publisher since the publisher
> > > > can decide whether or not to send abort_lsn based on the protocol
> > > > version (still needs to send "streaming = on" though). I might be
> > > > missing something.
> > > >
> > >
> > > What if we decide to send some more additional information as part of
> > > another patch like we are discussing in the thread [1]? Now, we won't
> > > be able to decide the version number based on just the streaming
> > > option. Also, in such a case, even for
> > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
> > > idea to send additional abort information unless the user has used the
> > > streaming=parallel option.
> >
> > If we're going to send the additional information, it makes sense to
> > send streaming=parallel. But the next question came to me is why do we
> > need to increase the protocol version for parallel apply feature? If
> > sending the additional information is also controlled by an option
> > like "streaming", we can decide what we send based on these options,
> > no?
> >
>
> AFAIK the protocol version defines what protocol message bytes are
> transmitted on the wire. So I thought the protocol version should
> *always* be updated whenever the message format changes. In other
> words, I don't think we ought to be transmitting different protocol
> message formats unless it is a different protocol version.
>
> Whether the pub/sub implementation actually needs to check that
> protocol version or whether we happen to have some alternative knob we
> can check doesn't change what the protocol version is supposed to
> mean. And the PGDOCS [1] and [2] currently have clear field notes
> about when those fields are present (e.g. "This field is available
> since protocol version XXX"), but if hypothetically you don't change
> the protocol version for some new fields then now the message format
> becomes tied to the built-in implementation of pub/sub -- now what
> field note will you say instead to explain that?
>

I think the protocol version acts as a backstop against sending
information that clients don't understand. The other way is to trust
the client, when it sends a particular option (say streaming = on, aka
allow sending in-progress transactions), to understand the additional
information for that feature, but even then the protocol version acts
as a backstop. As Peter mentioned, it will be easier to explain the
additional information we are sending across different versions
without relying on additional pub/sub options. Having said that, we
could send the additional required information based on just the new
option, but I felt it is better to bump the protocol version along
with it unless we see any downside to it. What do you think?

-- 
With Regards,
Amit Kapila.



On Thu, Dec 8, 2022 at 12:37 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Review comments
==============
1. Currently, we don't release the stream lock in LA (leader apply
worker) for "rollback to savepoint", and the reason is mentioned in the
comments of apply_handle_stream_abort() in the patch. But, today,
while testing, I found that this can lead to a deadlock which otherwise
won't happen on the publisher. The key point is that rollback to
savepoint releases the locks acquired by the particular subtransaction,
so the parallel apply worker should also do the same. Consider the
following example where the transaction in session-1 is being performed
by the parallel apply worker and the transaction in session-2 is being
performed by the leader apply worker. I have simulated it by using the
GUC force_stream_mode.

Publisher
==========
Session-1
postgres=# begin;
BEGIN
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# truncate t1;
TRUNCATE TABLE

Session-2
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(4);

Session-1
postgres=*# rollback to savepoint s1;
ROLLBACK

Session-2
Commit;

With or without the commit of Session-2, this scenario will lead to a
deadlock on the subscriber because PA (parallel apply worker) is
waiting for LA to send the next command, and LA is blocked by the
Exclusive lock held by PA. There is no deadlock on the publisher
because rollback to savepoint will release the lock acquired by
truncate.

To solve this, how about if we do three things before sending the abort
of the sub-transaction: (a) unlock the stream lock, (b) increment
pending_stream_count, (c) take the stream lock again?

Now, if the PA is not already waiting on the stop, it will not wait at
stream_stop but will wait after applying the abort of the
sub-transaction, and if it is already waiting at stream_stop, the wait
will be released. If this works, then we should probably try to do (b)
before (a) to match the steps with stream_start.

2. There seems to be another general problem in the way the patch
waits for stream_stop in PA (parallel apply worker). Currently, PA
checks whether there are no more pending streams and, if so, tries to
wait for the next stream by waiting on a stream lock. However, it is
possible that after PA checks there is no pending stream and before it
actually starts waiting on the lock, the LA sends another stream for
which even stream_stop is sent; in this case, PA will start waiting for
the next stream whereas there is actually a pending stream available.
Normally this only delays applying the changes, but for the case
mentioned in the previous point (Point 1), it can lead to a deadlock
even after we implement the solution proposed to solve it.

3. The other point to consider is that for
stream_commit/prepare/abort, in LA, we release the stream lock after
sending the message, whereas for stream_start we release it before
sending the message. I think for the former cases
(stream_commit/prepare/abort), the patch has done it like this because
pa_send_data() may need to acquire the lock again when it times out
and starts serializing, so there would be no sense in first releasing
it, then re-acquiring it, and then releasing it again. Can't we also
release the lock for stream_start after pa_send_data(), and only if it
has not switched to serialize mode?
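
A rough sketch of what that reordering might look like in the leader's
stream_start handling, assuming the first_segment flag and the
winfo->serialize_changes field used elsewhere in the patch (this only
illustrates the suggestion, it is not the patch's code):

/* Hand the STREAM_START message to the parallel apply worker. */
pa_send_data(winfo, s->len, s->data);

/*
 * Release the stream lock only if pa_send_data() did not time out and
 * switch this transaction to serializing changes to a file; in that
 * case the lock handling is done by the serialization path instead.
 */
if (!winfo->serialize_changes)
{
        if (!first_segment)
                pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
}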

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Dec 9, 2022 at 3:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 9, 2022 at 7:45 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Dec 8, 2022 at 7:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Wed, Dec 7, 2022 at 10:03 PM houzj.fnst@fujitsu.com
> > > > > <houzj.fnst@fujitsu.com> wrote:
> > > > > >
> > > > > >
> > > > > > > +static void
> > > > > > > +ProcessParallelApplyInterrupts(void)
> > > > > > > +{
> > > > > > > +        CHECK_FOR_INTERRUPTS();
> > > > > > > +
> > > > > > > +        if (ShutdownRequestPending)
> > > > > > > +        {
> > > > > > > +                ereport(LOG,
> > > > > > > +                                (errmsg("logical replication parallel
> > > > > > > apply worker for subscrip
> > > > > > > tion \"%s\" has finished",
> > > > > > > +                                                MySubscription->name)));
> > > > > > > +
> > > > > > > +                apply_worker_clean_exit(false);
> > > > > > > +        }
> > > > > > > +
> > > > > > > +        if (ConfigReloadPending)
> > > > > > > +        {
> > > > > > > +                ConfigReloadPending = false;
> > > > > > > +                ProcessConfigFile(PGC_SIGHUP);
> > > > > > > +        }
> > > > > > > +}
> > > > > > >
> > > > > > > I personally think that we don't need to have a function to do only
> > > > > > > these few things.
> > > > > >
> > > > > > I thought that introduce a new function make the handling of worker specific
> > > > > > Interrupts logic similar to other existing ones. Like:
> > > > > > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> > > > > > pgarch.c ...
> > > > >
> > > > > I think the difference from them is that there is only one place to
> > > > > call ProcessParallelApplyInterrupts().
> > > > >
> > > >
> > > > But I feel it is better to isolate this code in a separate function.
> > > > What if we decide to extend it further by having some logic to stop
> > > > workers after reloading of config?
> > >
> > > I think we can separate the function at that time. But let's keep the
> > > current code as you and Hou agree with the current code. I'm not going
> > > to insist on that.
> > >
> > > >
> > > > > >
> > > > > > > ---
> > > > > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > > > > options.proto.logical.proto_version =
> > > > > > > +                server_version >= 160000 ?
> > > > > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > > > > >         server_version >= 150000 ?
> > > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > > > > >         server_version >= 140000 ?
> > > > > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > > > > >         LOGICALREP_PROTO_VERSION_NUM;
> > > > > > >
> > > > > > > Instead of always using the new protocol version, I think we can use
> > > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > > > > 'parallel'. That way, we don't need to change protocl version check
> > > > > > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > > > > > What do you think?
> > > > > >
> > > > > > I think that some user can also use the new version number when trying to get
> > > > > > changes (via pg_logical_slot_peek_binary_changes or other functions), so I feel
> > > > > > leave the check for new version number seems fine.
> > > > > >
> > > > > > Besides, I feel even if we don't use new version number, we still need to use
> > > > > > defGetStreamingMode to check if parallel mode in used as we need to send
> > > > > > abort_lsn when parallel is in used. I might be missing something, sorry for
> > > > > > that. Can you please explain the idea a bit ?
> > > > >
> > > > > My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> > > > > (server_version >= 160000 && MySubscription->stream ==
> > > > > SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> > > > > 160000. That way, in pgoutput.c, we can send abort_lsn if the protocol
> > > > > version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> > > > > to send "streaming = parallel" to the publisher since the publisher
> > > > > can decide whether or not to send abort_lsn based on the protocol
> > > > > version (still needs to send "streaming = on" though). I might be
> > > > > missing something.
> > > > >
> > > >
> > > > What if we decide to send some more additional information as part of
> > > > another patch like we are discussing in the thread [1]? Now, we won't
> > > > be able to decide the version number based on just the streaming
> > > > option. Also, in such a case, even for
> > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
> > > > idea to send additional abort information unless the user has used the
> > > > streaming=parallel option.
> > >
> > > If we're going to send the additional information, it makes sense to
> > > send streaming=parallel. But the next question came to me is why do we
> > > need to increase the protocol version for parallel apply feature? If
> > > sending the additional information is also controlled by an option
> > > like "streaming", we can decide what we send based on these options,
> > > no?
> > >
> >
> > AFAIK the protocol version defines what protocol message bytes are
> > transmitted on the wire. So I thought the protocol version should
> > *always* be updated whenever the message format changes. In other
> > words, I don't think we ought to be transmitting different protocol
> > message formats unless it is a different protocol version.
> >
> > Whether the pub/sub implementation actually needs to check that
> > protocol version or whether we happen to have some alternative knob we
> > can check doesn't change what the protocol version is supposed to
> > mean. And the PGDOCS [1] and [2] currently have clear field notes
> > about when those fields are present (e.g. "This field is available
> > since protocol version XXX"), but if hypothetically you don't change
> > the protocol version for some new fields then now the message format
> > becomes tied to the built-in implementation of pub/sub -- now what
> > field note will you say instead to explain that?
> >
>
> I think the protocol version acts as a backstop to not send some
> information which clients don't understand. Now, the other way is to
> believe the client when it sends a particular option (say streaming =
> on (aka allow sending in-progress transactions)) that it will
> understand additional information for that feature but the protocol
> version acts as a backstop in that case.

Yeah, it seems that this is how the logical replication protocol has
been working. New logical replication protocol versions keep backward
compatibility. I was thinking that the protocol version needs to be
bumped only if there is no compatibility, i.e. if most clients need to
change to support the new protocol.

> As Peter mentioned, it will
> be easier to explain the additional information we are sending across
> different versions without relying on additional options for pub/sub.
> Having said that, we can send additional required information based on
> just the new option but I felt it is better to bump the protocol
> version along with it unless we see any downside to it. What do you
> think?

I agree to bump the protocol version.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, December 9, 2022 3:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Dec 8, 2022 at 12:37 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> 
> Review comments

Thanks for the comments!

> ==============
> 1. Currently, we don't release the stream lock in LA (leade apply
> worker) for "rollback to savepoint" and the reason is mentioned in comments of
> apply_handle_stream_abort() in the patch. But, today, while testing, I found that
> can lead to deadlock which otherwise, won't happen on the publisher. The key
> point is rollback to savepoint releases the locks acquired by the particular
> subtransaction, so parallel apply worker should also do the same. Consider the
> following example where the transaction in session-1 is being performed by the
> parallel apply worker and the transaction in session-2 is being performed by the
> leader apply worker. I have simulated it by using GUC force_stream_mode.
> Publisher
> ==========
> Session-1
> postgres=# begin;
> BEGIN
> postgres=*# savepoint s1;
> SAVEPOINT
> postgres=*# truncate t1;
> TRUNCATE TABLE
> 
> Session-2
> postgres=# begin;
> BEGIN
> postgres=*# insert into t1 values(4);
> 
> Session-1
> postgres=*# rollback to savepoint s1;
> ROLLBACK
> 
> Session-2
> Commit;
> 
> With or without commit of Session-2, this scenario will lead to deadlock on the
> subscriber because PA (parallel apply worker) is waiting for LA to send the next
> command, and LA is blocked by Exclusive of PA. There is no deadlock on the
> publisher because rollback to savepoint will release the lock acquired by
> truncate.
> 
> To solve this, How about if we do three things before sending abort of
> sub-transaction (a) unlock the stream lock, (b) increment pending_stream_count,
> (c) take the stream lock again?
> 
> Now, if the PA is not already waiting on the stop, it will not wait at stream_stop
> but will wait after applying abort of sub-transaction and if it is already waiting at
> stream_stop, the wait will be released. If this works then probably we should try
> to do (b) before (a) to match the steps with stream_start.

The solution works for me; I have changed the code as suggested.


> 2. There seems to be another general problem in the way the patch waits for
> stream_stop in PA (parallel apply worker). Currently, PA checks, if there are no
> more pending streams then it tries to wait for the next stream by waiting on a
> stream lock. However, it is possible after PA checks there is no pending stream
> and before it actually starts waiting on a lock, the LA sends another stream for
> which even stream_stop is sent, in this case, PA will start waiting for the next
> stream whereas there is actually a pending stream available. In this case, it won't
> lead to any problem apart from delay in applying the changes in such cases but
> for the case mentioned in the previous point (Pont 1), it can lead to deadlock
> even after we implement the solution proposed to solve it.

Thanks for reporting. I have introduced another flag in shared memory
and used it to prevent the leader from incrementing the
pending_stream_count while the parallel apply worker is trying to take
the stream lock.


> 3. The other point to consider is that for stream_commit/prepare/abort, in LA, we
> release the stream lock after sending the message whereas for stream_start we
> release it before sending the message. I think for the earlier cases
> (stream_commit/prepare/abort), the patch has done like this because
> pa_send_data() may need to require the lock again when it times out and start
> serializing, so there will be no sense in first releasing it, then re-acquiring it, and
> then again releasing it. Can't we also release the lock for stream_start after
> pa_send_data() only if it is not switched to serialize mode?

Changed.

Attached is the new version patch set which addresses the above
comments. Besides, the new version patch will try to stop extra
parallel workers if the user sets
max_parallel_apply_workers_per_subscription to a lower number.

Best regards,
Hou zj

Attachment


FYI - a rebase is needed.

This patch is currently failing in cfbot [1], probably due to recent
logical replication documentation updates [2].

------
[1] cfbot failing for v59 - http://cfbot.cputube.org/patch_41_3621.log
[2] PGDOCS updated -
https://github.com/postgres/postgres/commit/a8500750ca0acf6bb95cf9d1ac7f421749b22db7

Kind Regards,
Peter Smith.
Fujitsu Australia



Some minor review comments for v58-0001

======

.../replication/logical/applyparallelworker.c

1. pa_can_start

+ /*
+ * Don't start a new parallel worker if user has set skiplsn as it's
+ * possible that user want to skip the streaming transaction. For streaming
+ * transaction, we need to serialize the transaction to a file so that we
+ * can get the last LSN of the transaction to judge whether to skip before
+ * starting to apply the change.
+ */
+ if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
+ return false;


"that user want" -> "that they want"

"For streaming transaction," -> "For streaming transactions,"

~~~

2. pa_free_worker_info

+ /* Remove from the worker pool. */
+ ParallelApplyWorkerPool = list_delete_ptr(ParallelApplyWorkerPool,
+    winfo);

Unnecessary wrapping

~~~

3. pa_set_stream_apply_worker

+/*
+ * Set the worker that required to apply the current streaming transaction.
+ */
+void
+pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
+{
+ stream_apply_worker = winfo;
+}

Comment wording seems wrong.

======

src/include/replication/worker_internal.h

4. ParallelApplyWorkerShared

+ * XactLastCommitEnd from the parallel apply worker. This is required to
+ * update the lsn_mappings by leader worker.
+ */
+ XLogRecPtr last_commit_end;
+} ParallelApplyWorkerShared;


"This is required to update the lsn_mappings by leader worker." -->
did you mean "This is required by the leader worker so it can update
the lsn_mappings." ??

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Dec 13, 2022 at 4:36 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> ~~~
>
> 3. pa_set_stream_apply_worker
>
> +/*
> + * Set the worker that required to apply the current streaming transaction.
> + */
> +void
> +pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
> +{
> + stream_apply_worker = winfo;
> +}
>
> Comment wording seems wrong.
>

I think something like "Cache the parallel apply worker information."
may be more suitable here.

A few more similar cosmetic comments:
1.
+ /*
+ * Unlock the shared object lock so that the parallel apply worker
+ * can continue to receive changes.
+ */
+ if (!first_segment)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);

This comment is missing in the new (0002) patch.

2.
+ if (!winfo->serialize_changes)
+ {
+ if (!first_segment)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);

I think we should write some comments on why we are not unlocking when
serializing changes.

3. Please add a comment like the one below in the patch to make it
clear why, in the stream_abort case, we perform locking before sending
the message.
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1858,6 +1858,13 @@ apply_handle_stream_abort(StringInfo s)
                         * worker will wait on the lock for the next set of
                         * changes after processing the STREAM_ABORT message
                         * if it is not already waiting for STREAM_STOP message.
+                        *
+                        * It is important to perform this locking before
+                        * sending the STREAM_ABORT message so that the leader
+                        * can hold the lock first and the parallel apply
+                        * worker will wait for the leader to release the lock.
+                        * This is the same as what we do in
+                        * apply_handle_stream_stop. See Locking Considerations
+                        * atop applyparallelworker.c.
                         */
                        if (!toplevel_xact)

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, December 13, 2022 6:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Dec 13, 2022 at 4:36 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > ~~~
> >
> > 3. pa_set_stream_apply_worker
> >
> > +/*
> > + * Set the worker that required to apply the current streaming transaction.
> > + */
> > +void
> > +pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo) {
> > +stream_apply_worker = winfo; }
> >
> > Comment wording seems wrong.
> >
> 
> I think something like "Cache the parallel apply worker information."
> may be more suitable here.

Changed.

> Few more similar cosmetic comments:
> 1.
> + /*
> + * Unlock the shared object lock so that the parallel apply worker
> + * can continue to receive changes.
> + */
> + if (!first_segment)
> + pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
> 
> This comment is missing in the new (0002) patch.

Added.

> 2.
> + if (!winfo->serialize_changes)
> + {
> + if (!first_segment)
> + pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
> 
> I think we should write some comments on why we are not unlocking when
> serializing changes.

Added.

> 3. Please add a comment like below in the patch to make it clear why in
> stream_abort case we perform locking before sending the message.
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -1858,6 +1858,13 @@ apply_handle_stream_abort(StringInfo s)
>                          * worker will wait on the lock for the next set of
> changes after
>                          * processing the STREAM_ABORT message if it is not
> already waiting
>                          * for STREAM_STOP message.
> +                        *
> +                        * It is important to perform this locking
> before sending the
> +                        * STREAM_ABORT message so that the leader can
> hold the lock first
> +                        * and the parallel apply worker will wait for
> the leader to release
> +                        * the lock. This is the same as what we do in
> +                        * apply_handle_stream_stop. See Locking
> Considerations atop
> +                        * applyparallelworker.c.
>                          */
>                         if (!toplevel_xact)

Merged.

Attached is the new version patch which addresses the above comments.
I also slightly refactored the logic related to pa_spooled_messages()
so that it doesn't need to wait for a timeout if there are pending
spooled messages.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tue, Dec 13, 2022 7:06 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Some minor review comments for v58-0001

Thanks for your comments.

> ======
> 
> .../replication/logical/applyparallelworker.c
> 
> 1. pa_can_start
> 
> + /*
> + * Don't start a new parallel worker if user has set skiplsn as it's
> + * possible that user want to skip the streaming transaction. For 
> + streaming
> + * transaction, we need to serialize the transaction to a file so 
> + that we
> + * can get the last LSN of the transaction to judge whether to skip 
> + before
> + * starting to apply the change.
> + */
> + if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
> + return false;
> 
> 
> "that user want" -> "that they want"
> 
> "For streaming transaction," -> "For streaming transactions,"

Changed.

> ~~~
> 
> 2. pa_free_worker_info
> 
> + /* Remove from the worker pool. */
> + ParallelApplyWorkerPool = list_delete_ptr(ParallelApplyWorkerPool,
> +    winfo);
> 
> Unnecessary wrapping

Changed.

> ~~~
> 
> 3. pa_set_stream_apply_worker
> 
> +/*
> + * Set the worker that required to apply the current streaming transaction.
> + */
> +void
> +pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo) {  
> +stream_apply_worker = winfo; }
> 
> Comment wording seems wrong.

Tried to improve this comment.

> ======
> 
> src/include/replication/worker_internal.h
> 
> 4. ParallelApplyWorkerShared
> 
> + * XactLastCommitEnd from the parallel apply worker. This is required 
> +to
> + * update the lsn_mappings by leader worker.
> + */
> + XLogRecPtr last_commit_end;
> +} ParallelApplyWorkerShared;
> 
> 
> "This is required to update the lsn_mappings by leader worker." --> 
> did you mean "This is required by the leader worker so it can update 
> the lsn_mappings." ??

Changed.

Also, thanks for the kind reminder in [1]; I have rebased the patch
set. Attached is the new patch set.

[1] - https://www.postgresql.org/message-id/CAHut%2BPt4qv7xfJUmwdn6Vy47L5mqzKtkPr31%3DDmEayJWXetvYg%40mail.gmail.com

Best regards,
Hou zj

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Sun, Dec 11, 2022 at 8:45 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, December 9, 2022 3:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Dec 8, 2022 at 12:37 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > Review comments
>
> Thanks for the comments!
>
> > ==============
> > 1. Currently, we don't release the stream lock in LA (leade apply
> > worker) for "rollback to savepoint" and the reason is mentioned in comments of
> > apply_handle_stream_abort() in the patch. But, today, while testing, I found that
> > can lead to deadlock which otherwise, won't happen on the publisher. The key
> > point is rollback to savepoint releases the locks acquired by the particular
> > subtransaction, so parallel apply worker should also do the same. Consider the
> > following example where the transaction in session-1 is being performed by the
> > parallel apply worker and the transaction in session-2 is being performed by the
> > leader apply worker. I have simulated it by using GUC force_stream_mode.
> > Publisher
> > ==========
> > Session-1
> > postgres=# begin;
> > BEGIN
> > postgres=*# savepoint s1;
> > SAVEPOINT
> > postgres=*# truncate t1;
> > TRUNCATE TABLE
> >
> > Session-2
> > postgres=# begin;
> > BEGIN
> > postgres=*# insert into t1 values(4);
> >
> > Session-1
> > postgres=*# rollback to savepoint s1;
> > ROLLBACK
> >
> > Session-2
> > Commit;
> >
> > With or without commit of Session-2, this scenario will lead to deadlock on the
> > subscriber because PA (parallel apply worker) is waiting for LA to send the next
> > command, and LA is blocked by Exclusive of PA. There is no deadlock on the
> > publisher because rollback to savepoint will release the lock acquired by
> > truncate.
> >
> > To solve this, How about if we do three things before sending abort of
> > sub-transaction (a) unlock the stream lock, (b) increment pending_stream_count,
> > (c) take the stream lock again?
> >
> > Now, if the PA is not already waiting on the stop, it will not wait at stream_stop
> > but will wait after applying abort of sub-transaction and if it is already waiting at
> > stream_stop, the wait will be released. If this works then probably we should try
> > to do (b) before (a) to match the steps with stream_start.
>
> The solution works for me, I have changed the code as suggested.
>
>
> > 2. There seems to be another general problem in the way the patch waits for
> > stream_stop in PA (parallel apply worker). Currently, PA checks, if there are no
> > more pending streams then it tries to wait for the next stream by waiting on a
> > stream lock. However, it is possible after PA checks there is no pending stream
> > and before it actually starts waiting on a lock, the LA sends another stream for
> > which even stream_stop is sent, in this case, PA will start waiting for the next
> > stream whereas there is actually a pending stream available. In this case, it won't
> > lead to any problem apart from delay in applying the changes in such cases but
> > for the case mentioned in the previous point (Pont 1), it can lead to deadlock
> > even after we implement the solution proposed to solve it.
>
> Thanks for reporting, I have introduced another flag in shared memory and use it to
> prevent the leader from incrementing the pending_stream_count if the parallel
> apply worker is trying to lock the stream lock.
>
>
> > 3. The other point to consider is that for stream_commit/prepare/abort, in LA, we
> > release the stream lock after sending the message whereas for stream_start we
> > release it before sending the message. I think for the earlier cases
> > (stream_commit/prepare/abort), the patch has done like this because
> > pa_send_data() may need to require the lock again when it times out and start
> > serializing, so there will be no sense in first releasing it, then re-acquiring it, and
> > then again releasing it. Can't we also release the lock for stream_start after
> > pa_send_data() only if it is not switched to serialize mode?
>
> Changed.
>
> Attach the new version patch set which addressed above comments.

Here are comments on v59 0001, 0002 patches:

+void
+pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
+{
+        while (1)
+        {
+                SpinLockAcquire(&wshared->mutex);
+
+                /*
+                 * Don't try to increment the count if the parallel apply
+                 * worker is taking the stream lock. Otherwise, there would be
+                 * a race condition that the parallel apply worker checks there
+                 * is no pending streaming block and before it actually starts
+                 * waiting on a lock, the leader sends another streaming block
+                 * and take the stream lock again. In this case, the parallel
+                 * apply worker will start waiting for the next streaming block
+                 * whereas there is actually a pending streaming block
+                 * available.
+                 */
+                if (!wshared->pa_wait_for_stream)
+                {
+                        wshared->pending_stream_count++;
+                        SpinLockRelease(&wshared->mutex);
+                        break;
+                }
+
+                SpinLockRelease(&wshared->mutex);
+        }
+}

I think we should add an assertion to check that we don't already hold
the stream lock.

I think that waiting for pa_wait_for_stream to be false in a busy loop
is not a good idea. It's not interruptible and there is no guarantee
that we can break out of this loop in a short time. For instance, if PA
executes pa_decr_and_wait_stream_block() a bit earlier than LA
executes pa_increment_stream_block(), LA has to wait in a busy loop for
PA to acquire and release the stream lock. It should not be long in
normal cases, but the duration LA needs to wait depends on PA, which
could be long. Also, what if PA raises an error in pa_lock_stream() for
some reason? I think LA won't be able to detect the failure.

I think we should at least make it interruptible and maybe add some
sleep. Or perhaps we can use a condition variable for this case.
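
As one illustration of that suggestion, here is a minimal sketch (an
assumption, not the patch's code) of an interruptible variant with a
short sleep; a condition variable
(ConditionVariableSleep/ConditionVariableBroadcast) would avoid the
polling entirely, at the cost of adding one to the shared state:

void
pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
{
        for (;;)
        {
                bool    done = false;

                /* Allow cancel/terminate requests to be serviced. */
                CHECK_FOR_INTERRUPTS();

                SpinLockAcquire(&wshared->mutex);
                if (!wshared->pa_wait_for_stream)
                {
                        wshared->pending_stream_count++;
                        done = true;
                }
                SpinLockRelease(&wshared->mutex);

                if (done)
                        break;

                /* Yield briefly instead of spinning while PA takes the lock. */
                pg_usleep(1000L);
        }
}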

---
In worker.c, we have the following common pattern:

case TRANS_LEADER_PARTIAL_SERIALIZE:
    write change to the file;
    do some work;
    break;

case TRANS_LEADER_SEND_TO_PARALLEL:
    pa_send_data();

    if (winfo->serialize_changes)
    {
        do some work required after writing changes to the file.
    }
    :
    break;

IIUC there are two different paths for partial serialization: (a)
where apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
apply_action is TRANS_LEADER_SEND_TO_PARALLEL and
winfo->serialize_changes became true. And we need to match what we do
in (a) and (b). Rather than having two different paths for the same
case, how about falling through to TRANS_LEADER_PARTIAL_SERIALIZE when
we could not send the changes? That is, pa_send_data() just returns
false when the timeout is exceeded and we need to switch to serializing
changes, and otherwise returns true. If it returns false, we prepare
for switching to serialize changes, such as initializing the fileset,
and fall through to the TRANS_LEADER_PARTIAL_SERIALIZE case. The code
would be like:

case TRANS_LEADER_SEND_TO_PARALLEL:
    ret = pa_send_data();

    if (ret)
    {
        do work for sending changes to PA.
        break;
    }

    /* prepare for switching to serialize changes */
    winfo->serialize_changes = true;
    initialize fileset;
    acquire stream lock if necessary;

    /* FALLTHROUGH */
case TRANS_LEADER_PARTIAL_SERIALIZE:
    do work for serializing changes;
    break;

---
                        /*
-                        * Unlock the shared object lock so that parallel apply
-                        * worker can continue to receive and apply changes.
+                        * Parallel apply worker might have applied some
+                        * changes, so write the STREAM_ABORT message so that
+                        * it can rollback the subtransaction if needed.
                         */
-                       pa_unlock_stream(xid, AccessExclusiveLock);
+                       stream_open_and_write_change(xid,
+                                                    LOGICAL_REP_MSG_STREAM_ABORT,
+                                                    &original_msg);
+
+                       if (toplevel_xact)
+                       {
+                               pa_unlock_stream(xid, AccessExclusiveLock);
+                               pa_set_fileset_state(winfo->shared,
+                                                    FS_SERIALIZE_DONE);
+                               (void) pa_free_worker(winfo, xid);
+                       }

At every place except for the above code, we set the fileset state to
FS_SERIALIZE_DONE first and then unlock the stream lock. Is there any
reason for the different order here?

---
+               case TRANS_LEADER_SEND_TO_PARALLEL:
+                       Assert(winfo);
+
+                       /*
+                        * Unlock the shared object lock so that parallel apply
+                        * worker can continue to receive and apply changes.
+                        */
+                       pa_unlock_stream(xid, AccessExclusiveLock);
+
+                       /*
+                        * For the case of aborting the subtransaction, we
+                        * increment the number of streaming blocks and take the
+                        * lock again before sending the STREAM_ABORT to ensure
+                        * that the parallel apply worker will wait on the lock
+                        * for the next set of changes after processing the
+                        * STREAM_ABORT message if it is not already waiting for
+                        * STREAM_STOP message.
+                        */
+                       if (!toplevel_xact)
+                       {
+                               pa_increment_stream_block(winfo->shared);
+                               pa_lock_stream(xid, AccessExclusiveLock);
+                       }
+
+                       /* Send STREAM ABORT message to the parallel apply worker. */
+                       pa_send_data(winfo, s->len, s->data);
+
+                       if (toplevel_xact)
+                               (void) pa_free_worker(winfo, xid);
+
+                       break;

In apply_handle_stream_abort(), it's better to add a comment explaining
why we don't need to wait for PA to finish.


Also, given that we don't wait for PA to finish in this case, does it
really make sense to call pa_free_worker() immediately after sending
STREAM_ABORT?

---
PA acquires the transaction lock in AccessShare mode whereas LA
acquires it in AccessExclusive mode. Is it better to do the opposite?
Just like a backend process acquires a lock on its XID in Exclusive
mode, we can have PA acquire the lock on its XID in Exclusive mode,
whereas others would attempt to acquire it in Share mode in order to
wait for PA.
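
For illustration, with a hypothetical pa_lock_transaction() /
pa_unlock_transaction() pair analogous to pa_lock_stream() (the names
and call sites are assumptions, not taken from the quoted patch), the
inverted scheme could look like:

/* PA: at transaction start, take the lock on its own XID exclusively. */
pa_lock_transaction(xid, AccessExclusiveLock);

/* LA (or anyone waiting for PA to finish): block until PA releases it. */
pa_lock_transaction(xid, AccessShareLock);
pa_unlock_transaction(xid, AccessShareLock);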

---
 void
pa_lock_stream(TransactionId xid, LOCKMODE lockmode)
{
    LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
                                   PARALLEL_APPLY_LOCK_STREAM, lockmode);
}

I think since we don't need to let the caller specify an arbitrary
lock mode but only need shared and exclusive modes, we can make it
simpler by having a boolean argument, say "shared", instead of
lockmode.
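
For example, the simplified signature could look like this (a sketch of
the suggestion, reusing the body of the function quoted above):

void
pa_lock_stream(TransactionId xid, bool shared)
{
    LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
                                   PARALLEL_APPLY_LOCK_STREAM,
                                   shared ? AccessShareLock : AccessExclusiveLock);
}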

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Sun, Dec 11, 2022 at 8:45 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, December 9, 2022 3:14 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Dec 8, 2022 at 12:37 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > >
> > > Review comments
> >
> > Thanks for the comments!
> >
> > > ==============
> > > 1. Currently, we don't release the stream lock in LA (leade apply
> > > worker) for "rollback to savepoint" and the reason is mentioned in
> > > comments of
> > > apply_handle_stream_abort() in the patch. But, today, while testing,
> > > I found that can lead to deadlock which otherwise, won't happen on
> > > the publisher. The key point is rollback to savepoint releases the
> > > locks acquired by the particular subtransaction, so parallel apply
> > > worker should also do the same. Consider the following example where
> > > the transaction in session-1 is being performed by the parallel
> > > apply worker and the transaction in session-2 is being performed by the
> leader apply worker. I have simulated it by using GUC force_stream_mode.
> > > Publisher
> > > ==========
> > > Session-1
> > > postgres=# begin;
> > > BEGIN
> > > postgres=*# savepoint s1;
> > > SAVEPOINT
> > > postgres=*# truncate t1;
> > > TRUNCATE TABLE
> > >
> > > Session-2
> > > postgres=# begin;
> > > BEGIN
> > > postgres=*# insert into t1 values(4);
> > >
> > > Session-1
> > > postgres=*# rollback to savepoint s1; ROLLBACK
> > >
> > > Session-2
> > > Commit;
> > >
> > > With or without commit of Session-2, this scenario will lead to
> > > deadlock on the subscriber because PA (parallel apply worker) is
> > > waiting for LA to send the next command, and LA is blocked by
> > > Exclusive of PA. There is no deadlock on the publisher because
> > > rollback to savepoint will release the lock acquired by truncate.
> > >
> > > To solve this, How about if we do three things before sending abort
> > > of sub-transaction (a) unlock the stream lock, (b) increment
> > > pending_stream_count,
> > > (c) take the stream lock again?
> > >
> > > Now, if the PA is not already waiting on the stop, it will not wait
> > > at stream_stop but will wait after applying abort of sub-transaction
> > > and if it is already waiting at stream_stop, the wait will be
> > > released. If this works then probably we should try to do (b) before (a) to
> match the steps with stream_start.
> >
> > The solution works for me, I have changed the code as suggested.
> >
> >
> > > 2. There seems to be another general problem in the way the patch
> > > waits for stream_stop in PA (parallel apply worker). Currently, PA
> > > checks, if there are no more pending streams then it tries to wait
> > > for the next stream by waiting on a stream lock. However, it is
> > > possible after PA checks there is no pending stream and before it
> > > actually starts waiting on a lock, the LA sends another stream for
> > > which even stream_stop is sent, in this case, PA will start waiting
> > > for the next stream whereas there is actually a pending stream
> > > available. In this case, it won't lead to any problem apart from
> > > delay in applying the changes in such cases but for the case mentioned in
> the previous point (Pont 1), it can lead to deadlock even after we implement the
> solution proposed to solve it.
> >
> > Thanks for reporting, I have introduced another flag in shared memory
> > and use it to prevent the leader from incrementing the
> > pending_stream_count if the parallel apply worker is trying to lock the stream
> lock.
> >
> >
> > > 3. The other point to consider is that for
> > > stream_commit/prepare/abort, in LA, we release the stream lock after
> > > sending the message whereas for stream_start we release it before
> > > sending the message. I think for the earlier cases
> > > (stream_commit/prepare/abort), the patch has done like this because
> > > pa_send_data() may need to require the lock again when it times out
> > > and start serializing, so there will be no sense in first releasing
> > > it, then re-acquiring it, and then again releasing it. Can't we also
> > > release the lock for stream_start after
> > > pa_send_data() only if it is not switched to serialize mode?
> >
> > Changed.
> >
> > Attach the new version patch set which addressed above comments.
> 
> Here are comments on v59 0001, 0002 patches:

Thanks for the comments!

> +void
> +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> +        while (1)
> +        {
> +                SpinLockAcquire(&wshared->mutex);
> +
> +                /*
> +                 * Don't try to increment the count if the parallel
> apply worker is
> +                 * taking the stream lock. Otherwise, there would be
> a race condition
> +                 * that the parallel apply worker checks there is no
> pending streaming
> +                 * block and before it actually starts waiting on a
> lock, the leader
> +                 * sends another streaming block and take the stream
> lock again. In
> +                 * this case, the parallel apply worker will start
> waiting for the next
> +                 * streaming block whereas there is actually a
> pending streaming block
> +                 * available.
> +                 */
> +                if (!wshared->pa_wait_for_stream)
> +                {
> +                        wshared->pending_stream_count++;
> +                        SpinLockRelease(&wshared->mutex);
> +                        break;
> +                }
> +
> +                SpinLockRelease(&wshared->mutex);
> +        }
> +}
> 
> I think we should add an assertion to check if we don't hold the stream lock.
> 
> I think that waiting for pa_wait_for_stream to be false in a busy loop is not a
> good idea. It's not interruptible and there is not guarantee that we can break
> from this loop in a short time. For instance, if PA executes
> pa_decr_and_wait_stream_block() a bit earlier than LA executes
> pa_increment_stream_block(), LA has to wait for PA to acquire and release the
> stream lock in a busy loop. It should not be long in normal cases but the
> duration LA needs to wait for PA depends on PA, which could be long. Also
> what if PA raises an error in
> pa_lock_stream() due to some reasons? I think LA won't be able to detect the
> failure.
> 
> I think we should at least make it interruptible and maybe need to add some
> sleep. Or perhaps we can use the condition variable for this case.

Thanks for the analysis, I will research this part.

> ---
> In worker.c, we have the following common pattern:
> 
> case TRANS_LEADER_PARTIAL_SERIALIZE:
>     write change to the file;
>     do some work;
>     break;
> 
> case TRANS_LEADER_SEND_TO_PARALLEL:
>     pa_send_data();
> 
>     if (winfo->serialize_changes)
>     {
>         do some worker required after writing changes to the file.
>     }
>     :
>     break;
> 
> IIUC there are two different paths for partial serialization: (a) where
> apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> winfo->serialize_changes became true. And we need to match what we do
> in (a) and (b). Rather than having two different paths for the same case, how
> about falling through TRANS_LEADER_PARTIAL_SERIALIZE when we could not
> send the changes? That is, pa_send_data() just returns false when the timeout
> exceeds and we need to switch to serialize changes, otherwise returns true. If it
> returns false, we prepare for switching to serialize changes such as initializing
> fileset, and fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code
> would be like:
> 
> case TRANS_LEADER_SEND_TO_PARALLEL:
>     ret = pa_send_data();
> 
>     if (ret)
>     {
>         do work for sending changes to PA.
>         break;
>     }
> 
>     /* prepare for switching to serialize changes */
>     winfo->serialize_changes = true;
>     initialize fileset;
>     acquire stream lock if necessary;
> 
>     /* FALLTHROUGH */
> case TRANS_LEADER_PARTIAL_SERIALIZE:
>     do work for serializing changes;
>     break;

I think the suggestion is to extract the code that switches to serialize
mode out of pa_send_data(), and then we would need to add that logic to
all the functions that call pa_send_data(). I am not sure it looks better,
as it might introduce more code in each handling function.

> ---
> /*
> -                        * Unlock the shared object lock so that
> parallel apply worker can
> -                        * continue to receive and apply changes.
> +                        * Parallel apply worker might have applied
> some changes, so write
> +                        * the STREAM_ABORT message so that it can rollback
> the
> +                        * subtransaction if needed.
>  */
> -                       pa_unlock_stream(xid, AccessExclusiveLock);
> +                       stream_open_and_write_change(xid,
> LOGICAL_REP_MSG_STREAM_ABORT,
> +
>           &original_msg);
> +
> +                       if (toplevel_xact)
> +                       {
> +                               pa_unlock_stream(xid, AccessExclusiveLock);
> +                               pa_set_fileset_state(winfo->shared,
> FS_SERIALIZE_DONE);
> +                               (void) pa_free_worker(winfo, xid);
> +                       }
> 
> At every place except for the above code, we set the fileset state
> FS_SERIALIZE_DONE first then unlock the stream lock. Is there any reason for
> that?

No, I think we should make them consistent, will change this.

> ---
> +               case TRANS_LEADER_SEND_TO_PARALLEL:
> +                       Assert(winfo);
> +
> +                       /*
> +                        * Unlock the shared object lock so that
> parallel apply worker can
> +                        * continue to receive and apply changes.
> +                        */
> +                       pa_unlock_stream(xid, AccessExclusiveLock);
> +
> +                       /*
> +                        * For the case of aborting the
> subtransaction, we increment the
> +                        * number of streaming blocks and take the
> lock again before
> +                        * sending the STREAM_ABORT to ensure that the
> parallel apply
> +                        * worker will wait on the lock for the next
> set of changes after
> +                        * processing the STREAM_ABORT message if it
> is not already waiting
> +                        * for STREAM_STOP message.
> +                        */
> +                       if (!toplevel_xact)
> +                       {
> +                               pa_increment_stream_block(winfo->shared);
> +                               pa_lock_stream(xid, AccessExclusiveLock);
> +                       }
> +
> +                       /* Send STREAM ABORT message to the parallel
> apply worker. */
> +                       pa_send_data(winfo, s->len, s->data);
> +
> +                       if (toplevel_xact)
> +                               (void) pa_free_worker(winfo, xid);
> +
> +                       break;
> 
> In apply_handle_stream_abort(), it's better to add the comment why we don't
> need to wait for PA to finish.

Will add.

> 
> Also, given that we don't wait for PA to finish in this case, does it really make
> sense to call pa_free_worker() immediately after sending STREAM_ABORT?

I think it's possible that the PA finishes the ROLLBACK quickly, so the
LA can free the worker here in time.

> ---
> PA acquires the transaction lock in AccessShare mode whereas LA acquires it in
> AccessExclusiveMode. Is it better to do the opposite?
> Like a backend process acquires a lock on its XID in Exclusive mode, we can
> have PA acquire the lock on its XID in Exclusive mode whereas other attempts
> to acquire it in Share mode to wait.

Agreed, will improve.

> ---
>  void
> pa_lock_stream(TransactionId xid, LOCKMODE lockmode) {
>     LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
>                                    PARALLEL_APPLY_LOCK_STREAM,
> lockmode); }
> 
> I think since we don't need to let the caller to specify the lock mode but need
> only shared and exclusive modes, we can make it simple by having a boolean
> argument say shared instead of lockmode.

I personally think passing the lockmode would make the code clearer
than passing a Boolean value.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"shiy.fnst@fujitsu.com"
Date:
Hi,

I did some performance tests for this patch, based on the v59-0001 and v59-0002
patches.

This test used synchronous logical replication, and compared SQL execution times
before and after applying the patch.

Two cases are tested by varying logical_decoding_work_mem:
a) Bulk insert.
b) Rollback to savepoint. (Different percentages of the changes in the
transaction are rolled back.)

The test was performed ten times, and the average of the middle eight was taken.

The results are as follows. The bar charts are attached.
(The steps are the same as before.[1])

RESULT - bulk insert (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        51.655      51.694      51.262
patched                     31.104      31.234      31.711
Compare with HEAD           -39.79%     -39.58%     -38.14%

RESULT - rollback 10% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        43.908      43.358      42.874
patched                     31.924      31.343      29.102
Compare with HEAD           -27.29%     -27.71%     -32.12%

RESULT - rollback 20% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        40.561      40.599      40.015
patched                     31.562      32.116      29.680
Compare with HEAD           -22.19%     -20.89%     -25.83%

RESULT - rollback 30% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        38.092      37.756      37.142
patched                     31.631      31.236      28.783
Compare with HEAD           -16.96%     -17.27%      -22.50%

RESULT - rollback 50% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem   64kB        256kB       64MB
HEAD                        33.387      33.056      32.638
patched                     31.272      31.279      29.876
Compare with HEAD           -6.34%      -5.38%      -8.46%

(If "Compare with HEAD" is a positive number, it means worse than HEAD; if it is
a negative number, it means better than HEAD.)

Summary:
In the case of bulk insert, it takes about 30% ~ 40% less time, which looks good
to me.
In the case of rollback to savepoint, the larger the amount of data rolled back,
the smaller the improvement compared to HEAD. But since such cases won't be
common, this should be okay.

[1]
https://www.postgresql.org/message-id/OSZPR01MB63103AA97349BBB858E27DEAFD499%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Regards,
Shi yu

Attachment
On Wed, Dec 14, 2022 at 9:50 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Here are comments on v59 0001, 0002 patches:
>
> Thanks for the comments!
>
> > +void
> > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> > +        while (1)
> > +        {
> > +                SpinLockAcquire(&wshared->mutex);
> > +
> > +                /*
> > +                 * Don't try to increment the count if the parallel
> > apply worker is
> > +                 * taking the stream lock. Otherwise, there would be
> > a race condition
> > +                 * that the parallel apply worker checks there is no
> > pending streaming
> > +                 * block and before it actually starts waiting on a
> > lock, the leader
> > +                 * sends another streaming block and take the stream
> > lock again. In
> > +                 * this case, the parallel apply worker will start
> > waiting for the next
> > +                 * streaming block whereas there is actually a
> > pending streaming block
> > +                 * available.
> > +                 */
> > +                if (!wshared->pa_wait_for_stream)
> > +                {
> > +                        wshared->pending_stream_count++;
> > +                        SpinLockRelease(&wshared->mutex);
> > +                        break;
> > +                }
> > +
> > +                SpinLockRelease(&wshared->mutex);
> > +        }
> > +}
> >
> > I think we should add an assertion to check if we don't hold the stream lock.
> >
> > I think that waiting for pa_wait_for_stream to be false in a busy loop is not a
> > good idea. It's not interruptible and there is not guarantee that we can break
> > from this loop in a short time. For instance, if PA executes
> > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > pa_increment_stream_block(), LA has to wait for PA to acquire and release the
> > stream lock in a busy loop. It should not be long in normal cases but the
> > duration LA needs to wait for PA depends on PA, which could be long. Also
> > what if PA raises an error in
> > pa_lock_stream() due to some reasons? I think LA won't be able to detect the
> > failure.
> >
> > I think we should at least make it interruptible and maybe need to add some
> > sleep. Or perhaps we can use the condition variable for this case.
>

Or we can leave out this while (true) logic altogether for the first
version and just have a comment to explain this race. Anyway, after
restarting, it will probably be resolved. We can always change this part
of the code later if it really turns out to be problematic.
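
For example, dropping the loop would leave the function looking roughly
like the following (a sketch under that assumption; the final wording of
the comment is up to the patch):

void
pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
{
    /*
     * A race remains here: the parallel apply worker may decide to wait
     * for the next stream just as the leader sends one more.  That only
     * delays the apply, and the problematic rollback-to-savepoint case
     * is eventually resolved by a restart, so we accept it for now.
     */
    SpinLockAcquire(&wshared->mutex);
    wshared->pending_stream_count++;
    SpinLockRelease(&wshared->mutex);
}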

> Thanks for the analysis, I will research this part.
>
> > ---
> > In worker.c, we have the following common pattern:
> >
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> >     write change to the file;
> >     do some work;
> >     break;
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> >     pa_send_data();
> >
> >     if (winfo->serialize_changes)
> >     {
> >         do some worker required after writing changes to the file.
> >     }
> >     :
> >     break;
> >
> > IIUC there are two different paths for partial serialization: (a) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> > winfo->serialize_changes became true. And we need to match what we do
> > in (a) and (b). Rather than having two different paths for the same case, how
> > about falling through TRANS_LEADER_PARTIAL_SERIALIZE when we could not
> > send the changes? That is, pa_send_data() just returns false when the timeout
> > exceeds and we need to switch to serialize changes, otherwise returns true. If it
> > returns false, we prepare for switching to serialize changes such as initializing
> > fileset, and fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code
> > would be like:
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> >     ret = pa_send_data();
> >
> >     if (ret)
> >     {
> >         do work for sending changes to PA.
> >         break;
> >     }
> >
> >     /* prepare for switching to serialize changes */
> >     winfo->serialize_changes = true;
> >     initialize fileset;
> >     acquire stream lock if necessary;
> >
> >     /* FALLTHROUGH */
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> >     do work for serializing changes;
> >     break;
>
> I think that the suggestion is to extract the code that switch to serialize
> mode out of the pa_send_data(), and then we need to add that logic in all the
> functions which call pa_send_data(), I am not sure if it looks better as it
> might introduce some more codes in each handling function.
>

How about extracting the common code from apply_handle_stream_commit
and apply_handle_stream_prepare to a separate function say
pa_xact_finish_common()? I see there is a lot of common code (unlock
the stream, wait for the finish, store flush location, free worker
info) in both the functions for TRANS_LEADER_PARTIAL_SERIALIZE and
TRANS_LEADER_SEND_TO_PARALLEL cases.
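
For reference, a rough sketch of such a helper, using the functions already
present in the patch (the flush-location handling is elided and the real
factoring may differ):

static void
pa_xact_finish_common(ParallelApplyWorkerInfo *winfo, TransactionId xid)
{
    /* Unlock the stream lock so the parallel apply worker can proceed. */
    pa_unlock_stream(xid, AccessExclusiveLock);

    /* Wait for the parallel apply worker to finish the transaction. */
    pa_wait_for_xact_finish(winfo);

    /* Store the flush location for later feedback (details elided). */

    /* Return the worker to the pool, or stop it. */
    (void) pa_free_worker(winfo, xid);
}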

>
> > ---
> >  void
> > pa_lock_stream(TransactionId xid, LOCKMODE lockmode) {
> >     LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
> >                                    PARALLEL_APPLY_LOCK_STREAM,
> > lockmode); }
> >
> > I think since we don't need to let the caller to specify the lock mode but need
> > only shared and exclusive modes, we can make it simple by having a boolean
> > argument say shared instead of lockmode.
>
> I personally think passing the lockmode would make the code more clear
> than passing a Boolean value.
>

+1.

I have made a few changes in the newly added comments and function
name in the attached patch. Kindly include this if you find the
changes okay.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Dec 14, 2022 at 1:20 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Dec 11, 2022 at 8:45 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Friday, December 9, 2022 3:14 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 8, 2022 at 12:37 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > >
> > > > Review comments
> > >
> > > Thanks for the comments!
> > >
> > > > ==============
> > > > 1. Currently, we don't release the stream lock in LA (leade apply
> > > > worker) for "rollback to savepoint" and the reason is mentioned in
> > > > comments of
> > > > apply_handle_stream_abort() in the patch. But, today, while testing,
> > > > I found that can lead to deadlock which otherwise, won't happen on
> > > > the publisher. The key point is rollback to savepoint releases the
> > > > locks acquired by the particular subtransaction, so parallel apply
> > > > worker should also do the same. Consider the following example where
> > > > the transaction in session-1 is being performed by the parallel
> > > > apply worker and the transaction in session-2 is being performed by the
> > leader apply worker. I have simulated it by using GUC force_stream_mode.
> > > > Publisher
> > > > ==========
> > > > Session-1
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# savepoint s1;
> > > > SAVEPOINT
> > > > postgres=*# truncate t1;
> > > > TRUNCATE TABLE
> > > >
> > > > Session-2
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# insert into t1 values(4);
> > > >
> > > > Session-1
> > > > postgres=*# rollback to savepoint s1; ROLLBACK
> > > >
> > > > Session-2
> > > > Commit;
> > > >
> > > > With or without commit of Session-2, this scenario will lead to
> > > > deadlock on the subscriber because PA (parallel apply worker) is
> > > > waiting for LA to send the next command, and LA is blocked by
> > > > Exclusive of PA. There is no deadlock on the publisher because
> > > > rollback to savepoint will release the lock acquired by truncate.
> > > >
> > > > To solve this, How about if we do three things before sending abort
> > > > of sub-transaction (a) unlock the stream lock, (b) increment
> > > > pending_stream_count,
> > > > (c) take the stream lock again?
> > > >
> > > > Now, if the PA is not already waiting on the stop, it will not wait
> > > > at stream_stop but will wait after applying abort of sub-transaction
> > > > and if it is already waiting at stream_stop, the wait will be
> > > > released. If this works then probably we should try to do (b) before (a) to
> > match the steps with stream_start.
> > >
> > > The solution works for me, I have changed the code as suggested.
> > >
> > >
> > > > 2. There seems to be another general problem in the way the patch
> > > > waits for stream_stop in PA (parallel apply worker). Currently, PA
> > > > checks, if there are no more pending streams then it tries to wait
> > > > for the next stream by waiting on a stream lock. However, it is
> > > > possible after PA checks there is no pending stream and before it
> > > > actually starts waiting on a lock, the LA sends another stream for
> > > > which even stream_stop is sent, in this case, PA will start waiting
> > > > for the next stream whereas there is actually a pending stream
> > > > available. In this case, it won't lead to any problem apart from
> > > > delay in applying the changes in such cases but for the case mentioned in
> > the previous point (Pont 1), it can lead to deadlock even after we implement the
> > solution proposed to solve it.
> > >
> > > Thanks for reporting, I have introduced another flag in shared memory
> > > and use it to prevent the leader from incrementing the
> > > pending_stream_count if the parallel apply worker is trying to lock the stream
> > lock.
> > >
> > >
> > > > 3. The other point to consider is that for
> > > > stream_commit/prepare/abort, in LA, we release the stream lock after
> > > > sending the message whereas for stream_start we release it before
> > > > sending the message. I think for the earlier cases
> > > > (stream_commit/prepare/abort), the patch has done like this because
> > > > pa_send_data() may need to require the lock again when it times out
> > > > and start serializing, so there will be no sense in first releasing
> > > > it, then re-acquiring it, and then again releasing it. Can't we also
> > > > release the lock for stream_start after
> > > > pa_send_data() only if it is not switched to serialize mode?
> > >
> > > Changed.
> > >
> > > Attach the new version patch set which addressed above comments.
> >
> > Here are comments on v59 0001, 0002 patches:
>
> Thanks for the comments!
>
> > +void
> > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> > +        while (1)
> > +        {
> > +                SpinLockAcquire(&wshared->mutex);
> > +
> > +                /*
> > +                 * Don't try to increment the count if the parallel
> > apply worker is
> > +                 * taking the stream lock. Otherwise, there would be
> > a race condition
> > +                 * that the parallel apply worker checks there is no
> > pending streaming
> > +                 * block and before it actually starts waiting on a
> > lock, the leader
> > +                 * sends another streaming block and take the stream
> > lock again. In
> > +                 * this case, the parallel apply worker will start
> > waiting for the next
> > +                 * streaming block whereas there is actually a
> > pending streaming block
> > +                 * available.
> > +                 */
> > +                if (!wshared->pa_wait_for_stream)
> > +                {
> > +                        wshared->pending_stream_count++;
> > +                        SpinLockRelease(&wshared->mutex);
> > +                        break;
> > +                }
> > +
> > +                SpinLockRelease(&wshared->mutex);
> > +        }
> > +}
> >
> > I think we should add an assertion to check if we don't hold the stream lock.
> >
> > I think that waiting for pa_wait_for_stream to be false in a busy loop is not a
> > good idea. It's not interruptible and there is not guarantee that we can break
> > from this loop in a short time. For instance, if PA executes
> > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > pa_increment_stream_block(), LA has to wait for PA to acquire and release the
> > stream lock in a busy loop. It should not be long in normal cases but the
> > duration LA needs to wait for PA depends on PA, which could be long. Also
> > what if PA raises an error in
> > pa_lock_stream() due to some reasons? I think LA won't be able to detect the
> > failure.
> >
> > I think we should at least make it interruptible and maybe need to add some
> > sleep. Or perhaps we can use the condition variable for this case.
>
> Thanks for the analysis, I will research this part.
>
> > ---
> > In worker.c, we have the following common pattern:
> >
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> >     write change to the file;
> >     do some work;
> >     break;
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> >     pa_send_data();
> >
> >     if (winfo->serialize_changes)
> >     {
> >         do some worker required after writing changes to the file.
> >     }
> >     :
> >     break;
> >
> > IIUC there are two different paths for partial serialization: (a) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> > winfo->serialize_changes became true. And we need to match what we do
> > in (a) and (b). Rather than having two different paths for the same case, how
> > about falling through TRANS_LEADER_PARTIAL_SERIALIZE when we could not
> > send the changes? That is, pa_send_data() just returns false when the timeout
> > exceeds and we need to switch to serialize changes, otherwise returns true. If it
> > returns false, we prepare for switching to serialize changes such as initializing
> > fileset, and fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code
> > would be like:
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> >     ret = pa_send_data();
> >
> >     if (ret)
> >     {
> >         do work for sending changes to PA.
> >         break;
> >     }
> >
> >     /* prepare for switching to serialize changes */
> >     winfo->serialize_changes = true;
> >     initialize fileset;
> >     acquire stream lock if necessary;
> >
> >     /* FALLTHROUGH */
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> >     do work for serializing changes;
> >     break;
>
> I think that the suggestion is to extract the code that switch to serialize
> mode out of the pa_send_data(), and then we need to add that logic in all the
> functions which call pa_send_data(), I am not sure if it looks better as it
> might introduce some more codes in each handling function.

I think we can have a common function to prepare for switching to
serialize changes. With the current code, I'm concerned that we would
have to check that what we do in both cases matches whenever we change
the code for the partial serialization case.
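
Something like the following, perhaps (a sketch only; the helper name and
the elided fileset handling are placeholders, not the patch's code):

static void
pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo, bool stream_locked)
{
    /* From now on, further changes for this transaction go to a file. */
    winfo->serialize_changes = true;

    /* Initialize the fileset and advertise its state (details elided). */

    /*
     * Take the stream lock, if not already held, so that the parallel
     * apply worker waits while the leader writes the pending changes.
     */
    if (!stream_locked)
        pa_lock_stream(winfo->shared->xid, AccessExclusiveLock);
}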

> > ---
> >  void
> > pa_lock_stream(TransactionId xid, LOCKMODE lockmode) {
> >     LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
> >                                    PARALLEL_APPLY_LOCK_STREAM,
> > lockmode); }
> >
> > I think since we don't need to let the caller to specify the lock mode but need
> > only shared and exclusive modes, we can make it simple by having a boolean
> > argument say shared instead of lockmode.
>
> I personally think passing the lockmode would make the code more clear
> than passing a Boolean value.

Okay, agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, December 14, 2022 2:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> 
> On Wed, Dec 14, 2022 at 9:50 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > Here are comments on v59 0001, 0002 patches:
> >
> > Thanks for the comments!
> >
> > > +void
> > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> > > +        while (1)
> > > +        {
> > > +                SpinLockAcquire(&wshared->mutex);
> > > +
> > > +                /*
> > > +                 * Don't try to increment the count if the parallel
> > > apply worker is
> > > +                 * taking the stream lock. Otherwise, there would
> > > + be
> > > a race condition
> > > +                 * that the parallel apply worker checks there is
> > > + no
> > > pending streaming
> > > +                 * block and before it actually starts waiting on a
> > > lock, the leader
> > > +                 * sends another streaming block and take the
> > > + stream
> > > lock again. In
> > > +                 * this case, the parallel apply worker will start
> > > waiting for the next
> > > +                 * streaming block whereas there is actually a
> > > pending streaming block
> > > +                 * available.
> > > +                 */
> > > +                if (!wshared->pa_wait_for_stream)
> > > +                {
> > > +                        wshared->pending_stream_count++;
> > > +                        SpinLockRelease(&wshared->mutex);
> > > +                        break;
> > > +                }
> > > +
> > > +                SpinLockRelease(&wshared->mutex);
> > > +        }
> > > +}
> > >
> > > I think we should add an assertion to check if we don't hold the stream lock.
> > >
> > > I think that waiting for pa_wait_for_stream to be false in a busy
> > > loop is not a good idea. It's not interruptible and there is not
> > > guarantee that we can break from this loop in a short time. For
> > > instance, if PA executes
> > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > pa_increment_stream_block(), LA has to wait for PA to acquire and
> > > release the stream lock in a busy loop. It should not be long in
> > > normal cases but the duration LA needs to wait for PA depends on PA,
> > > which could be long. Also what if PA raises an error in
> > > pa_lock_stream() due to some reasons? I think LA won't be able to
> > > detect the failure.
> > >
> > > I think we should at least make it interruptible and maybe need to
> > > add some sleep. Or perhaps we can use the condition variable for this case.
> >
> 
> Or we can leave this while (true) logic altogether for the first version and have a
> comment to explain this race. Anyway, after restarting, it will probably be
> solved. We can always change this part of the code later if this really turns out
> to be problematic.

Agreed, and reverted this part.

> 
> > Thanks for the analysis, I will research this part.
> >
> > > ---
> > > In worker.c, we have the following common pattern:
> > >
> > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > >     write change to the file;
> > >     do some work;
> > >     break;
> > >
> > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > >     pa_send_data();
> > >
> > >     if (winfo->serialize_changes)
> > >     {
> > >         do some worker required after writing changes to the file.
> > >     }
> > >     :
> > >     break;
> > >
> > > IIUC there are two different paths for partial serialization: (a)
> > > where apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> > > winfo->serialize_changes became true. And we need to match what we
> > > winfo->do
> > > in (a) and (b). Rather than having two different paths for the same
> > > case, how about falling through TRANS_LEADER_PARTIAL_SERIALIZE when
> > > we could not send the changes? That is, pa_send_data() just returns
> > > false when the timeout exceeds and we need to switch to serialize
> > > changes, otherwise returns true. If it returns false, we prepare for
> > > switching to serialize changes such as initializing fileset, and
> > > fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code would be
> like:
> > >
> > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > >     ret = pa_send_data();
> > >
> > >     if (ret)
> > >     {
> > >         do work for sending changes to PA.
> > >         break;
> > >     }
> > >
> > >     /* prepare for switching to serialize changes */
> > >     winfo->serialize_changes = true;
> > >     initialize fileset;
> > >     acquire stream lock if necessary;
> > >
> > >     /* FALLTHROUGH */
> > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > >     do work for serializing changes;
> > >     break;
> >
> > I think that the suggestion is to extract the code that switch to
> > serialize mode out of the pa_send_data(), and then we need to add that
> > logic in all the functions which call pa_send_data(), I am not sure if
> > it looks better as it might introduce some more codes in each handling
> function.
> >
> 
> How about extracting the common code from apply_handle_stream_commit
> and apply_handle_stream_prepare to a separate function say
> pa_xact_finish_common()? I see there is a lot of common code (unlock the
> stream, wait for the finish, store flush location, free worker
> info) in both the functions for TRANS_LEADER_PARTIAL_SERIALIZE and
> TRANS_LEADER_SEND_TO_PARALLEL cases.

Agreed, changed. I also addressed Sawada-san's comment by extracting the
code that switches to serialize mode out of pa_send_data().

> >
> > > ---
> > >  void
> > > pa_lock_stream(TransactionId xid, LOCKMODE lockmode) {
> > >     LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
> > >                                    PARALLEL_APPLY_LOCK_STREAM,
> > > lockmode); }
> > >
> > > I think since we don't need to let the caller to specify the lock
> > > mode but need only shared and exclusive modes, we can make it simple
> > > by having a boolean argument say shared instead of lockmode.
> >
> > I personally think passing the lockmode would make the code more clear
> > than passing a Boolean value.
> >
> 
> +1.
> 
> I have made a few changes in the newly added comments and function name in
> the attached patch. Kindly include this if you find the changes okay.

Thanks, I have checked and merged it.

Attach the new version patch set which addressed all comments so far.

Best regards,
Hou zj


Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Dec 14, 2022 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 14, 2022 at 9:50 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Here are comments on v59 0001, 0002 patches:
> >
> > Thanks for the comments!
> >
> > > +void
> > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> > > +        while (1)
> > > +        {
> > > +                SpinLockAcquire(&wshared->mutex);
> > > +
> > > +                /*
> > > +                 * Don't try to increment the count if the parallel
> > > apply worker is
> > > +                 * taking the stream lock. Otherwise, there would be
> > > a race condition
> > > +                 * that the parallel apply worker checks there is no
> > > pending streaming
> > > +                 * block and before it actually starts waiting on a
> > > lock, the leader
> > > +                 * sends another streaming block and take the stream
> > > lock again. In
> > > +                 * this case, the parallel apply worker will start
> > > waiting for the next
> > > +                 * streaming block whereas there is actually a
> > > pending streaming block
> > > +                 * available.
> > > +                 */
> > > +                if (!wshared->pa_wait_for_stream)
> > > +                {
> > > +                        wshared->pending_stream_count++;
> > > +                        SpinLockRelease(&wshared->mutex);
> > > +                        break;
> > > +                }
> > > +
> > > +                SpinLockRelease(&wshared->mutex);
> > > +        }
> > > +}
> > >
> > > I think we should add an assertion to check if we don't hold the stream lock.
> > >
> > > I think that waiting for pa_wait_for_stream to be false in a busy loop is not a
> > > good idea. It's not interruptible and there is not guarantee that we can break
> > > from this loop in a short time. For instance, if PA executes
> > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > pa_increment_stream_block(), LA has to wait for PA to acquire and release the
> > > stream lock in a busy loop. It should not be long in normal cases but the
> > > duration LA needs to wait for PA depends on PA, which could be long. Also
> > > what if PA raises an error in
> > > pa_lock_stream() due to some reasons? I think LA won't be able to detect the
> > > failure.
> > >
> > > I think we should at least make it interruptible and maybe need to add some
> > > sleep. Or perhaps we can use the condition variable for this case.
> >
>
> Or we can leave this while (true) logic altogether for the first
> version and have a comment to explain this race. Anyway, after
> restarting, it will probably be solved. We can always change this part
> of the code later if this really turns out to be problematic.
>

+1. Thank you Hou-san for adding this comment in the latest version (v61) patch!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 15, 2022 at 8:58 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Few minor comments:
=================
1.
+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+ if (xid_tmp == subxid)
+ {
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);

I find that there is always one extra element in the list after
rollback to savepoint. Don't we need to truncate the list to 'i', as
shown in the diff below? (A worked example follows the diff.)

2.
* Note that If it's an empty sub-transaction then we will not find
* the subxid here.

"If" in the above comment seems to be in the wrong case. Anyway, I have
slightly modified it as you can see in the diff below.

$ git diff
diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 11695c75fa..c809b1fd01 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1516,8 +1516,8 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
                 * Search the subxactlist, determine the offset tracked for the
                 * subxact, and truncate the list.
                 *
-                * Note that If it's an empty sub-transaction then we will not find
-                * the subxid here.
+                * Note that for an empty sub-transaction we won't find the subxid
+                * here.
                 */
                for (i = list_length(subxactlist) - 1; i >= 0; i--)
                {
@@ -1527,7 +1527,7 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
                        {
                                RollbackToSavepoint(spname);
                                CommitTransactionCommand();
-                               subxactlist = list_truncate(subxactlist, i + 1);
+                               subxactlist = list_truncate(subxactlist, i);
                                break;
                        }
                }
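
To make the off-by-one in point 1 concrete, here is a hypothetical trace
(the xids are made up for illustration):

/*
 * Suppose subxactlist tracks the started subxacts in order:
 *     subxactlist = [501, 502, 503]
 * and we roll back to the savepoint created for subxid 502, found at
 * offset i = 1.  list_truncate(subxactlist, i + 1) leaves [501, 502],
 * so the aborted subxact 502 is still tracked, whereas
 * list_truncate(subxactlist, i) leaves just [501], which is what we want.
 */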


-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Dec 15, 2022 at 12:28 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, December 14, 2022 2:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > On Wed, Dec 14, 2022 at 9:50 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Here are comments on v59 0001, 0002 patches:
> > >
> > > Thanks for the comments!
> > >
> > > > +void
> > > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared) {
> > > > +        while (1)
> > > > +        {
> > > > +                SpinLockAcquire(&wshared->mutex);
> > > > +
> > > > +                /*
> > > > +                 * Don't try to increment the count if the parallel
> > > > apply worker is
> > > > +                 * taking the stream lock. Otherwise, there would
> > > > + be
> > > > a race condition
> > > > +                 * that the parallel apply worker checks there is
> > > > + no
> > > > pending streaming
> > > > +                 * block and before it actually starts waiting on a
> > > > lock, the leader
> > > > +                 * sends another streaming block and take the
> > > > + stream
> > > > lock again. In
> > > > +                 * this case, the parallel apply worker will start
> > > > waiting for the next
> > > > +                 * streaming block whereas there is actually a
> > > > pending streaming block
> > > > +                 * available.
> > > > +                 */
> > > > +                if (!wshared->pa_wait_for_stream)
> > > > +                {
> > > > +                        wshared->pending_stream_count++;
> > > > +                        SpinLockRelease(&wshared->mutex);
> > > > +                        break;
> > > > +                }
> > > > +
> > > > +                SpinLockRelease(&wshared->mutex);
> > > > +        }
> > > > +}
> > > >
> > > > I think we should add an assertion to check if we don't hold the stream lock.
> > > >
> > > > I think that waiting for pa_wait_for_stream to be false in a busy
> > > > loop is not a good idea. It's not interruptible and there is not
> > > > guarantee that we can break from this loop in a short time. For
> > > > instance, if PA executes
> > > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > > pa_increment_stream_block(), LA has to wait for PA to acquire and
> > > > release the stream lock in a busy loop. It should not be long in
> > > > normal cases but the duration LA needs to wait for PA depends on PA,
> > > > which could be long. Also what if PA raises an error in
> > > > pa_lock_stream() due to some reasons? I think LA won't be able to
> > > > detect the failure.
> > > >
> > > > I think we should at least make it interruptible and maybe need to
> > > > add some sleep. Or perhaps we can use the condition variable for this case.
> > >
> >
> > Or we can leave this while (true) logic altogether for the first version and have a
> > comment to explain this race. Anyway, after restarting, it will probably be
> > solved. We can always change this part of the code later if this really turns out
> > to be problematic.
>
> Agreed, and reverted this part.
>
> >
> > > Thanks for the analysis, I will research this part.
> > >
> > > > ---
> > > > In worker.c, we have the following common pattern:
> > > >
> > > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > > >     write change to the file;
> > > >     do some work;
> > > >     break;
> > > >
> > > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > > >     pa_send_data();
> > > >
> > > >     if (winfo->serialize_changes)
> > > >     {
> > > >         do some worker required after writing changes to the file.
> > > >     }
> > > >     :
> > > >     break;
> > > >
> > > > IIUC there are two different paths for partial serialization: (a)
> > > > where apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > > > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> > > > winfo->serialize_changes became true. And we need to match what we
> > > > winfo->do
> > > > in (a) and (b). Rather than having two different paths for the same
> > > > case, how about falling through TRANS_LEADER_PARTIAL_SERIALIZE when
> > > > we could not send the changes? That is, pa_send_data() just returns
> > > > false when the timeout exceeds and we need to switch to serialize
> > > > changes, otherwise returns true. If it returns false, we prepare for
> > > > switching to serialize changes such as initializing fileset, and
> > > > fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code would be
> > like:
> > > >
> > > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > > >     ret = pa_send_data();
> > > >
> > > >     if (ret)
> > > >     {
> > > >         do work for sending changes to PA.
> > > >         break;
> > > >     }
> > > >
> > > >     /* prepare for switching to serialize changes */
> > > >     winfo->serialize_changes = true;
> > > >     initialize fileset;
> > > >     acquire stream lock if necessary;
> > > >
> > > >     /* FALLTHROUGH */
> > > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > > >     do work for serializing changes;
> > > >     break;
> > >
> > > I think that the suggestion is to extract the code that switch to
> > > serialize mode out of the pa_send_data(), and then we need to add that
> > > logic in all the functions which call pa_send_data(), I am not sure if
> > > it looks better as it might introduce some more codes in each handling
> > function.
> > >
> >
> > How about extracting the common code from apply_handle_stream_commit
> > and apply_handle_stream_prepare to a separate function say
> > pa_xact_finish_common()? I see there is a lot of common code (unlock the
> > stream, wait for the finish, store flush location, free worker
> > info) in both the functions for TRANS_LEADER_PARTIAL_SERIALIZE and
> > TRANS_LEADER_SEND_TO_PARALLEL cases.
>
> Agreed, changed. I also addressed Sawada-san comment by extracting the
> code that switch to serialize out of pa_send_data().
>
> > >
> > > > ---
> > > >  void
> > > > pa_lock_stream(TransactionId xid, LOCKMODE lockmode) {
> > > >     LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
> > > >                                    PARALLEL_APPLY_LOCK_STREAM,
> > > > lockmode); }
> > > >
> > > > I think since we don't need to let the caller to specify the lock
> > > > mode but need only shared and exclusive modes, we can make it simple
> > > > by having a boolean argument say shared instead of lockmode.
> > >
> > > I personally think passing the lockmode would make the code more clear
> > > than passing a Boolean value.
> > >
> >
> > +1.
> >
> > I have made a few changes in the newly added comments and function name in
> > the attached patch. Kindly include this if you find the changes okay.
>
> Thanks, I have checked and merged it.
>
> Attach the new version patch set which addressed all comments so far.

Thank you for updating the patches! Here are some minor comments:

@@ -100,7 +100,6 @@ static void check_duplicates_in_publist(List
*publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist,
bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char
*slotname, char *err);

-
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
  *

Unnecessary line removal.

---
+ * Swtich to PARTIAL_SERIALIZE mode for the current transaction -- this means

typo

s/Swtich/Switch/

---
+pa_has_spooled_message_pending()
+{
+       PartialFileSetState fileset_state;
+
+       fileset_state = pa_get_fileset_state();
+
+       if (fileset_state != FS_UNKNOWN)
+               return true;
+       else
+               return false;
+}

I think we can simply do:

return (fileset_state != FS_UNKNOWN);

Or do we need this function in the first place? I think we can do it in
LogicalParallelApplyLoop() like:

else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
    /* Check if changes have been serialized to a file. */
    if (pa_get_fileset_state() != FS_UNKNOWN)
    {
        pa_spooled_messages();
    }

Also, I think the name FS_UNKNOWN doesn't mean anything. It rather
sounds as if we don't expect this state, but that's not true. How about
FS_INITIAL or FS_EMPTY? That sounds more understandable.

---
+/*
+ * Wait until the parallel apply worker's transaction finishes.
+ */
+void
+pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)

I think we no longer need to expose pa_wait_for_xact_finish().

---
+       active_workers = list_copy(ParallelApplyWorkerPool);
+
+       foreach(lc, active_workers)
+       {
+               int                     slot_no;
+               uint16          generation;
+               ParallelApplyWorkerInfo *winfo =
(ParallelApplyWorkerInfo *) lfirst(lc);
+
+               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+               napplyworkers =
logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
+               LWLockRelease(LogicalRepWorkerLock);
+
+               if (napplyworkers <=
max_parallel_apply_workers_per_subscription / 2)
+                       return;
+

Calling logicalrep_pa_worker_count() with the lwlock for each worker seems
inefficient to me. I think we can get the number of workers once at
the top of this function and return if it's already lower than the
maximum pool size. Otherwise, we attempt to stop extra workers.
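
For illustration, with the count taken once at the top, the function could
look roughly like the following (a sketch only; the per-worker stop/detach
logic is elided):

static void
pa_stop_idle_workers(void)
{
    int         max_keep = max_parallel_apply_workers_per_subscription / 2;
    int         napplyworkers;
    List       *active_workers;
    ListCell   *lc;

    /* Take the worker count once, instead of once per pool entry. */
    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
    napplyworkers = logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
    LWLockRelease(LogicalRepWorkerLock);

    if (napplyworkers <= max_keep)
        return;

    active_workers = list_copy(ParallelApplyWorkerPool);
    foreach(lc, active_workers)
    {
        ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

        /* Stop this idle worker and detach it from the pool (elided). */
        (void) winfo;

        if (--napplyworkers <= max_keep)
            break;
    }
    list_free(active_workers);
}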

---
+bool
+pa_free_worker(ParallelApplyWorkerInfo *winfo, TransactionId xid)
+{


Is there any reason why this function has the XID as a separate
argument? It seems to me that since we always call this function with
'winfo' and 'winfo->shared->xid', we can remove xid from the function
argument.

---
+       /* Initialize shared memory area. */
+       SpinLockAcquire(&winfo->shared->mutex);
+       winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
+       winfo->shared->xid = xid;
+       SpinLockRelease(&winfo->shared->mutex);

It's practically no problem, but is there any reason why some fields of
ParallelApplyWorkerInfo are initialized in pa_setup_dsm() whereas others
are initialized here?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, December 16, 2022 3:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> 
>Here are some minor comments:

Thanks for the comments!

> ---
> +pa_has_spooled_message_pending()
> +{
> +       PartialFileSetState fileset_state;
> +
> +       fileset_state = pa_get_fileset_state();
> +
> +       if (fileset_state != FS_UNKNOWN)
> +               return true;
> +       else
> +               return false;
> +}
> 
> I think we can simply do:
> 
> return (fileset_state != FS_UNKNOWN);

Will change.

> 
> Or do we need this function in the first place? I think we can do in
> LogicalParallelApplyLoop() like:

I intended not to expose the fileset state in the main loop, so it may be
better to keep this function.

> ---
> +       active_workers = list_copy(ParallelApplyWorkerPool);
> +
> +       foreach(lc, active_workers)
> +       {
> +               int                     slot_no;
> +               uint16          generation;
> +               ParallelApplyWorkerInfo *winfo =
> (ParallelApplyWorkerInfo *) lfirst(lc);
> +
> +               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> +               napplyworkers =
> logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
> +               LWLockRelease(LogicalRepWorkerLock);
> +
> +               if (napplyworkers <=
> max_parallel_apply_workers_per_subscription / 2)
> +                       return;
> +
> 
> Calling logicalrep_pa_worker_count() with lwlock for each worker seems
> not efficient to me. I think we can get the number of workers once at
> the top of this function and return if it's already lower than the
> maximum pool size. Otherwise, we attempt to stop extra workers.

How about directly checking the length of the worker pool list here?
That seems simpler and doesn't need the lock.

> ---
> +bool
> +pa_free_worker(ParallelApplyWorkerInfo *winfo, TransactionId xid)
> +{
> 
> 
> Is there any reason why this function has the XID as a separate
> argument? It seems to me that since we always call this function with
> 'winfo' and 'winfo->shared->xid', we can remove xid from the function
> argument.
> 
> ---
> +       /* Initialize shared memory area. */
> +       SpinLockAcquire(&winfo->shared->mutex);
> +       winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
> +       winfo->shared->xid = xid;
> +       SpinLockRelease(&winfo->shared->mutex);
> 
> It's practically no problem but is there any reason why some fields of
> ParallelApplyWorkerInfo are initialized in pa_setup_dsm() whereas some
> fields are done here?

We could be reusing an old worker from the pool here, in which case we need
to update these fields with the new streaming transaction's information.

I will address the other comments, except the ones above that are still
being discussed.

Best regards,
Hou zj


On Thu, Dec 15, 2022 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

I have noticed that the origin information of a rollback is not
restored after a restart of the server, so the apply worker will send
the old origin information in that case. It seems we need the below
change in XactLogAbortRecord(). What do you think?

diff --git a/src/backend/access/transam/xact.c
b/src/backend/access/transam/xact.c
index 419fac5d6f..1b047133db 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5880,11 +5880,10 @@ XactLogAbortRecord(TimestampTz abort_time,
        }

        /*
-        * Dump transaction origin information only for abort prepared. We need
-        * this during recovery to update the replication origin progress.
+        * Dump transaction origin information. We need this during recovery to
+        * update the replication origin progress.
         */
-       if ((replorigin_session_origin != InvalidRepOriginId) &&
-               TransactionIdIsValid(twophase_xid))
+       if (replorigin_session_origin != InvalidRepOriginId)
        {
                xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;

@@ -5941,8 +5940,8 @@ XactLogAbortRecord(TimestampTz abort_time,
        if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
                XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));

-       if (TransactionIdIsValid(twophase_xid))
-               XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+       /* include the replication origin */
+       XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);


-- 
With Regards,
Amit Kapila.



On Fri, Dec 16, 2022 at 2:47 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> > ---
> > +       active_workers = list_copy(ParallelApplyWorkerPool);
> > +
> > +       foreach(lc, active_workers)
> > +       {
> > +               int                     slot_no;
> > +               uint16          generation;
> > +               ParallelApplyWorkerInfo *winfo =
> > (ParallelApplyWorkerInfo *) lfirst(lc);
> > +
> > +               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > +               napplyworkers =
> > logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
> > +               LWLockRelease(LogicalRepWorkerLock);
> > +
> > +               if (napplyworkers <=
> > max_parallel_apply_workers_per_subscription / 2)
> > +                       return;
> > +
> >
> > Calling logicalrep_pa_worker_count() with lwlock for each worker seems
> > not efficient to me. I think we can get the number of workers once at
> > the top of this function and return if it's already lower than the
> > maximum pool size. Otherwise, we attempt to stop extra workers.
>
> How about we directly check the length of worker pool list here which
> seems simpler and don't need to lock ?
>

I don't see any problem with that. Also, if such a check is safe then
can't we use the same in pa_free_worker() as well? BTW, shouldn't
pa_stop_idle_workers() try to free/stop workers until the active
number falls below max_parallel_apply_workers_per_subscription?
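
To illustrate, a minimal sketch of the pool-length check being proposed (the
names are taken from the quoted patch; the exact placement inside
pa_stop_idle_workers() is an assumption):

    /*
     * Sketch: decide whether to keep trimming pooled workers by looking at
     * the leader-local pool list directly, instead of counting launcher
     * slots under LogicalRepWorkerLock.
     */
    if (list_length(ParallelApplyWorkerPool) <=
        max_parallel_apply_workers_per_subscription / 2)
        return;

Since ParallelApplyWorkerPool is the leader's own local list, reading its
length needs no lock, which is what makes this simplification safe.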

-- 
With Regards,
Amit Kapila.



On Fri, Dec 16, 2022 at 4:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 16, 2022 at 2:47 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > > ---
> > > +       active_workers = list_copy(ParallelApplyWorkerPool);
> > > +
> > > +       foreach(lc, active_workers)
> > > +       {
> > > +               int                     slot_no;
> > > +               uint16          generation;
> > > +               ParallelApplyWorkerInfo *winfo =
> > > (ParallelApplyWorkerInfo *) lfirst(lc);
> > > +
> > > +               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > +               napplyworkers =
> > > logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
> > > +               LWLockRelease(LogicalRepWorkerLock);
> > > +
> > > +               if (napplyworkers <=
> > > max_parallel_apply_workers_per_subscription / 2)
> > > +                       return;
> > > +
> > >
> > > Calling logicalrep_pa_worker_count() with lwlock for each worker seems
> > > not efficient to me. I think we can get the number of workers once at
> > > the top of this function and return if it's already lower than the
> > > maximum pool size. Otherwise, we attempt to stop extra workers.
> >
> > How about we directly check the length of worker pool list here which
> > seems simpler and don't need to lock ?
> >
>
> I don't see any problem with that. Also, if such a check is safe then
> can't we use the same in pa_free_worker() as well? BTW, shouldn't
> pa_stop_idle_workers() try to free/stop workers unless the active
> number reaches below max_parallel_apply_workers_per_subscription?
>

BTW, can we move pa_stop_idle_workers() functionality to a later patch
(say into v61-0006*)? That way we can focus on it separately once the
main patch is committed.

-- 
With Regards,
Amit Kapila.



On Saturday, December 17, 2022 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Dec 16, 2022 at 4:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Dec 16, 2022 at 2:47 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > > ---
> > > > +       active_workers = list_copy(ParallelApplyWorkerPool);
> > > > +
> > > > +       foreach(lc, active_workers)
> > > > +       {
> > > > +               int                     slot_no;
> > > > +               uint16          generation;
> > > > +               ParallelApplyWorkerInfo *winfo =
> > > > (ParallelApplyWorkerInfo *) lfirst(lc);
> > > > +
> > > > +               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > +               napplyworkers =
> > > > logicalrep_pa_worker_count(MyLogicalRepWorker->subid);
> > > > +               LWLockRelease(LogicalRepWorkerLock);
> > > > +
> > > > +               if (napplyworkers <=
> > > > max_parallel_apply_workers_per_subscription / 2)
> > > > +                       return;
> > > > +
> > > >
> > > > Calling logicalrep_pa_worker_count() with lwlock for each worker
> > > > seems not efficient to me. I think we can get the number of
> > > > workers once at the top of this function and return if it's
> > > > already lower than the maximum pool size. Otherwise, we attempt to stop
> extra workers.
> > >
> > > How about we directly check the length of worker pool list here
> > > which seems simpler and don't need to lock ?
> > >
> >
> > I don't see any problem with that. Also, if such a check is safe then
> > can't we use the same in pa_free_worker() as well? BTW, shouldn't
> > pa_stop_idle_workers() try to free/stop workers unless the active
> > number reaches below max_parallel_apply_workers_per_subscription?
> >
> 
> BTW, can we move pa_stop_idle_workers() functionality to a later patch (say into
> v61-0006*)? That way we can focus on it separately once the main patch is
> committed.

Agreed. I have addressed all the comments and made some cosmetic changes.
Attached is the new version patch set.

Best regards,
Hou zj



On Sat, Dec 17, 2022 at 7:34 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Agreed. I have addressed all the comments and did some cosmetic changes.
> Attach the new version patch set.
>

Few comments:
============
1.
+ if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }
+
+ /*
+ * We cannot read the file immediately after the leader has serialized all
+ * changes to the file because there may still be messages in the memory
+ * queue. We will apply all spooled messages the next time we call this
+ * function, which should ensure that there are no messages left in the
+ * memory queue.
+ */
+ else if (fileset_state == FS_SERIALIZE_DONE)
+ {

Once we have waited in the FS_SERIALIZE_IN_PROGRESS state, the file state
can be FS_SERIALIZE_DONE immediately after that. So, won't it be
better to have a separate if block for the FS_SERIALIZE_DONE state? If you
agree to do so then we can probably remove the comment: "* XXX It is
possible that immediately after we have waited for a lock in ...".
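
A rough sketch of the suggested restructuring (function and state names are
from the quoted patch; the surrounding context is assumed):

    fileset_state = pa_get_fileset_state();

    /*
     * If the leader is still serializing changes, wait for it to finish by
     * acquiring and releasing the stream lock it holds, then re-read the
     * state.
     */
    if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
    {
        pa_lock_stream(MyParallelShared->xid, AccessShareLock);
        pa_unlock_stream(MyParallelShared->xid, AccessShareLock);

        fileset_state = pa_get_fileset_state();
    }

    /*
     * Using a separate check (rather than an else-if) also covers the case
     * where the state advanced to FS_SERIALIZE_DONE while we were waiting
     * above.
     */
    if (fileset_state == FS_SERIALIZE_DONE)
    {
        /* read and apply the spooled messages from the file set */
    }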

2.
+void
+pa_decr_and_wait_stream_block(void)
+{
+ Assert(am_parallel_apply_worker());
+
+ if (pg_atomic_sub_fetch_u32(&MyParallelShared->pending_stream_count, 1) == 0)

I think the count can go negative here when we are in serialize mode
because we don't increase it in serialize mode. I can't see any
problem due to that but OTOH, this doesn't seem to be intended, because
if in the future we decide to implement the functionality of switching
back to non-serialize mode, this could be a problem. Also, I guess we
don't even need to try locking/unlocking the stream lock in that case.
One idea to avoid this is to check if the pending count is zero; then,
if the file_set is not available, raise an error (elog ERROR), otherwise
simply return from here.
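
A sketch of the suggested guard at the top of pa_decr_and_wait_stream_block()
(the helper name pa_has_spooled_message_pending() is an assumed way to check
whether a spooled file set is available):

    /* Sketch: never let pending_stream_count go below zero in serialize mode. */
    if (pg_atomic_read_u32(&MyParallelShared->pending_stream_count) == 0)
    {
        /* In serialize mode the remaining changes arrive via the file set. */
        if (pa_has_spooled_message_pending())
            return;

        elog(ERROR, "invalid pending streaming block number");
    }

    /* Otherwise it is safe to decrement and wait for the next block as before. */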

3. In apply_handle_stream_stop(), we are setting backendstate as idle
for cases TRANS_LEADER_SEND_TO_PARALLEL and TRANS_PARALLEL_APPLY. For
other cases, it is set by stream_stop_internal. I think it would be
better to set the state explicitly for all cases to make the code look
consistent and remove it from stream_stop_internal(). The other reason
to remove setting the state from stream_stop_internal() is that when
that function is invoked from other places like
apply_handle_stream_commit(), it seems to set the state to idle before
we actually reach the idle state.

4. Apart from the above, I have made a few changes in the comments,
see attached.

-- 
With Regards,
Amit Kapila.

Hi, I have done some testing for this patch. This post describes my
tests so far and the results observed.

Background - Testing multiple PA workers:
---------------------------------------

The "parallel apply" feature allocates the PA workers (if it can) upon
receiving STREAM_START replication protocol msg. This means that if
there are replication messages for overlapping streaming transactions
you should see multiple PA workers processing them (assuming the PA
pool size is configured appropriately).

But AFAIK the only way to cause replication protocol messages to
arrive and be applied in a particular order is by manual testing (e.g.
use two psql sessions and manually arrange for there to be overlapping
transactions for the published table). I have tried to make this kind
of (regression) testing easier -- in order to test many overlapping
combinations in a repeatable and semi-automated way I have posted a
small enhancement to the isolationtester spec grammar [1]. Using this,
now we can just press a button to test lots of different streaming
transaction combinations and then observe the parallel apply message
dispatching in action...

Test message combinations (from specs/pub-sub.spec):
----------------------------------------------------

# single tx
permutation ps1_begin ps1_ins ps1_commit ps1_sel ps2_sel sub_sleep sub_sel
permutation ps2_begin ps2_ins ps2_commit ps1_sel ps2_sel sub_sleep sub_sel

# rollback
permutation ps1_begin ps1_ins ps1_rollback ps1_sel sub_sleep sub_sel

# overlapping tx rollback and commit
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_rollback
ps2_commit sub_sleep sub_sel
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_commit
ps2_rollback sub_sleep sub_sel

# overlapping tx commits
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps2_commit ps1_commit
sub_sleep sub_sel
permutation ps1_begin ps1_ins ps2_begin ps2_ins ps1_commit ps2_commit
sub_sleep sub_sel

permutation ps1_begin ps2_begin ps1_ins ps2_ins ps2_commit ps1_commit
sub_sleep sub_sel
permutation ps1_begin ps2_begin ps1_ins ps2_ins ps1_commit ps2_commit
sub_sleep sub_sel

permutation ps1_begin ps2_begin ps2_ins ps1_ins ps2_commit ps1_commit
sub_sleep sub_sel
permutation ps1_begin ps2_begin ps2_ins ps1_ins ps1_commit ps2_commit
sub_sleep sub_sel

Test setup:
-----------

1. Setup publisher and subscriber servers

1a. The publisher server is configured to use the new GUC 'force_stream_mode =
true' [2]. This means even single-row inserts cause replication
STREAM_START messages, which will trigger the PA workers.

1b. The subscriber server is configured to use the new GUC
'max_parallel_apply_workers_per_subscription'. Set this value to
change how many PA workers can be allocated.

2. isolation/specs/pub-test.spec (defines the publisher sessions being tested)


How verified:
-------------

1. Running the isolationtester pub-sub.spec test gives the expected
table results (so data was replicated OK)
- any new permutations can be added as required.
- more overlapping sessions (e.g. 3 or 4...) can be added as required.

2. Changing the publisher GUC 'force_stream_mode' to be true/false
- we can see whether PA workers are being used or not -- (ps -eaf |
grep 'logical replication')

3. Changing the subscriber GUC 'max_parallel_apply_workers_per_subscription'
- set to high value or low value so we can see the PA worker (pool)
being used or filling to capacity

4. I have also patched some temporary logging into the code for both "LA"
and "PA" workers
- now the subscriber logfile leaves a trail of evidence about which
worker did what (for apply_dispatch and for locking calls)

Observed Results:
-----------------

1. From the user's POV everything is normal - data gets replicated as
expected regardless of GUC settings (force_streaming /
max_parallel_apply_workers_per_subscription).

[postgres@CentOS7-x64 isolation]$ make check-pub-sub
...
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 61696 with PID 11822
============== creating database "isolation_regression" ==============
CREATE DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
============== running regression test queries        ==============
test pub-sub                      ... ok        33424 ms
============== shutting down postmaster               ==============
============== removing temporary instance            ==============

=====================
 All 1 tests passed.
=====================


2. Confirmation multiple PA workers were used (force_streaming=true /
max_parallel_apply_workers_per_subscription=99)

[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres  5298  5293  0 Dec19 ?        00:00:00 postgres: logical
replication launcher
postgres  5306  5301  0 Dec19 ?        00:00:00 postgres: logical
replication launcher
postgres 17301  5301  0 10:31 ?        00:00:00 postgres: logical
replication parallel apply worker for subscription 16387
postgres 17524  5301  0 10:31 ?        00:00:00 postgres: logical
replication parallel apply worker for subscription 16387
postgres 21134  5301  0 08:08 ?        00:00:01 postgres: logical
replication apply worker for subscription 16387
postgres 22377 13260  0 10:34 pts/0    00:00:00 grep --color=auto
logical replication

3. Confirmation no PA workers were used when not streaming
(force_streaming=false /
max_parallel_apply_workers_per_subscription=99)

[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres 26857 26846  0 10:37 ?        00:00:00 postgres: logical
replication launcher
postgres 26875 26864  0 10:37 ?        00:00:00 postgres: logical
replication launcher
postgres 26889 26864  0 10:37 ?        00:00:00 postgres: logical
replication apply worker for subscription 16387
postgres 29901 13260  0 10:39 pts/0    00:00:00 grep --color=auto
logical replication

4. Confirmation only one PA worker gets used when the pool is limited
(force_streaming=true / max_parallel_apply_workers_per_subscription=1)

4a. (processes)
[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres  2484 13260  0 10:42 pts/0    00:00:00 grep --color=auto
logical replication
postgres 32500 32495  0 10:40 ?        00:00:00 postgres: logical
replication launcher
postgres 32508 32503  0 10:40 ?        00:00:00 postgres: logical
replication launcher
postgres 32514 32503  0 10:41 ?        00:00:00 postgres: logical
replication apply worker for subscription 16387

4b. (logs)
2022-12-20 10:41:43.551 AEDT [32514] LOG:  out of parallel apply workers
2022-12-20 10:41:43.551 AEDT [32514] HINT:  You might need to increase
max_parallel_apply_workers_per_subscription.
2022-12-20 10:41:43.551 AEDT [32514] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "STREAM START"
in transaction 756

5. Confirmation no PA workers get used when there is none available
(force_streaming=true / max_parallel_apply_workers_per_subscription=0)

5a. (processes)
[postgres@CentOS7-x64 isolation]$ ps -eaf | grep 'logical replication'
postgres 10026 10021  0 10:47 ?        00:00:00 postgres: logical
replication launcher
postgres 10034 10029  0 10:47 ?        00:00:00 postgres: logical
replication launcher
postgres 10041 10029  0 10:47 ?        00:00:00 postgres: logical
replication apply worker for subscription 16387
postgres 13068 13260  0 10:48 pts/0    00:00:00 grep --color=auto
logical replication

5b. (logs)
2022-12-20 10:47:50.216 AEDT [10041] LOG:  out of parallel apply workers
2022-12-20 10:47:50.216 AEDT [10041] HINT:  You might need to increase
max_parallel_apply_workers_per_subscription.
..
Also, there are no "PA" log messages present


Summary
-------

In summary, everything I have tested so far appeared to be working
properly. In other words, for overlapping streamed transactions of
different kinds, and regardless of whether zero/some/all of those
transactions are getting processed by a PA worker, the resulting
replicated data looked consistently OK.


PSA some files
- test_init.sh - sample test script for setting up the publisher/subscriber
required by the spec test.
- spec/pub-sub.spec = spec combinations for causing overlapping
streaming transactions
- pub-sub.out = output from successful isolationtester (make check-pub-sub) run
- SUB.log = subscriber logs augmented with my "LA" and "PA" extra
logging for showing locking/dispatching.

(I can also post my logging patch if anyone is interested to try using
it to see the output like in SUB.log).

NOTE - all testing described in this post above was using v58-0001
only. However, the point of implementing these as a .spec test was to
be able to repeat these same regression tests on newer versions with
minimal manual steps required. Later I plan to fetch/apply the most
recent patch version and repeat these same tests.

------
[1] My isolationtester conninfo enhancement v2 -
https://www.postgresql.org/message-id/CAHut%2BPv_1Mev0709uj_OjyNCzfBjENE3RD9%3Dd9RZYfcqUKfG%3DA%40mail.gmail.com
[2] Shi-san's GUC 'force_streaming_mode' -
https://www.postgresql.org/message-id/flat/OSZPR01MB63104E7449DBE41932DB19F1FD1B9%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

On Tue, Dec 20, 2022 at 8:17 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Summary
> -------
>
> In summary, everything I have tested so far appeared to be working
> properly. In other words, for overlapping streamed transactions of
> different kinds, and regardless of whether zero/some/all of those
> transactions are getting processed by a PA worker, the resulting
> replicated data looked consistently OK.
>

Thanks for doing the detailed testing of this patch. I think the one
area where we can focus more is the switch-to-serialization mode while
sending changes to the parallel worker.

>
> NOTE - all testing described in this post above was using v58-0001
> only. However, the point of implementing these as a .spec test was to
> be able to repeat these same regression tests on newer versions with
> minimal manual steps required. Later I plan to fetch/apply the most
> recent patch version and repeat these same tests.
>

That would be really helpful.

-- 
With Regards,
Amit Kapila.



On Tue, Dec 20, 2022 at 2:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 20, 2022 at 8:17 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Summary
> > -------
> >
> > In summary, everything I have tested so far appeared to be working
> > properly. In other words, for overlapping streamed transactions of
> > different kinds, and regardless of whether zero/some/all of those
> > transactions are getting processed by a PA worker, the resulting
> > replicated data looked consistently OK.
> >
>
> Thanks for doing the detailed testing of this patch. I think the one
> area where we can focus more is the switch-to-serialization mode while
> sending changes to the parallel worker.
>
> >
> > NOTE - all testing described in this post above was using v58-0001
> > only. However, the point of implementing these as a .spec test was to
> > be able to repeat these same regression tests on newer versions with
> > minimal manual steps required. Later I plan to fetch/apply the most
> > recent patch version and repeat these same tests.
> >
>
> That would be really helpful.
>

FYI, my pub-sub.spec tests gave the same result (i.e. pass) when
re-run against the latest v62-0001 (parallel apply base patch) and
v62-0004 (GUC 'force_stream_mode' patch).

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, Dec 19, 2022 at 6:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Dec 17, 2022 at 7:34 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Agreed. I have addressed all the comments and did some cosmetic changes.
> > Attach the new version patch set.
> >
>
> Few comments:
> ============
>

Few more minor points:
1.
-static inline void
+void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {

This function seems to be used only in worker.c. So, what is the need
to make it extern?

2. I have made a few changes in the comments. See attached. This is
on top of yesterday's top-up patch.

I think we should merge the 0001 and 0002 patches as they need to be
committed together.

-- 
With Regards,
Amit Kapila.

On Monday, December 19, 2022 8:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Sat, Dec 17, 2022 at 7:34 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Agreed. I have addressed all the comments and did some cosmetic changes.
> > Attach the new version patch set.
> >
> 
> Few comments:
> ============
> 1.
> + if (fileset_state == FS_SERIALIZE_IN_PROGRESS) {
> + pa_lock_stream(MyParallelShared->xid, AccessShareLock);
> + pa_unlock_stream(MyParallelShared->xid, AccessShareLock); }
> +
> + /*
> + * We cannot read the file immediately after the leader has serialized
> + all
> + * changes to the file because there may still be messages in the
> + memory
> + * queue. We will apply all spooled messages the next time we call this
> + * function, which should ensure that there are no messages left in the
> + * memory queue.
> + */
> + else if (fileset_state == FS_SERIALIZE_DONE) {
> 
> Once we have waited in the FS_SERIALIZE_IN_PROGRESS, the file state can be
> FS_SERIALIZE_DONE immediately after that. So, won't it be better to have a
> separate if block for FS_SERIALIZE_DONE state? If you agree to do so then we
> can probably remove the comment: "* XXX It is possible that immediately after
> we have waited for a lock in ...".

Changed and slightly adjusted the comments.

> 2.
> +void
> +pa_decr_and_wait_stream_block(void)
> +{
> + Assert(am_parallel_apply_worker());
> +
> + if (pg_atomic_sub_fetch_u32(&MyParallelShared->pending_stream_count,
> + 1) == 0)
> 
> I think here the count can go negative when we are in serialize mode because
> we don't increase it for serialize mode. I can't see any problem due to that but
> OTOH, this doesn't seem to be intended because in the future if we decide to
> implement the functionality of switching back to non-serialize mode, this could
> be a problem. Also, I guess we don't even need to try locking/unlocking the
> stream lock in that case.
> One idea to avoid this is to check if the pending count is zero then if file_set in
> not available raise an error (elog ERROR), otherwise, simply return from here.

Added the check.

> 
> 3. In apply_handle_stream_stop(), we are setting backendstate as idle for cases
> TRANS_LEADER_SEND_TO_PARALLEL and TRANS_PARALLEL_APPLY. For other
> cases, it is set by stream_stop_internal. I think it would be better to set the state
> explicitly for all cases to make the code look consistent and remove it from
> stream_stop_internal(). The other reason to remove setting the state from
> stream_stop_internal() is that when that function is invoked from other places
> like apply_handle_stream_commit(), it seems to be setting the idle before
> actually we reach the idle state.

Changed. Besides, I noticed that the pgstat_report_activity in pa_stream_abort
for subtransactions is unnecessary since the state should be consistent with the
state set at the last stream_stop, so I have removed that as well.

> 
> 4. Apart from the above, I have made a few changes in the comments, see
> attached.

Thanks, I have merged the patch.

Best regards,
Hou zj

On Tuesday, December 20, 2022 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Dec 19, 2022 at 6:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Dec 17, 2022 at 7:34 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Agreed. I have addressed all the comments and did some cosmetic changes.
> > > Attach the new version patch set.
> > >
> >
> > Few comments:
> > ============
> >
> 
> Few more minor points:
> 1.
> -static inline void
> +void
>  changes_filename(char *path, Oid subid, TransactionId xid)  {
> 
> This function seems to be used only in worker.c. So, what is the need to make it
> extern?

Oh, I forgot to revert this change after removing the one caller outside of worker.c.
Changed.

> 
> 2. I have made a few changes in the comments. See attached. This is atop my
> yesterday's top-up patch.

Thanks, I have checked and merged this.

> I think we should merge the 0001 and 0002 patches as they need to be
> committed together.

Merged and ran pgindent for the patch set.

Attached is the new version patch set, which addresses all comments so far.

Best regards,
Hou zj

FYI - applying v63-0001 using the latest master does not work.

git apply ../patches_misc/v63-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/backend/replication/logical/meson.build:1
error: src/backend/replication/logical/meson.build: patch does not apply

Looks like a recent commit [1] to add copyrights broke the patch

------
[1] https://github.com/postgres/postgres/commit/8284cf5f746f84303eda34d213e89c8439a83a42

Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Dec 20, 2022 at 5:22 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, Dec 20, 2022 at 2:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Dec 20, 2022 at 8:17 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Summary
> > > -------
> > >
> > > In summary, everything I have tested so far appeared to be working
> > > properly. In other words, for overlapping streamed transactions of
> > > different kinds, and regardless of whether zero/some/all of those
> > > transactions are getting processed by a PA worker, the resulting
> > > replicated data looked consistently OK.
> > >
> >
> > Thanks for doing the detailed testing of this patch. I think the one
> > area where we can focus more is the switch-to-serialization mode while
> > sending changes to the parallel worker.
> >
> > >
> > > NOTE - all testing described in this post above was using v58-0001
> > > only. However, the point of implementing these as a .spec test was to
> > > be able to repeat these same regression tests on newer versions with
> > > minimal manual steps required. Later I plan to fetch/apply the most
> > > recent patch version and repeat these same tests.
> > >
> >
> > That would be really helpful.
> >
>

FYI, my pub-sub.spec tests gave the same result (i.e. pass) when
re-run with the latest v63 (0001,0002,0003) applied.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Wed, Dec 21, 2022 9:07 AM Peter Smith <smithpb2250@gmail.com> wrote:
> FYI - applying v63-0001 using the latest master does not work.
> 
> git apply ../patches_misc/v63-0001-Perform-streaming-logical-transactions-by-
> parall.patch
> error: patch failed: src/backend/replication/logical/meson.build:1
> error: src/backend/replication/logical/meson.build: patch does not apply
> 
> Looks like a recent commit [1] to add copyrights broke the patch

Thanks for your reminder.
Rebased the patch set.

Attached is the new patch set, which also includes some
cosmetic comment changes.

Best regards,
Hou zj

On Wed, Dec 21, 2022 at 11:02 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attach the new patch set which also includes some
> cosmetic comment changes.
>

I noticed one problem with the recent change in the patch.

+ * The fileset state should become FS_SERIALIZE_DONE once we have waited
+ * for a lock in the FS_SERIALIZE_IN_PROGRESS state, so we get the state
+ * again and recheck it later.
+ */
+ if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+
+ fileset_state = pa_get_fileset_state();
+ Assert(fileset_state == FS_SERIALIZE_DONE);

This is not always true because, say due to a deadlock, this lock could be
released by the leader worker; in that case, the file state will still be
in progress. So, I think we need a change like the below:
diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 45faa74596..8076786f0d 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -686,8 +686,8 @@ pa_spooled_messages(void)
         * the leader had serialized all changes which can lead to undetected
         * deadlock.
         *
-        * The fileset state must be FS_SERIALIZE_DONE once the leader
worker has
-        * finished serializing the changes.
+        * Note that the fileset state can be FS_SERIALIZE_DONE once the leader
+        * worker has finished serializing the changes.
         */
        if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
        {
@@ -695,7 +695,6 @@ pa_spooled_messages(void)
                pa_unlock_stream(MyParallelShared->xid, AccessShareLock);

                fileset_state = pa_get_fileset_state();
-               Assert(fileset_state == FS_SERIALIZE_DONE);

-- 
With Regards,
Amit Kapila.



On Wed, Dec 21, 2022 at 2:32 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wed, Dec 21, 2022 9:07 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > FYI - applying v63-0001 using the latest master does not work.
> >
> > git apply ../patches_misc/v63-0001-Perform-streaming-logical-transactions-by-
> > parall.patch
> > error: patch failed: src/backend/replication/logical/meson.build:1
> > error: src/backend/replication/logical/meson.build: patch does not apply
> >
> > Looks like a recent commit [1] to add copyrights broke the patch
>
> Thanks for your reminder.
> Rebased the patch set.
>
> Attach the new patch set which also includes some
> cosmetic comment changes.
>

Thank you for updating the patch. Here are some comments on v64 patches:

While testing the patch, I realized that if all streamed transactions
are handled by parallel workers, there is no chance for the leader to
call maybe_reread_subscription() except when waiting for the next
message. Due to this, the leader didn't stop for a while even after the
subscription got disabled. It's an extreme case since my test had
pgbench run 30 concurrent transactions with logical_decoding_mode
= 'immediate', but we might want to make sure to call
maybe_reread_subscription() at least after committing/preparing a
transaction.

---
+        if (pg_atomic_read_u32(&MyParallelShared->pending_stream_count) == 0)
+        {
+                if (pa_has_spooled_message_pending())
+                        return;
+
+                elog(ERROR, "invalid pending streaming block number");
+        }

I think it's helpful if the error message shows the invalid block number.

---
On Wed, Dec 7, 2022 at 10:13 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > ---
> > If a value of max_parallel_apply_workers_per_subscription is not
> > sufficient, we get the LOG "out of parallel apply workers" every time
> > when the apply worker doesn't launch a worker. But do we really need
> > this log? It seems not consistent with
> > max_sync_workers_per_subscription behavior. I think we can check if
> > the number of running parallel workers is less than
> > max_parallel_apply_workers_per_subscription before calling
> > logicalrep_worker_launch(). What do you think?
>
> (Sorry, I missed this comment in last email)
>
> I personally feel giving a hint might help user to realize that the
> max_parallel_applyxxx is not enough for the current workload and then they can
> adjust the parameter. Otherwise, user might have an easy way to check if more
> workers are needed.
>

Sorry, I missed this comment.

I think the number of concurrent transactions on the publisher could
be several hundred, and the number of streamed transactions among
them could be several tens. I agree that setting
max_parallel_apply_workers_per_subscription to a high enough value is
ideal, but I'm not sure we want to inform users immediately that the
setting value is not enough. I think that with the default value
(i.e., 2), it will not be enough for many systems and the server logs
could be flooded with the LOG "out of parallel apply workers". If we
want to give a hint to users, we can probably show statistics in the
pg_stat_subscription_stats view, such as the number of streamed
transactions that are handled by the leader and parallel workers.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for updating the patch. Here are some comments on v64 patches:
>
> While testing the patch, I realized that if all streamed transactions
> are handled by parallel workers, there is no chance for the leader to
> call maybe_reread_subscription() except for when waiting for the next
> message. Due to this, the leader didn't stop for a while even if the
> subscription gets disabled. It's an extreme case since my test was
> that pgbench runs 30 concurrent transactions and logical_decoding_mode
> = 'immediate', but we might want to make sure to call
> maybe_reread_subscription() at least after committing/preparing a
> transaction.
>

Won't it be better to call it only when the transaction is handled by the
parallel worker?

> ---
> +        if (pg_atomic_read_u32(&MyParallelShared->pending_stream_count) == 0)
> +        {
> +                if (pa_has_spooled_message_pending())
> +                        return;
> +
> +                elog(ERROR, "invalid pending streaming block number");
> +        }
>
> I think it's helpful if the error message shows the invalid block number.
>

+1. Additionally, I suggest changing the message to "invalid pending
streaming chunk"?

> ---
> On Wed, Dec 7, 2022 at 10:13 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > ---
> > > If a value of max_parallel_apply_workers_per_subscription is not
> > > sufficient, we get the LOG "out of parallel apply workers" every time
> > > when the apply worker doesn't launch a worker. But do we really need
> > > this log? It seems not consistent with
> > > max_sync_workers_per_subscription behavior. I think we can check if
> > > the number of running parallel workers is less than
> > > max_parallel_apply_workers_per_subscription before calling
> > > logicalrep_worker_launch(). What do you think?
> >
> > (Sorry, I missed this comment in last email)
> >
> > I personally feel giving a hint might help user to realize that the
> > max_parallel_applyxxx is not enough for the current workload and then they can
> > adjust the parameter. Otherwise, user might have an easy way to check if more
> > workers are needed.
> >
>
> Sorry, I missed this comment.
>
> I think the number of concurrent transactions on the publisher could
> be several hundreds, and the number of streamed transactions among
> them could be several tens. I agree setting
> max_parallel_apply_workers_per_subscription to a value high enough is
> ideal but I'm not sure we want to inform users immediately that the
> setting value is not enough. I think that with the default value
> (i.e., 2), it will not be enough for many systems and the server logs
> could be flood with the LOG "out of parallel apply workers".
>

It seems currently we give a similar message when the logical
replication worker slots are exhausted ("out of logical replication
worker slots") or when we are not able to register background workers
("out of background worker slots"). Now, OTOH, when we exceed the limit
of sync workers (max_sync_workers_per_subscription), we don't display
any message. Personally, I think if any user has used the streaming
option as "parallel" she wants all large transactions to be performed
in parallel, and if the system is not able to deal with it, displaying
a LOG message will be useful for users. This is because the
performance difference for large transactions between parallel and
non-parallel is big (30-40%) and it is better for users to know as
soon as possible instead of expecting them to run some monitoring
query to notice the same.

-- 
With Regards,
Amit Kapila.



On Wed, Dec 21, 2022 at 11:02 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new patch set which also includes some
> cosmetic comment changes.
>

Few minor comments:
=================
1.
+       <literal>t</literal> = spill the changes of in-progress transactions to
+       disk and apply at once after the transaction is committed on the
+       publisher,

Can we change this description to: "spill the changes of in-progress
transactions to disk and apply at once after the transaction is
committed on the publisher and received by the subscriber,"

2.
    table is in progress, there will be additional workers for the tables
-   being synchronized.
+   being synchronized. Moreover, if the streaming transaction is applied in
+   parallel, there will be additional workers.

Do we need this change in the first patch? We skip parallel apply
workers from view for the first patch. Am, I missing something?

3.
I think we would need a catversion bump for parallel apply feature
because of below change:
@@ -7913,11 +7913,16 @@ SCRAM-SHA-256$<replaceable><iteration
count></replaceable>:<replaceable>&l

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>substream</structfield> <type>bool</type>
+       <structfield>substream</structfield> <type>char</type>
       </para>

Am, I missing something? If not, then I think we can note that in the
commit message to avoid forgetting it before commit.

4. Kindly change the below comments:
diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 97f4a3037c..02bb608188 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -9,11 +9,10 @@
  *
  * This file contains the code to launch, set up, and teardown a parallel apply
  * worker which receives the changes from the leader worker and
invokes routines
- * to apply those on the subscriber database.
- *
- * This file contains routines that are intended to support setting up, using
- * and tearing down a ParallelApplyWorkerInfo which is required so the leader
- * worker and parallel apply workers can communicate with each other.
+ * to apply those on the subscriber database. Additionally, this file contains
+ * routines that are intended to support setting up, using, and tearing down a
+ * ParallelApplyWorkerInfo which is required so the leader worker and parallel
+ * apply workers can communicate with each other.
  *
  * The parallel apply workers are assigned (if available) as soon as xact's
  * first stream is received for subscriptions that have set their 'streaming'

-- 
With Regards,
Amit Kapila.



On Thu, Dec 22, 2022 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. Here are some comments on v64 patches:
> >
> > While testing the patch, I realized that if all streamed transactions
> > are handled by parallel workers, there is no chance for the leader to
> > call maybe_reread_subscription() except for when waiting for the next
> > message. Due to this, the leader didn't stop for a while even if the
> > subscription gets disabled. It's an extreme case since my test was
> > that pgbench runs 30 concurrent transactions and logical_decoding_mode
> > = 'immediate', but we might want to make sure to call
> > maybe_reread_subscription() at least after committing/preparing a
> > transaction.
> >
>
> Won't it be better to call it only if we handle the transaction by the
> parallel worker?

Agreed. And we won't need to do that after handling stream_prepare as
we don't do that now.

>
> > ---
> > +        if (pg_atomic_read_u32(&MyParallelShared->pending_stream_count) == 0)
> > +        {
> > +                if (pa_has_spooled_message_pending())
> > +                        return;
> > +
> > +                elog(ERROR, "invalid pending streaming block number");
> > +        }
> >
> > I think it's helpful if the error message shows the invalid block number.
> >
>
> +1. Additionally, I suggest changing the message to "invalid pending
> streaming chunk"?
>
> > ---
> > On Wed, Dec 7, 2022 at 10:13 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > ---
> > > > If a value of max_parallel_apply_workers_per_subscription is not
> > > > sufficient, we get the LOG "out of parallel apply workers" every time
> > > > when the apply worker doesn't launch a worker. But do we really need
> > > > this log? It seems not consistent with
> > > > max_sync_workers_per_subscription behavior. I think we can check if
> > > > the number of running parallel workers is less than
> > > > max_parallel_apply_workers_per_subscription before calling
> > > > logicalrep_worker_launch(). What do you think?
> > >
> > > (Sorry, I missed this comment in last email)
> > >
> > > I personally feel giving a hint might help user to realize that the
> > > max_parallel_applyxxx is not enough for the current workload and then they can
> > > adjust the parameter. Otherwise, user might have an easy way to check if more
> > > workers are needed.
> > >
> >
> > Sorry, I missed this comment.
> >
> > I think the number of concurrent transactions on the publisher could
> > be several hundreds, and the number of streamed transactions among
> > them could be several tens. I agree setting
> > max_parallel_apply_workers_per_subscription to a value high enough is
> > ideal but I'm not sure we want to inform users immediately that the
> > setting value is not enough. I think that with the default value
> > (i.e., 2), it will not be enough for many systems and the server logs
> > could be flood with the LOG "out of parallel apply workers".
> >
>
> It seems currently we give a similar message when the logical
> replication worker slots are finished "out of logical replication
> worker slots" or when we are not able to register background workers
> "out of background worker slots". Now, OTOH, when we exceed the limit
> of sync workers "max_sync_workers_per_subscription", we don't display
> any message. Personally, I think if any user has used the streaming
> option as "parallel" she wants all large transactions to be performed
> in parallel and if the system is not able to deal with it, displaying
> a LOG message will be useful for users. This is because the
> performance difference for large transactions between parallel and
> non-parallel is big (30-40%) and it is better for users to know as
> soon as possible instead of expecting them to run some monitoring
> query to notice the same.

I see your point. But looking at other parallel features such as
parallel queries, parallel vacuum and parallel index creation, we
don't give such messages even if the number of parallel workers
actually launched is lower than the ideal. They also bring a big
performance benefit.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Dec 22, 2022 at 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 22, 2022 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Thank you for updating the patch. Here are some comments on v64 patches:
> > >
> > > While testing the patch, I realized that if all streamed transactions
> > > are handled by parallel workers, there is no chance for the leader to
> > > call maybe_reread_subscription() except for when waiting for the next
> > > message. Due to this, the leader didn't stop for a while even if the
> > > subscription gets disabled. It's an extreme case since my test was
> > > that pgbench runs 30 concurrent transactions and logical_decoding_mode
> > > = 'immediate', but we might want to make sure to call
> > > maybe_reread_subscription() at least after committing/preparing a
> > > transaction.
> > >
> >
> > Won't it be better to call it only if we handle the transaction by the
> > parallel worker?
>
> Agreed. And we won't need to do that after handling stream_prepare as
> we don't do that now.
>

I think we do this for both the prepare and non-prepare cases via
begin_replication_step(). Here, since the changes are sent to the
parallel apply worker, we missed it in both cases. So, I think it
is better to do it in both cases.
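
As a purely illustrative sketch of the idea (the exact call site is what is
being settled in this exchange, so both the placement and the condition below
are assumptions; maybe_reread_subscription() is the existing function named
above):

    /*
     * Sketch: when a streamed transaction's commit or prepare has been
     * handed to a parallel apply worker, the leader never went through
     * begin_replication_step(), so re-check the subscription explicitly to
     * notice changes such as the subscription being disabled.
     */
    if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
        maybe_reread_subscription();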

> >
> > It seems currently we give a similar message when the logical
> > replication worker slots are finished "out of logical replication
> > worker slots" or when we are not able to register background workers
> > "out of background worker slots". Now, OTOH, when we exceed the limit
> > of sync workers "max_sync_workers_per_subscription", we don't display
> > any message. Personally, I think if any user has used the streaming
> > option as "parallel" she wants all large transactions to be performed
> > in parallel and if the system is not able to deal with it, displaying
> > a LOG message will be useful for users. This is because the
> > performance difference for large transactions between parallel and
> > non-parallel is big (30-40%) and it is better for users to know as
> > soon as possible instead of expecting them to run some monitoring
> > query to notice the same.
>
> I see your point. But looking at other parallel features such as
> parallel queries, parallel vacuum and parallel index creation, we
> don't give such messages even if the number of parallel workers
> actually launched is lower than the ideal. They also bring a big
> performance benefit.
>

Fair enough. Let's remove this LOG message.

-- 
With Regards,
Amit Kapila.



On Fri, Dec 23, 2022 at 12:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 22, 2022 at 6:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 22, 2022 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Thank you for updating the patch. Here are some comments on v64 patches:
> > > >
> > > > While testing the patch, I realized that if all streamed transactions
> > > > are handled by parallel workers, there is no chance for the leader to
> > > > call maybe_reread_subscription() except for when waiting for the next
> > > > message. Due to this, the leader didn't stop for a while even if the
> > > > subscription gets disabled. It's an extreme case since my test was
> > > > that pgbench runs 30 concurrent transactions and logical_decoding_mode
> > > > = 'immediate', but we might want to make sure to call
> > > > maybe_reread_subscription() at least after committing/preparing a
> > > > transaction.
> > > >
> > >
> > > Won't it be better to call it only if we handle the transaction by the
> > > parallel worker?
> >
> > Agreed. And we won't need to do that after handling stream_prepare as
> > we don't do that now.
> >
>
> I think we do this for both prepare and non-prepare cases via
> begin_replication_step(). Here, in both cases, as the changes are sent
> to the parallel apply worker, we missed in both cases. So, I think it
> is better to do in both cases.

Agreed. I missed that we call maybe_reread_subscription() even in the
prepare case.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thursday, December 22, 2022 8:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Dec 21, 2022 at 11:02 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the new patch set which also includes some cosmetic comment
> > changes.
> >
> 
> Few minor comments:
> =================
> 1.
> +       <literal>t</literal> = spill the changes of in-progress
> transactions to+       disk and apply at once after the transaction is
> committed on the+       publisher,
> 
> Can we change this description to: "spill the changes of in-progress transactions
> to disk and apply at once after the transaction is committed on the publisher and
> received by the subscriber,"

Changed.

> 2.
>     table is in progress, there will be additional workers for the tables
> -   being synchronized.
> +   being synchronized. Moreover, if the streaming transaction is applied in
> +   parallel, there will be additional workers.
> 
> Do we need this change in the first patch? We skip parallel apply workers from
> view for the first patch. Am, I missing something?

No, I moved this to 0007, which includes parallel apply workers in the view.

> 3.
> I think we would need a catversion bump for parallel apply feature because of
> below change:
> @@ -7913,11 +7913,16 @@ SCRAM-SHA-256$<replaceable><iteration
> count></replaceable>:<replaceable>&l
> 
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> -       <structfield>substream</structfield> <type>bool</type>
> +       <structfield>substream</structfield> <type>char</type>
>        </para>
> 
> Am, I missing something? If not, then I think we can note that in the commit
> message to avoid forgetting it before commit.

Added.

> 
> 4. Kindly change the below comments:
> diff --git a/src/backend/replication/logical/applyparallelworker.c
> b/src/backend/replication/logical/applyparallelworker.c
> index 97f4a3037c..02bb608188 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -9,11 +9,10 @@
>   *
>   * This file contains the code to launch, set up, and teardown a parallel apply
>   * worker which receives the changes from the leader worker and invokes
> routines
> - * to apply those on the subscriber database.
> - *
> - * This file contains routines that are intended to support setting up, using
> - * and tearing down a ParallelApplyWorkerInfo which is required so the leader
> - * worker and parallel apply workers can communicate with each other.
> + * to apply those on the subscriber database. Additionally, this file
> + contains
> + * routines that are intended to support setting up, using, and tearing
> + down a
> + * ParallelApplyWorkerInfo which is required so the leader worker and
> + parallel
> + * apply workers can communicate with each other.
>   *
>   * The parallel apply workers are assigned (if available) as soon as xact's
>   * first stream is received for subscriptions that have set their 'streaming'

Merged.

Besides, I also made the following changes:
1. Added maybe_reread_subscription_info in the leader before assigning the
   transaction to the parallel apply worker (Sawada-san's comments[1])
2. Removed the "out of parallel apply workers" LOG (Sawada-san's comments[1])
3. Improved an elog message (Sawada-san's comments[1]).
4. Moved the testcases from 032_xx into the existing 015_stream.pl, which saves
the initialization time. Since we introduced quite a few testcases in this
patch set, I did this to try to reduce the testing time that increased after
applying these patches.

[1] https://www.postgresql.org/message-id/CAD21AoDWd2pXau%2BpkYWOi87VGYrDD%3DOxakEDgOyUS%2BqV9XuAGA%40mail.gmail.com

Best regards,
Hou zj




On Friday, December 23, 2022 1:52 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> On Thursday, December 22, 2022 8:05 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 11:02 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new patch set which also includes some cosmetic comment
> > > changes.
> > >
> >
> > Few minor comments:
> > =================
> > 1.
> > +       <literal>t</literal> = spill the changes of in-progress
> > transactions to+       disk and apply at once after the transaction is
> > committed on the+       publisher,
> >
> > Can we change this description to: "spill the changes of in-progress
> > transactions to disk and apply at once after the transaction is
> > committed on the publisher and received by the subscriber,"
> 
> Changed.
> 
> > 2.
> >     table is in progress, there will be additional workers for the tables
> > -   being synchronized.
> > +   being synchronized. Moreover, if the streaming transaction is applied in
> > +   parallel, there will be additional workers.
> >
> > Do we need this change in the first patch? We skip parallel apply
> > workers from view for the first patch. Am, I missing something?
> 
> No, I moved this to 0007 which include parallel apply workers in the view.
> 
> > 3.
> > I think we would need a catversion bump for parallel apply feature
> > because of below change:
> > @@ -7913,11 +7913,16 @@ SCRAM-SHA-256$<replaceable><iteration
> > count></replaceable>:<replaceable>&l
> >
> >       <row>
> >        <entry role="catalog_table_entry"><para role="column_definition">
> > -       <structfield>substream</structfield> <type>bool</type>
> > +       <structfield>substream</structfield> <type>char</type>
> >        </para>
> >
> > Am, I missing something? If not, then I think we can note that in the
> > commit message to avoid forgetting it before commit.
> 
> Added.
> 
> >
> > 4. Kindly change the below comments:
> > diff --git a/src/backend/replication/logical/applyparallelworker.c
> > b/src/backend/replication/logical/applyparallelworker.c
> > index 97f4a3037c..02bb608188 100644
> > --- a/src/backend/replication/logical/applyparallelworker.c
> > +++ b/src/backend/replication/logical/applyparallelworker.c
> > @@ -9,11 +9,10 @@
> >   *
> >   * This file contains the code to launch, set up, and teardown a parallel apply
> >   * worker which receives the changes from the leader worker and
> > invokes routines
> > - * to apply those on the subscriber database.
> > - *
> > - * This file contains routines that are intended to support setting
> > up, using
> > - * and tearing down a ParallelApplyWorkerInfo which is required so
> > the leader
> > - * worker and parallel apply workers can communicate with each other.
> > + * to apply those on the subscriber database. Additionally, this file
> > + contains
> > + * routines that are intended to support setting up, using, and
> > + tearing down a
> > + * ParallelApplyWorkerInfo which is required so the leader worker and
> > + parallel
> > + * apply workers can communicate with each other.
> >   *
> >   * The parallel apply workers are assigned (if available) as soon as xact's
> >   * first stream is received for subscriptions that have set their 'streaming'
> 
> Merged.
> 
> Besides, I also did the following changes:
> 1. Added maybe_reread_subscription_info in leader before assigning the
>    transaction to parallel apply worker (Sawada-san's comments[1]) 2. Removed
> the "out of parallel apply workers" LOG ( Sawada-san's comments[1]) 3.
> Improved a elog message (Sawada-san's comments[1]).
> 4. Moved the testcases from 032_xx into existing 015_stream.pl which can save
> the initialization time. Since we introduced quite a few testcases in this patch set,
> so I did this to try to reduce the testing time that increased after applying these
> patches.

I noticed a CFbot failure in one of the new testcases in 015_stream.pl which
comes from the old 032_xx.pl. It's because I slightly adjusted the change size in a
transaction in the last version, which caused the transaction's size not to exceed
the decoding work mem, so the transaction was not applied as a streaming
transaction as expected (it was applied as a non-streaming transaction), which
caused the failure. Attached is the new version patch which fixes this.

Best regards,
Hou zj

On Friday, December 23, 2022 5:20 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> 
> I noticed a CFbot failure in one of the new testcases in 015_stream.pl which
> comes from old 032_xx.pl. It's because I slightly adjusted the change size in a
> transaction in last version which cause the transaction's size not to exceed the
> decoding work mem, so the transaction is not being applied as expected as
> streaming transactions(it is applied as a non-stremaing transaction) which cause
> the failure. Attach the new version patch which fixed this miss.
> 

Since the GUC used to force stream changes has been committed, I removed that
patch from the patch set here and rebased the testcases based on that commit.
Here is the rebased patch set.

Best regards,
Hou zj

Attachment
On Mon, Dec 26, 2022 at 9:52 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, December 23, 2022 5:20 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> Since the GUC used to force stream changes has been committed, I removed that
> patch from the patch set here and rebased the testcases based on that commit.
> Here is the rebased patch set.
>

Few comments on 0002 and 0001 patches
=================================
1.
+    if ($is_parallel)
+    {
+        $node_subscriber->append_conf('postgresql.conf',
+            "log_min_messages = debug1");
+        $node_subscriber->reload;
+    }
+
+    # Check the subscriber log from now on.
+    $offset = -s $node_subscriber->logfile;
+
+    $in .= q{
+    BEGIN;
+    INSERT INTO test_tab SELECT i, md5(i::text) FROM
generate_series(3, 5000) s(i);

How can we guarantee that reload would have taken place before this
next test? I see that 020_archive_status.pl is executing a query to
ensure the reload has been taken into consideration. Can we do the
same?

2. It is not very clear whether converting 017_stream_ddl and
019_stream_subxact_ddl_abort adds much value. They seem to be mostly
testing DDL/DML interaction on the publisher side. We can probably check
the code coverage by removing the parallel version for these two files
and remove them unless they cover some extra code. If we decide to
remove the parallel version for these two files then we can probably add a
comment atop these files indicating why we don't have a version with the
parallel option for these tests.

3.
+# Drop the unique index on the subscriber, now it works.
+$node_subscriber->safe_psql('postgres', "DROP INDEX idx_tab");
+
+# Wait for this streaming transaction to be applied in the apply worker.
 $node_publisher->wait_for_catchup($appname);

 $result =
-  $node_subscriber->safe_psql('postgres',
- "SELECT count(*), count(c), count(d = 999) FROM test_tab");
-is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
+is($result, qq(5001), 'data replicated to subscriber after dropping index');

-# Test the streaming in binary mode
+# Clean up test data from the environment.
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
+$node_publisher->wait_for_catchup($appname);
 $node_subscriber->safe_psql('postgres',
- "ALTER SUBSCRIPTION tap_sub SET (binary = on)");
+ "CREATE UNIQUE INDEX idx_tab on test_tab_2(a)");

What is the need to first drop the index and then recreate it a few lines later?

4. Attached, find some comment improvements atop v67-0002* patch.
Similar comments need to be changed in other test files.

5. Attached, find some comment improvements atop v67-0001* patch.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Dec 26, 2022 at 1:22 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, December 23, 2022 5:20 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> >
> > I noticed a CFbot failure in one of the new testcases in 015_stream.pl which
> > comes from old 032_xx.pl. It's because I slightly adjusted the change size in a
> > transaction in last version which cause the transaction's size not to exceed the
> > decoding work mem, so the transaction is not being applied as expected as
> > streaming transactions(it is applied as a non-stremaing transaction) which cause
> > the failure. Attach the new version patch which fixed this miss.
> >
>
> Since the GUC used to force stream changes has been committed, I removed that
> patch from the patch set here and rebased the testcases based on that commit.
> Here is the rebased patch set.
>

Thank you for updating the patches. Here are some comments for 0001
and 0002 patches:


I think it'd be better to write logs when the leader enters the
serialization mode. It would be helpful for investigating issues.

---
+        if (!pa_can_start(xid))
+                return;
+
+        /* First time through, initialize parallel apply worker state
hashtable. */
+        if (!ParallelApplyTxnHash)
+        {
+                HASHCTL                ctl;
+
+                MemSet(&ctl, 0, sizeof(ctl));
+                ctl.keysize = sizeof(TransactionId);
+                ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
+                ctl.hcxt = ApplyContext;
+
+                ParallelApplyTxnHash = hash_create("logical
replication parallel apply workershash",
+
             16, &ctl,
+
             HASH_ELEM |HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        /*
+         * It's necessary to reread the subscription information
before assigning
+         * the transaction to a parallel apply worker. Otherwise, the
leader may
+         * not be able to reread the subscription information if streaming
+         * transactions keep coming and are handled by parallel apply workers.
+         */
+        maybe_reread_subscription();

pa_can_start() checks whether the skiplsn is invalid or not, and
then maybe_reread_subscription() could update the skiplsn to a valid
value. As the comment in pa_can_start() says, it won't work. I think
we should call maybe_reread_subscription() in
apply_handle_stream_start() before calling pa_allocate_worker().

---
+static inline bool
+am_leader_apply_worker(void)
+{
+        return (!OidIsValid(MyLogicalRepWorker->relid) &&
+                        !isParallelApplyWorker(MyLogicalRepWorker));
+}

How about using !am_tablesync_worker() instead of
!OidIsValid(MyLogicalRepWorker->relid) for better readability?
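
If so, the helper would presumably end up looking something like the
following (just a sketch of the suggested change, assuming
am_tablesync_worker() keeps its current relid-based definition):

static inline bool
am_leader_apply_worker(void)
{
        /* Neither a tablesync worker nor a parallel apply worker. */
        return (!am_tablesync_worker() &&
                        !isParallelApplyWorker(MyLogicalRepWorker));
}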

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Dec 26, 2022 at 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> ---
> +        if (!pa_can_start(xid))
> +                return;
> +
> +        /* First time through, initialize parallel apply worker state
> hashtable. */
> +        if (!ParallelApplyTxnHash)
> +        {
> +                HASHCTL                ctl;
> +
> +                MemSet(&ctl, 0, sizeof(ctl));
> +                ctl.keysize = sizeof(TransactionId);
> +                ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
> +                ctl.hcxt = ApplyContext;
> +
> +                ParallelApplyTxnHash = hash_create("logical
> replication parallel apply workershash",
> +
>              16, &ctl,
> +
>              HASH_ELEM |HASH_BLOBS | HASH_CONTEXT);
> +        }
> +
> +        /*
> +         * It's necessary to reread the subscription information
> before assigning
> +         * the transaction to a parallel apply worker. Otherwise, the
> leader may
> +         * not be able to reread the subscription information if streaming
> +         * transactions keep coming and are handled by parallel apply workers.
> +         */
> +        maybe_reread_subscription();
>
> pa_can_start() checks if the skiplsn is an invalid xid or not, and
> then maybe_reread_subscription() could update the skiplsn to a valid
> value. As the comments in pa_can_start() says, it won't work. I think
> we should call maybe_reread_subscription() in
> apply_handle_stream_start() before calling pa_allocate_worker().
>

But I think a similar thing can happen when we start the worker and
then before the transaction ends, we do maybe_reread_subscription(). I
think we should try to call maybe_reread_subscription() when we are
reasonably sure that we are going to enter parallel mode, otherwise,
anyway, it will be later called by the leader worker.

-- 
With Regards,
Amit Kapila.



On Mon, Dec 26, 2022 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>


In the commit message, there is a statement like this

"However, if the leader apply worker times out while attempting to
send a message to the
parallel apply worker, it will switch to "partial serialize" mode -  in this
mode the leader serializes all remaining changes to a file and notifies the
parallel apply workers to read and apply them at the end of the transaction."

I think it is a good idea to serialize the changes to the file in this
case to avoid deadlocks, but why does the parallel worker need to wait
until the transaction commits to read the file?  I mean, we could
switch to the serialize state and make the parallel worker pull changes
from the file, and once the parallel worker has caught up with the
changes it could switch the state back to "shared memory", and then
the apply worker could again start sending through shared memory.

I think streaming transactions are generally large, and it is possible
that the shared memory queue gets full because of a lot of changes for
a particular transaction; but later, when the load switches to the other
transactions, it would be quite common for the worker to catch up
with the changes, and then it would be better to again take advantage of
using memory.  Otherwise, in this case, we are just wasting resources
(worker/shared memory queue) while still writing to the file.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Mon, Dec 26, 2022 at 7:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> In the commit message, there is a statement like this
>
> "However, if the leader apply worker times out while attempting to
> send a message to the
> parallel apply worker, it will switch to "partial serialize" mode -  in this
> mode the leader serializes all remaining changes to a file and notifies the
> parallel apply workers to read and apply them at the end of the transaction."
>
> I think it is a good idea to serialize the change to the file in this
> case to avoid deadlocks, but why does the parallel worker need to wait
> till the transaction commits to reading the file?  I mean we can
> switch the serialize state and make a parallel worker pull changes
> from the file and if the parallel worker has caught up with the
> changes then it can again change the state to "share memory" and now
> the apply worker can again start sending through shared memory.
>
> I think generally streaming transactions are large and it is possible
> that the shared memory queue gets full because of a lot of changes for
> a particular transaction but later when the load switches to the other
> transactions then it would be quite common for the worker to catch up
> with the changes then it better to again take advantage of using
> memory.  Otherwise, in this case, we are just wasting resources
> (worker/shared memory queue) but still writing in the file.
>

Note that there is a certain threshold timeout for which we wait
before switching to serialize mode and normally it happens only when
PA starts waiting on some lock acquired by the backend. Now, apart
from that even if we decide to switch modes, the current BufFile
mechanism doesn't have a good way for that. It doesn't allow two
processes to open the same buffile at the same time which means we
need to maintain multiple files to achieve the mode where we can
switch back from serialize mode. We cannot let LA wait for PA to close
the file as that could introduce another kind of deadlock. For
details, see the discussion in the email [1]. The other problem is
that we have no way to deal with partially sent data via a shared
memory queue. Say, if we timeout while sending the data, we have to
resend the same message until it succeeds which will be tricky because
we can't keep retrying as that can lead to deadlock. I think if we try
to build this new mode, it will be a lot of effort without equivalent
returns. In common cases, we didn't see the leader time out and switch to
serialize mode. It mostly happens in cases where PA starts to wait for a
lock acquired by another backend, or the machine is slow enough that it
cannot keep up with the number of parallel apply workers. So, it doesn't seem worth
adding more complexity to the first version but we don't rule out the
possibility of the same in the future if we really see such cases are
common.

[1] - https://www.postgresql.org/message-id/CAD21AoDScLvLT8JBfu5WaGCPQs_qhxsybMT%2BsMXJ%3DQrDMTyr9w%40mail.gmail.com
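
Just to make the current behavior concrete, the fallback being discussed is
roughly along these lines (a simplified sketch only; SHM_SEND_TIMEOUT_MS and
the wait event are placeholders, and the real code in the patch may differ):

static bool
pa_send_data_sketch(shm_mq_handle *mqh, Size nbytes, const void *data)
{
        TimestampTz start_time = 0;

        for (;;)
        {
                shm_mq_result result = shm_mq_send(mqh, nbytes, data, true, true);

                if (result == SHM_MQ_SUCCESS)
                        return true;            /* handed off to the parallel apply worker */
                else if (result == SHM_MQ_DETACHED)
                        ereport(ERROR,
                                        (errcode(ERRCODE_CONNECTION_FAILURE),
                                         errmsg("could not send data to shared-memory queue")));

                /* Queue is full; start (or check) the timeout clock. */
                if (start_time == 0)
                        start_time = GetCurrentTimestamp();
                else if (TimestampDifferenceExceeds(start_time, GetCurrentTimestamp(),
                                                                                        SHM_SEND_TIMEOUT_MS))
                        return false;           /* caller switches to partial serialize mode */

                /* Wait a little before retrying, honoring interrupts. */
                (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                                                 10L, WAIT_EVENT_MQ_SEND);
                ResetLatch(MyLatch);
                CHECK_FOR_INTERRUPTS();
        }
}

Once it returns false, the leader writes the remaining changes of that
transaction to a file and the parallel apply worker reads them at the end of
the transaction, as described in the commit message.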

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Dec 26, 2022 19:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments on 0002 and 0001 patches
> =================================

Thanks for your comments.

> 1.
> +    if ($is_parallel)
> +    {
> +        $node_subscriber->append_conf('postgresql.conf',
> +            "log_min_messages = debug1");
> +        $node_subscriber->reload;
> +    }
> +
> +    # Check the subscriber log from now on.
> +    $offset = -s $node_subscriber->logfile;
> +
> +    $in .= q{
> +    BEGIN;
> +    INSERT INTO test_tab SELECT i, md5(i::text) FROM
> generate_series(3, 5000) s(i);
> 
> How can we guarantee that reload would have taken place before this
> next test? I see that 020_archive_status.pl is executing a query to
> ensure the reload has been taken into consideration. Can we do the
> same?

Agree. Improved as suggested.

> 2. It is not very clear whether converting 017_stream_ddl and
> 019_stream_subxact_ddl_abort adds much value. They seem to be mostly
> testing DDL/DML interaction of publisher side. We can probably check
> the code coverage by removing the parallel version for these two files
> and remove them unless it covers some extra code. If we decide to
> remove parallel version for these two files then we can probably add a
> comment atop these files indicating why we don't have a version that
> parallel option for these tests.

I have checked this and removed the parallel version for these two files.
Also added some comments atop these two test files to explain this.

> 3.
> +# Drop the unique index on the subscriber, now it works.
> +$node_subscriber->safe_psql('postgres', "DROP INDEX idx_tab");
> +
> +# Wait for this streaming transaction to be applied in the apply worker.
>  $node_publisher->wait_for_catchup($appname);
> 
>  $result =
> -  $node_subscriber->safe_psql('postgres',
> - "SELECT count(*), count(c), count(d = 999) FROM test_tab");
> -is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM
> test_tab_2");
> +is($result, qq(5001), 'data replicated to subscriber after dropping index');
> 
> -# Test the streaming in binary mode
> +# Clean up test data from the environment.
> +$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
> +$node_publisher->wait_for_catchup($appname);
>  $node_subscriber->safe_psql('postgres',
> - "ALTER SUBSCRIPTION tap_sub SET (binary = on)");
> + "CREATE UNIQUE INDEX idx_tab on test_tab_2(a)");
> 
> What is the need to first Drop the index and then recreate it after a few lines?

Since we want the two transactions to complete normally without conflicts due
to the unique index, we temporarily delete the index.
I added some new comments to explain this.

> 4. Attached, find some comment improvements atop v67-0002* patch.
> Similar comments need to be changed in other test files.

Thanks, I have checked and merged them. I also changed similar comments in
other test files.

> 5. Attached, find some comment improvements atop v67-0001* patch.

Thanks, I have checked and merged them.

Attached is the new version patch which addresses all the above comments and part of
the comments from [1], except one comment that is still being discussed.

[1] - https://www.postgresql.org/message-id/CAD21AoDvT%2BTv3auBBShk19EkKLj6ByQtnAzfMjh49BhyT7f4Nw%40mail.gmail.com

Regards,
Wang wei

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Dec 26, 2022 21:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patches. Here are some comments for 0001
> and 0002 patches:

Thanks for your comments.

> I think it'd be better to write logs when the leader enters the
> serialization mode. It would be helpful for investigating issues.

Agree. Added the log about this in the function pa_send_data().
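
For reference, it is conceptually along these lines (a sketch only; the
exact wording, log level, and placement inside pa_send_data() may differ in
the attached patch, and "xid" here just stands for the remote transaction id):

        /* In pa_send_data(), once the send to the queue has timed out: */
        ereport(LOG,
                        (errmsg("logical replication apply worker will serialize the remaining changes of remote transaction %u to a file",
                                        xid)));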

> ---
> +static inline bool
> +am_leader_apply_worker(void)
> +{
> +        return (!OidIsValid(MyLogicalRepWorker->relid) &&
> +                        !isParallelApplyWorker(MyLogicalRepWorker));
> +}
> 
> How about using !am_tablesync_worker() instead of
> !OidIsValid(MyLogicalRepWorker->relid) for better readability?

Agree. Improved this as suggested.

The new patch set was attached in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275B61076717E4CE9E079D19EED9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

On Tue, Dec 27, 2022 at 9:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 26, 2022 at 7:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > In the commit message, there is a statement like this
> >
> > "However, if the leader apply worker times out while attempting to
> > send a message to the
> > parallel apply worker, it will switch to "partial serialize" mode -  in this
> > mode the leader serializes all remaining changes to a file and notifies the
> > parallel apply workers to read and apply them at the end of the transaction."
> >
> > I think it is a good idea to serialize the change to the file in this
> > case to avoid deadlocks, but why does the parallel worker need to wait
> > till the transaction commits to reading the file?  I mean we can
> > switch the serialize state and make a parallel worker pull changes
> > from the file and if the parallel worker has caught up with the
> > changes then it can again change the state to "share memory" and now
> > the apply worker can again start sending through shared memory.
> >
> > I think generally streaming transactions are large and it is possible
> > that the shared memory queue gets full because of a lot of changes for
> > a particular transaction but later when the load switches to the other
> > transactions then it would be quite common for the worker to catch up
> > with the changes then it better to again take advantage of using
> > memory.  Otherwise, in this case, we are just wasting resources
> > (worker/shared memory queue) but still writing in the file.
> >
>
> Note that there is a certain threshold timeout for which we wait
> before switching to serialize mode and normally it happens only when
> PA starts waiting on some lock acquired by the backend. Now, apart
> from that even if we decide to switch modes, the current BufFile
> mechanism doesn't have a good way for that. It doesn't allow two
> processes to open the same buffile at the same time which means we
> need to maintain multiple files to achieve the mode where we can
> switch back from serialize mode. We cannot let LA wait for PA to close
> the file as that could introduce another kind of deadlock. For
> details, see the discussion in the email [1]. The other problem is
> that we have no way to deal with partially sent data via a shared
> memory queue. Say, if we timeout while sending the data, we have to
> resend the same message until it succeeds which will be tricky because
> we can't keep retrying as that can lead to deadlock. I think if we try
> to build this new mode, it will be a lot of effort without equivalent
> returns. In common cases, we didn't see that we time out and switch to
> serialize mode. It is mostly in cases where PA starts to wait for the
> lock acquired by other backend or the machine is slow enough to deal
> with the number of parallel apply workers. So, it doesn't seem worth
> adding more complexity to the first version but we don't rule out the
> possibility of the same in the future if we really see such cases are
> common.
>
> [1] - https://www.postgresql.org/message-id/CAD21AoDScLvLT8JBfu5WaGCPQs_qhxsybMT%2BsMXJ%3DQrDMTyr9w%40mail.gmail.com

Okay, I see.  And once we change to serialize mode we can't release
the worker either, because we have already applied partial changes
under some transaction from a PA, so we cannot apply the remaining
changes from the LA.  I understand it might introduce a lot of design
complexity to change it back to parallel apply mode, but my only worry
is that in such cases we will be holding on to the parallel worker
just to wait until commit to read from the spool file.  But as you said,
it should not be a very common case, so maybe this is fine.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Dec 27, 2022 at 10:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Dec 27, 2022 at 9:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 26, 2022 at 7:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > In the commit message, there is a statement like this
> > >
> > > "However, if the leader apply worker times out while attempting to
> > > send a message to the
> > > parallel apply worker, it will switch to "partial serialize" mode -  in this
> > > mode the leader serializes all remaining changes to a file and notifies the
> > > parallel apply workers to read and apply them at the end of the transaction."
> > >
> > > I think it is a good idea to serialize the change to the file in this
> > > case to avoid deadlocks, but why does the parallel worker need to wait
> > > till the transaction commits to reading the file?  I mean we can
> > > switch the serialize state and make a parallel worker pull changes
> > > from the file and if the parallel worker has caught up with the
> > > changes then it can again change the state to "share memory" and now
> > > the apply worker can again start sending through shared memory.
> > >
> > > I think generally streaming transactions are large and it is possible
> > > that the shared memory queue gets full because of a lot of changes for
> > > a particular transaction but later when the load switches to the other
> > > transactions then it would be quite common for the worker to catch up
> > > with the changes then it better to again take advantage of using
> > > memory.  Otherwise, in this case, we are just wasting resources
> > > (worker/shared memory queue) but still writing in the file.
> > >
> >
> > Note that there is a certain threshold timeout for which we wait
> > before switching to serialize mode and normally it happens only when
> > PA starts waiting on some lock acquired by the backend. Now, apart
> > from that even if we decide to switch modes, the current BufFile
> > mechanism doesn't have a good way for that. It doesn't allow two
> > processes to open the same buffile at the same time which means we
> > need to maintain multiple files to achieve the mode where we can
> > switch back from serialize mode. We cannot let LA wait for PA to close
> > the file as that could introduce another kind of deadlock. For
> > details, see the discussion in the email [1]. The other problem is
> > that we have no way to deal with partially sent data via a shared
> > memory queue. Say, if we timeout while sending the data, we have to
> > resend the same message until it succeeds which will be tricky because
> > we can't keep retrying as that can lead to deadlock. I think if we try
> > to build this new mode, it will be a lot of effort without equivalent
> > returns. In common cases, we didn't see that we time out and switch to
> > serialize mode. It is mostly in cases where PA starts to wait for the
> > lock acquired by other backend or the machine is slow enough to deal
> > with the number of parallel apply workers. So, it doesn't seem worth
> > adding more complexity to the first version but we don't rule out the
> > possibility of the same in the future if we really see such cases are
> > common.
> >
> > [1] -
https://www.postgresql.org/message-id/CAD21AoDScLvLT8JBfu5WaGCPQs_qhxsybMT%2BsMXJ%3DQrDMTyr9w%40mail.gmail.com
>
> Okay, I see.  And once we change to serialize mode we can't release
> the worker as well because we have already applied partial changes
> under some transaction from a PA so we can not apply remaining from
> the LA.  I understand it might introduce a lot of complex design to
> change it back to parallel apply mode but my only worry is that in
> such cases we will be holding on to the parallel worker just to wait
> till commit to reading from the spool file.  But as you said it should
> not be very common case so maybe this is fine.
>

Right, and as said previously, if required (which is not clear at this
stage) we can develop it in a later version as well.

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Dec 26, 2022 at 10:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 26, 2022 at 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > ---
> > +        if (!pa_can_start(xid))
> > +                return;
> > +
> > +        /* First time through, initialize parallel apply worker state
> > hashtable. */
> > +        if (!ParallelApplyTxnHash)
> > +        {
> > +                HASHCTL                ctl;
> > +
> > +                MemSet(&ctl, 0, sizeof(ctl));
> > +                ctl.keysize = sizeof(TransactionId);
> > +                ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
> > +                ctl.hcxt = ApplyContext;
> > +
> > +                ParallelApplyTxnHash = hash_create("logical
> > replication parallel apply workershash",
> > +
> >              16, &ctl,
> > +
> >              HASH_ELEM |HASH_BLOBS | HASH_CONTEXT);
> > +        }
> > +
> > +        /*
> > +         * It's necessary to reread the subscription information
> > before assigning
> > +         * the transaction to a parallel apply worker. Otherwise, the
> > leader may
> > +         * not be able to reread the subscription information if streaming
> > +         * transactions keep coming and are handled by parallel apply workers.
> > +         */
> > +        maybe_reread_subscription();
> >
> > pa_can_start() checks if the skiplsn is an invalid xid or not, and
> > then maybe_reread_subscription() could update the skiplsn to a valid
> > value. As the comments in pa_can_start() says, it won't work. I think
> > we should call maybe_reread_subscription() in
> > apply_handle_stream_start() before calling pa_allocate_worker().
> >
>
> But I think a similar thing can happen when we start the worker and
> then before the transaction ends, we do maybe_reread_subscription().

Where do we do maybe_reread_subscription() in this case? IIUC if the
leader sends all changes to the worker, there is no chance for the
leader to do maybe_reread_subscription except for when waiting for the
input. On reflection, adding maybe_reread_subscription() to
apply_handle_stream_start() adds one extra call of it, so it's not
good. Alternatively, we can do that in pa_can_start() before checking
the skiplsn. I think we do a similar thing in AllTablesyncsReady() --
update the information before the check if necessary.

> I think we should try to call maybe_reread_subscription() when we are
> reasonably sure that we are going to enter parallel mode, otherwise,
> anyway, it will be later called by the leader worker.

It isn't a big problem even if we update the skiplsn after launching a
worker since we will skip the transaction the next time. But it would
be more consistent with the current behavior. As I mentioned above,
doing it in pa_can_start() seems to be reasonable to me. What do you
think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Dec 27, 2022 at 11:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 26, 2022 at 10:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 26, 2022 at 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > ---
> > > +        if (!pa_can_start(xid))
> > > +                return;
> > > +
> > > +        /* First time through, initialize parallel apply worker state
> > > hashtable. */
> > > +        if (!ParallelApplyTxnHash)
> > > +        {
> > > +                HASHCTL                ctl;
> > > +
> > > +                MemSet(&ctl, 0, sizeof(ctl));
> > > +                ctl.keysize = sizeof(TransactionId);
> > > +                ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
> > > +                ctl.hcxt = ApplyContext;
> > > +
> > > +                ParallelApplyTxnHash = hash_create("logical
> > > replication parallel apply workershash",
> > > +
> > >              16, &ctl,
> > > +
> > >              HASH_ELEM |HASH_BLOBS | HASH_CONTEXT);
> > > +        }
> > > +
> > > +        /*
> > > +         * It's necessary to reread the subscription information
> > > before assigning
> > > +         * the transaction to a parallel apply worker. Otherwise, the
> > > leader may
> > > +         * not be able to reread the subscription information if streaming
> > > +         * transactions keep coming and are handled by parallel apply workers.
> > > +         */
> > > +        maybe_reread_subscription();
> > >
> > > pa_can_start() checks if the skiplsn is an invalid xid or not, and
> > > then maybe_reread_subscription() could update the skiplsn to a valid
> > > value. As the comments in pa_can_start() says, it won't work. I think
> > > we should call maybe_reread_subscription() in
> > > apply_handle_stream_start() before calling pa_allocate_worker().
> > >
> >
> > But I think a similar thing can happen when we start the worker and
> > then before the transaction ends, we do maybe_reread_subscription().
>
> Where do we do maybe_reread_subscription() in this case? IIUC if the
> leader sends all changes to the worker, there is no chance for the
> leader to do maybe_reread_subscription except for when waiting for the
> input.

Yes, this is the point where it can happen. It can happen when there
is some delay between different streaming chunks.

> On reflection, adding maybe_reread_subscription() to
> apply_handle_stream_start() adds one extra call of it so it's not
> good. Alternatively, we can do that in pa_can_start() before checking
> the skiplsn. I think we do a similar thing in AllTablesyncsRead() --
> update the information before the check if necessary.
>
> > I think we should try to call maybe_reread_subscription() when we are
> > reasonably sure that we are going to enter parallel mode, otherwise,
> > anyway, it will be later called by the leader worker.
>
> It isn't a big problem even if we update the skiplsn after launching a
> worker since we will skip the transaction the next time. But it would
> be more consistent with the current behavior. As I mentioned above,
> doing it in pa_can_start() seems to be reasonable to me. What do you
> think?
>

Okay, we can do it in pa_can_start but then let's do it before we
check the parallel_apply flag as that can also be changed if the
streaming mode is changed. Please see the changes in the attached
patch which is atop the 0001 and 0002 patches. I have made a few
comment improvements as well.
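
So the ordering would be roughly as below (illustrative only; the field and
constant names such as MySubscription->stream and LOGICALREP_STREAM_PARALLEL
follow my reading of the patch set and may differ in the final code):

static bool
pa_can_start(void)
{
        /*
         * Refresh the subscription info first so that the checks below see
         * the latest streaming option and skiplsn.
         */
        maybe_reread_subscription();

        /* Only consider a parallel apply worker if the user asked for it. */
        if (MySubscription->stream != LOGICALREP_STREAM_PARALLEL)
                return false;

        /*
         * Don't assign a parallel apply worker to a transaction that the
         * user has asked to skip; the leader handles skipping itself.
         */
        if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
                return false;

        /* ... remaining checks elided ... */
        return true;
}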

-- 
With Regards,
Amit Kapila.

Attachment
On Tue, Dec 27, 2022 at 10:24 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new version patch which addressed all above comments and part of
> comments from [1] except one comment that are being discussed.
>

1.
+# Test that the deadlock is detected among leader and parallel apply workers.
+
+$node_subscriber->append_conf('postgresql.conf', "deadlock_timeout = 1ms");
+$node_subscriber->reload;
+

A. I see that the other existing tests have deadlock_timeout set as
10ms, 100ms, 100s, etc. Is there a reason to keep it so low here? Shall
we keep it as 10ms?
B. /among leader/among the leader

2. Can we leave having tests in 022_twophase_cascade to be covered by
parallel mode? The two-phase and parallel apply will be covered by
023_twophase_stream, so not sure if we get any extra coverage by
022_twophase_cascade.

3. Let's combine 0001 and 0002 as both have got reviewed independently.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tue, Dec 27, 2022 19:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 27, 2022 at 10:24 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > Attach the new version patch which addressed all above comments and part
> of
> > comments from [1] except one comment that are being discussed.
> >

Thanks for your comments.

> 1.
> +# Test that the deadlock is detected among leader and parallel apply workers.
> +
> +$node_subscriber->append_conf('postgresql.conf', "deadlock_timeout =
> 1ms");
> +$node_subscriber->reload;
> +
> 
> A. I see that the other existing tests have deadlock_timeout set as
> 10ms, 100ms, 100s, etc. Is there a reason to keep so low here? Shall
> we keep it as 10ms?

No, I think you are right. Keep it as 10ms.

> B. /among leader/among the leader

Fixed.

> 2. Can we leave having tests in 022_twophase_cascade to be covered by
> parallel mode? The two-phase and parallel apply will be covered by
> 023_twophase_stream, so not sure if we get any extra coverage by
> 022_twophase_cascade.

Compared with 023_twophase_stream, there is "rollback a subtransaction" in
022_twophase_cascade, but since this part of the code can be covered by tests
in 018_stream_subxact_abort, I think we can remove the parallel version for
022_twophase_cascade. So I reverted the changes in 022_twophase_cascade for
parallel mode and added some comments atop this file.

> 3. Let's combine 0001 and 0002 as both have got reviewed independently.

Combined them into one patch.

And I also checked and merged the diff patch in [1].

Besides, also fixed the below problem:
In previous versions, we didn't wait for STREAM_ABORT transactions to complete.
But in extreme cases, this can cause problems if the STREAM_ABORT transaction
doesn't complete and xid wraparound occurs on the publisher-side. Fixed this by
waiting for the STREAM_ABORT transaction to complete.

Attach the new patch set.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2B5gTjHzWovkbUj%2BxsQ9yO9jVcKsS-3c5ZXLFy8JmfT%3DA%40mail.gmail.com

Regards,
Wang wei

Attachment
On Wed, Dec 28, 2022 at 10:09 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>

I have made a number of changes in the comments, removed extra list
copy in pa_launch_parallel_worker(), and removed unnecessary include
in worker. Please see the attached and let me know what you think.
Feel free to rebase and send the remaining patches.

-- 
With Regards,
Amit Kapila.

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Dec 29, 2022 21:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 28, 2022 at 10:09 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> 
> I have made a number of changes in the comments, removed extra list
> copy in pa_launch_parallel_worker(), and removed unnecessary include
> in worker. Please see the attached and let me know what you think.
> Feel free to rebase and send the remaining patches.

Thanks for your improvement.

I've checked it and it looks good to me.
Rebased the other patches and ran pgindent on the patch set.

Attach the new patch set.

Regards,
Wang wei


Attachment
On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> I've checked it and it looks good to me.
> Rebased the other patches and ran the pgident for the patch set.
>
> Attach the new patch set.
>

I have added a few DEBUG messages and changed a few comments in the
0001 patch. With that v71-0001* looks good to me and I'll commit it
later this week (by Thursday or Friday) unless there are any major
comments or objections.

-- 
With Regards,
Amit Kapila.

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Jan 2, 2023 at 18:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > I've checked it and it looks good to me.
> > Rebased the other patches and ran the pgident for the patch set.
> >
> > Attach the new patch set.
> >
> 
> I have added a few DEBUG messages and changed a few comments in the
> 0001 patch. With that v71-0001* looks good to me and I'll commit it
> later this week (by Thursday or Friday) unless there are any major
> comments or objections.

Thanks for your improvement.

Rebased the patch set because of the new change in HEAD (c8e1ba7).
Attached is the new patch set.

Regards,
Wang wei

Attachment
On Tue, Jan 3, 2023 at 11:10 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 2, 2023 at 18:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > I've checked it and it looks good to me.
> > > Rebased the other patches and ran the pgident for the patch set.
> > >
> > > Attach the new patch set.
> > >
> >
> > I have added a few DEBUG messages and changed a few comments in the
> > 0001 patch. With that v71-0001* looks good to me and I'll commit it
> > later this week (by Thursday or Friday) unless there are any major
> > comments or objections.
>
> Thanks for your improvement.
>
> Rebased the patch set because the new change in HEAD (c8e1ba7).
> Attach the new patch set.
>
> Regards,
> Wang wei

Hi,
In continuation with [1] and [2], I did some performance testing on
v70-0001 patch.

This test used synchronous logical replication and compared SQL
execution times before and after applying the patch.

The following cases are tested by varying logical_decoding_work_mem:
a) Bulk insert.
b) Bulk delete.
c) Bulk update.
d) Rollback to savepoint (different percentages of changes in the
transaction are rolled back).

The tests are performed ten times, and the average of the middle eight is taken.

The scripts are the same as before [1]. The scripts for additional
update and delete testing are attached.

The results are as follows:

RESULT - bulk insert (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         34.475      34.222      34.400
patched                      20.168      20.181      20.510
Compare with HEAD            -41.49%     -41.029%    -40.377%


RESULT - bulk delete (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         40.286      41.312      41.312
patched                      23.749      23.759      23.480
Compare with HEAD            -41.04%     -42.48%     -43.16%


RESULT - bulk update (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         63.650      65.260      65.459
patched                      46.692      46.275      48.281
Compare with HEAD            -26.64%     -29.09%     -26.24%


RESULT - rollback 10% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         33.386      33.213      31.990
patched                      20.540      19.295      18.139
Compare with HEAD            -38.47%     -41.90%     -43.29%


RESULT - rollback 20% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         32.150      31.871      30.825
patched                      19.331      19.366      18.285
Compare with HEAD            -39.87%     -39.23%     -40.68%


RESULT - rollback 30% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         28.611      30.139      29.433
patched                      19.632      19.838      18.374
Compare with HEAD            -31.38%     -34.17%     -37.57%


RESULT - rollback 50% (5kk)
---------------------------------------------------------------
logical_decoding_work_mem    64kB        256kB       64MB
HEAD                         27.410      27.167      25.990
patched                      19.982      18.749      18.048
Compare with HEAD            -27.099%    -30.98%     -30.55%

(if "Compare with HEAD" is a positive number, it means worse than
HEAD; if it is a negative number, it means better than HEAD.)

Summary:
Update shows a 26-29% improvement, while insert and delete show ~40% improvement.
In the case of rollback, the improvement is roughly between 27% and 42%.
The improvement slightly decreases with larger amounts of data being
rolled back.


[1]
https://www.postgresql.org/message-id/OSZPR01MB63103AA97349BBB858E27DEAFD499%40OSZPR01MB6310.jpnprd01.prod.outlook.com
[2]
https://www.postgresql.org/message-id/OSZPR01MB6310174063C9144D2081F657FDE09%40OSZPR01MB6310.jpnprd01.prod.outlook.com

thanks
Shveta

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Tue, Jan 3, 2023 at 2:40 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 2, 2023 at 18:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > I've checked it and it looks good to me.
> > > Rebased the other patches and ran the pgident for the patch set.
> > >
> > > Attach the new patch set.
> > >
> >
> > I have added a few DEBUG messages and changed a few comments in the
> > 0001 patch. With that v71-0001* looks good to me and I'll commit it
> > later this week (by Thursday or Friday) unless there are any major
> > comments or objections.
>
> Thanks for your improvement.
>
> Rebased the patch set because the new change in HEAD (c8e1ba7).
> Attach the new patch set.

There are some unused parameters in v72 patches:

+static bool
+pa_can_start(TransactionId xid)
+{
+        Assert(TransactionIdIsValid(xid));

'xid' is used only for the assertion check but I don't think it's necessary.

---
+/*
+ * Make sure the leader apply worker tries to read from our error
queue one more
+ * time. This guards against the case where we exit uncleanly without sending
+ * an ErrorResponse, for example because some code calls proc_exit directly.
+ */
+static void
+pa_shutdown(int code, Datum arg)

Similarly, we don't use 'code' here.

---
+/*
+ * Handle a single protocol message received from a single parallel apply
+ * worker.
+ */
+static void
+HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo msg)

In addition, the same is true for 'winfo'.

The rest looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Jan 4, 2023 at 2:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 3, 2023 at 2:40 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jan 2, 2023 at 18:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > I've checked it and it looks good to me.
> > > > Rebased the other patches and ran the pgident for the patch set.
> > > >
> > > > Attach the new patch set.
> > > >
> > >
> > > I have added a few DEBUG messages and changed a few comments in the
> > > 0001 patch. With that v71-0001* looks good to me and I'll commit it
> > > later this week (by Thursday or Friday) unless there are any major
> > > comments or objections.
> >
> > Thanks for your improvement.
> >
> > Rebased the patch set because the new change in HEAD (c8e1ba7).
> > Attach the new patch set.
>
> There are some unused parameters in v72 patches:
>
> ---
> +/*
> + * Make sure the leader apply worker tries to read from our error
> queue one more
> + * time. This guards against the case where we exit uncleanly without sending
> + * an ErrorResponse, for example because some code calls proc_exit directly.
> + */
> +static void
> +pa_shutdown(int code, Datum arg)
>
> Similarly, we don't use 'code' here.

This is necessary. Sorry for the noise.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wed, Jan 4, 2023 at 13:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jan 3, 2023 at 2:40 PM wangw.fnst@fujitsu.com 
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jan 2, 2023 at 18:54 PM Amit Kapila 
> > <amit.kapila16@gmail.com>
> wrote:
> > > On Fri, Dec 30, 2022 at 3:55 PM wangw.fnst@fujitsu.com 
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > I've checked it and it looks good to me.
> > > > Rebased the other patches and ran the pgident for the patch set.
> > > >
> > > > Attach the new patch set.
> > > >
> > >
> > > I have added a few DEBUG messages and changed a few comments in 
> > > the
> > > 0001 patch. With that v71-0001* looks good to me and I'll commit 
> > > it later this week (by Thursday or Friday) unless there are any 
> > > major comments or objections.
> >
> > Thanks for your improvement.
> >
> > Rebased the patch set because the new change in HEAD (c8e1ba7).
> > Attach the new patch set.
> 
> There are some unused parameters in v72 patches:

Thanks for your comments!

> +static bool
> +pa_can_start(TransactionId xid)
> +{
> +        Assert(TransactionIdIsValid(xid));
> 
> 'xid' is used only for the assertion check but I don't think it's necessary.

Agree. Removed this check.

> ---
> +/*
> + * Handle a single protocol message received from a single parallel 
> +apply
> + * worker.
> + */
> +static void
> +HandleParallelApplyMessage(ParallelApplyWorkerInfo *winfo, StringInfo 
> +msg)
> 
> In addition, the same is true for 'winfo'.

Agree. Removed this parameter.

Attached is the new patch set.
Apart from addressing Sawada-San's comments, I also made some other minor
changes in the patch:

* Adjusted a testcase about crash restart in 023_twophase_stream.pl; I
  skipped the check for the DEBUG message as it might not be output if the crash
  happens before that point.
* Adjusted the code in pg_lock_status() to make the fields of the
  applytransaction lock display in more appropriate places.
* Added a comment to explain why we unlock the transaction before aborting the
  transaction in the parallel apply worker.

Best regards,
Hou zj

Attachment
On Wed, Jan 4, 2023 at 4:25 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

> Attach the new patch set.
> Apart from addressing Sawada-San's comments, I also did some other minor
> changes in the patch:

I have done a high-level review of 0001, and later I will do a
detailed review while reading through the patch. I think some
of the comments need some changes.

1.
+ The deadlock can happen in
+ * the following ways:
+ *

+ * 4) Lock types
+ *
+ * Both the stream lock and the transaction lock mentioned above are
+ * session-level locks because both locks could be acquired outside the
+ * transaction, and the stream lock in the leader needs to persist across
+ * transaction boundaries i.e. until the end of the streaming transaction.

I think "Lock types" should not be listed with the number 4),
because points 1), 2) and 3) explain the ways deadlocks can happen,
whereas 4) doesn't fall under that category.


2.
+ * Since the database structure (schema of subscription tables, constraints,
+ * etc.) of the publisher and subscriber could be different, applying
+ * transactions in parallel mode on the subscriber side can cause some
+ * deadlocks that do not occur on the publisher side.

I think this paragraph needs to be rephrased a bit.  It is saying that
some deadlocks can occur on the subscriber which did not occur on the
publisher.  I think what it should be conveying is that a deadlock
can occur due to concurrently applying conflicting/dependent
transactions which are not conflicting/dependent on the publisher due
to <explain reason>.  Because if we had the same schema on the
publisher it might not have ended up in a deadlock; instead, it would
have been executed in sequence (due to lock waiting). So the main
point we are conveying is that transactions which were independent
of each other on the publisher could be dependent on the subscriber,
and they can end up in a deadlock due to parallel apply.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Jan 4, 2023 at 4:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> 2.
> + * Since the database structure (schema of subscription tables, constraints,
> + * etc.) of the publisher and subscriber could be different, applying
> + * transactions in parallel mode on the subscriber side can cause some
> + * deadlocks that do not occur on the publisher side.
>
> I think this paragraph needs to be rephrased a bit.  It is saying that
> some deadlock can occur on subscribers which did not occur on the
> publisher.  I think what it should be conveying is that the deadlock
> can occur due to concurrently applying the conflicting/dependent
> transactions which are not conflicting/dependent on the publisher due
> to <explain reason>.  Because if we create the same schema on the
> publisher it might not have ended up in a deadlock instead it would
> have been executed in sequence (due to lock waiting). So the main
> point we are conveying is that the transaction which was independent
> of each other on the publisher could be dependent on the subscriber
> and they can end up in deadlock due to parallel apply.
>

How about changing it to: "We have a risk of deadlock due to
parallelly applying the transactions that were independent on the
publisher side but became dependent on the subscriber side due to the
different database structures (like schema of subscription tables,
constraints, etc.) on each side.

-- 
With Regards,
Amit Kapila.



On Wed, Jan 4, 2023 at 6:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 4, 2023 at 4:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > 2.
> > + * Since the database structure (schema of subscription tables, constraints,
> > + * etc.) of the publisher and subscriber could be different, applying
> > + * transactions in parallel mode on the subscriber side can cause some
> > + * deadlocks that do not occur on the publisher side.
> >
> > I think this paragraph needs to be rephrased a bit.  It is saying that
> > some deadlock can occur on subscribers which did not occur on the
> > publisher.  I think what it should be conveying is that the deadlock
> > can occur due to concurrently applying the conflicting/dependent
> > transactions which are not conflicting/dependent on the publisher due
> > to <explain reason>.  Because if we create the same schema on the
> > publisher it might not have ended up in a deadlock instead it would
> > have been executed in sequence (due to lock waiting). So the main
> > point we are conveying is that the transaction which was independent
> > of each other on the publisher could be dependent on the subscriber
> > and they can end up in deadlock due to parallel apply.
> >
>
> How about changing it to: "We have a risk of deadlock due to
> parallelly applying the transactions that were independent on the
> publisher side but became dependent on the subscriber side due to the
> different database structures (like schema of subscription tables,
> constraints, etc.) on each side.

I think this looks good to me.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, January 4, 2023 9:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Wed, Jan 4, 2023 at 6:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 4, 2023 at 4:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > 2.
> > > + * Since the database structure (schema of subscription tables,
> > > + constraints,
> > > + * etc.) of the publisher and subscriber could be different,
> > > + applying
> > > + * transactions in parallel mode on the subscriber side can cause
> > > + some
> > > + * deadlocks that do not occur on the publisher side.
> > >
> > > I think this paragraph needs to be rephrased a bit.  It is saying
> > > that some deadlock can occur on subscribers which did not occur on
> > > the publisher.  I think what it should be conveying is that the
> > > deadlock can occur due to concurrently applying the
> > > conflicting/dependent transactions which are not
> > > conflicting/dependent on the publisher due to <explain reason>.
> > > Because if we create the same schema on the publisher it might not
> > > have ended up in a deadlock instead it would have been executed in
> > > sequence (due to lock waiting). So the main point we are conveying
> > > is that the transaction which was independent of each other on the
> > > publisher could be dependent on the subscriber and they can end up in
> deadlock due to parallel apply.
> > >
> >
> > How about changing it to: "We have a risk of deadlock due to
> > parallelly applying the transactions that were independent on the
> > publisher side but became dependent on the subscriber side due to the
> > different database structures (like schema of subscription tables,
> > constraints, etc.) on each side.
> 
> I think this looks good to me.

Thanks for the comments.
Attached is the new version patch set which changes the comments as suggested.

Best regards,
Hou zj

Attachment
On Thu, Jan 5, 2023 at 9:07 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 4, 2023 9:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> > I think this looks good to me.
>
> Thanks for the comments.
> Attach the new version patch set which changed the comments as suggested.

Thanks for the updated patch. While testing this I see one strange
behavior which seems like a bug to me; here are the steps to reproduce:

1. start 2 servers(config: logical_decoding_work_mem=64kB)
./pg_ctl -D data/ -c -l pub_logs start
./pg_ctl -D data1/ -c -l sub_logs start

2. Publisher:
create table t(a int PRIMARY KEY ,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
create publication test_pub for table t
with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;
insert into t values (generate_series(1,2000),large_val()) ON CONFLICT
(a) DO UPDATE SET a=EXCLUDED.a*300;

3. Subscription Server:
create table t(a int,b text);
create subscription test_sub CONNECTION 'host=localhost port=5432
dbname=postgres' PUBLICATION test_pub WITH ( slot_name =
test_slot_sub1,streaming=parallel);

4. Publication Server:
begin ;
savepoint a;
delete from t;
savepoint b;
insert into t values (generate_series(1,5000),large_val()) ON CONFLICT
(a) DO UPDATE SET a=EXCLUDED.a*30000;  -- (while executing this start
publisher in 2-3 secs)

Restart the publication server, while the transaction is still in an
uncommitted state.
./pg_ctl -D data/ -c -l pub_logs stop -mi
./pg_ctl -D data/ -c -l pub_logs start -mi

after this, the parallel apply worker is stuck waiting on the stream lock
forever (even after 10 mins) -- see below. From the subscriber logs I can
see that one of the parallel apply workers [75677] started but never
finished [no error]; after that I performed another operation [the same
insert] which got applied by a new parallel apply worker that
started and finished within 1 second.

dilipku+  75660      1  0 13:39 ?        00:00:00
/home/dilipkumar/work/PG/install/bin/postgres -D data
dilipku+  75661  75660  0 13:39 ?        00:00:00 postgres: checkpointer
dilipku+  75662  75660  0 13:39 ?        00:00:00 postgres: background writer
dilipku+  75664  75660  0 13:39 ?        00:00:00 postgres: walwriter
dilipku+  75665  75660  0 13:39 ?        00:00:00 postgres: autovacuum launcher
dilipku+  75666  75660  0 13:39 ?        00:00:00 postgres: logical
replication launcher
dilipku+  75675  75595  0 13:39 ?        00:00:00 postgres: logical
replication apply worker for subscription 16389
dilipku+  75676  75660  0 13:39 ?        00:00:00 postgres: walsender
dilipkumar postgres ::1(42192) START_REPLICATION
dilipku+  75677  75595  0 13:39 ?        00:00:00 postgres: logical
replication parallel apply worker for subscription 16389  waiting


Subscriber logs:
2023-01-05 13:39:07.261 IST [75595] LOG:  background worker "logical
replication worker" (PID 75649) exited with exit code 1
2023-01-05 13:39:12.272 IST [75675] LOG:  logical replication apply
worker for subscription "test_sub" has started
2023-01-05 13:39:12.307 IST [75677] LOG:  logical replication parallel
apply worker for subscription "test_sub" has started
2023-01-05 13:43:31.003 IST [75596] LOG:  checkpoint starting: time
2023-01-05 13:46:32.045 IST [76337] LOG:  logical replication parallel
apply worker for subscription "test_sub" has started
2023-01-05 13:46:35.214 IST [76337] LOG:  logical replication parallel
apply worker for subscription "test_sub" has finished
2023-01-05 13:46:50.241 IST [76384] LOG:  logical replication parallel
apply worker for subscription "test_sub" has started
2023-01-05 13:46:53.676 IST [76384] LOG:  logical replication parallel
apply worker for subscription "test_sub" has finished

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Thu, Jan 5, 2023 at 9:07 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, January 4, 2023 9:29 PM Dilip Kumar
> <dilipbalaut@gmail.com> wrote:
> 
> > > I think this looks good to me.
> >
> > Thanks for the comments.
> > Attach the new version patch set which changed the comments as
> suggested.
> 
> Thanks for the updated patch, while testing this I see one strange
> behavior which seems like bug to me, here is the step to reproduce
> 
> 1. start 2 servers(config: logical_decoding_work_mem=64kB)
> ./pg_ctl -D data/ -c -l pub_logs start
> ./pg_ctl -D data1/ -c -l sub_logs start
> 
> 2. Publisher:
> create table t(a int PRIMARY KEY ,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> create publication test_pub for table t
> with(PUBLISH='insert,delete,update,truncate');
> alter table t replica identity FULL ;
> insert into t values (generate_series(1,2000),large_val()) ON CONFLICT
> (a) DO UPDATE SET a=EXCLUDED.a*300;
> 
> 3. Subscription Server:
> create table t(a int,b text);
> create subscription test_sub CONNECTION 'host=localhost port=5432
> dbname=postgres' PUBLICATION test_pub WITH ( slot_name =
> test_slot_sub1,streaming=parallel);
> 
> 4. Publication Server:
> begin ;
> savepoint a;
> delete from t;
> savepoint b;
> insert into t values (generate_series(1,5000),large_val()) ON CONFLICT
> (a) DO UPDATE SET a=EXCLUDED.a*30000;  -- (while executing this start
> publisher in 2-3 secs)
> 
> Restart the publication server, while the transaction is still in an
> uncommitted state.
> ./pg_ctl -D data/ -c -l pub_logs stop -mi
> ./pg_ctl -D data/ -c -l pub_logs start -mi
> 
> after this, the parallel apply worker stuck in waiting on stream lock
> forever (even after 10 mins) -- see below, from subscriber logs I can
> see one of the parallel apply worker [75677] started but never
> finished [no error], after that I have performed more operation [same
> insert] which got applied by new parallel apply worked which got
> started and finished within 1 second.
> 

Thanks for reporting the problem.

After analyzing the behavior, I think it's a bug on the publisher side which
is not directly related to parallel apply.

I think the root cause is that we didn't send a stream end (stream abort)
message to the subscriber for the crashed transaction which was streamed
before.

The behavior is that, after restarting, the publisher will start to decode the
transaction that was aborted due to the crash, and when it tries to stream the
first change of that transaction, it will send a stream start message but then
realize that the transaction was aborted, so it will enter the PG_CATCH block
of ReorderBufferProcessTXN() and call ReorderBufferResetTXN(), which sends the
stream stop message. In this case, there would be a parallel apply worker
started on the subscriber waiting for a stream end message which will never
come.

I think the same behavior happens in non-parallel mode, which will leave a
stream file on the subscriber that will not be cleaned up until the apply
worker is restarted.

To fix it, I think we need to send a stream abort message when we are cleaning
up the crashed transaction on the publisher (e.g., in ReorderBufferAbortOld()).
Here is a tiny patch which makes that change. I have confirmed that the bug is
fixed and all regression tests pass.
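
For reference, a rough sketch of the kind of change I mean is below (the exact
placement and checks are assumptions for illustration; the attached patch is
the authoritative version):

    /* In ReorderBufferAbortOld(), while cleaning up the crashed transaction. */
    if (rbtxn_is_streamed(txn))
    {
        /*
         * Tell the subscriber that the streamed transaction is gone, so a
         * waiting parallel apply worker (or a leftover stream file) can be
         * cleaned up instead of waiting for a stream end that never comes.
         */
        rb->stream_abort(rb, txn, InvalidXLogRecPtr);
    }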

What do you think? I will start a new thread and try to write a test case,
if possible, after we reach a consensus.

Best regards,
Hou zj

Attachment
On Thu, Jan 5, 2023 at 5:03 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

> Thanks for reporting the problem.
>
> After analyzing the behavior, I think it's a bug on publisher side which
> is not directly related to parallel apply.
>
> I think the root reason is that we didn't try to send a stream end(stream
> abort) message to subscriber for the crashed transaction which was streamed
> before.
> The behavior is that, after restarting, the publisher will start to decode the
> transaction that aborted due to crash, and when try to stream the first change
> of that transaction, it will send a stream start message but then it realizes
> that the transaction was aborted, so it will enter the PG_CATCH block of
> ReorderBufferProcessTXN() and call ReorderBufferResetTXN() which send the
> stream stop message. And in this case, there would be a parallel apply worker
> started on subscriber waiting for stream end message which will never come.

I suspected it but didn't analyze this.

> I think the same behavior happens for the non-parallel mode which will cause
> a stream file left on subscriber and will not be cleaned until the apply worker is
> restarted.
> To fix it, I think we need to send a stream abort message when we are cleaning
> up crashed transaction on publisher(e.g., in ReorderBufferAbortOld()). And here
> is a tiny patch which change the same. I have confirmed that the bug is fixed
> and all regression tests pass.
>
> What do you think ?
> I will start a new thread and try to write a testcase if possible
> after reaching a consensus.

I think your analysis looks correct and we can raise this in a new thread.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, January 5, 2023 7:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Thu, Jan 5, 2023 at 5:03 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > >
> 
> > Thanks for reporting the problem.
> >
> > After analyzing the behavior, I think it's a bug on publisher side
> > which is not directly related to parallel apply.
> >
> > I think the root reason is that we didn't try to send a stream
> > end(stream
> > abort) message to subscriber for the crashed transaction which was
> > streamed before.
> > The behavior is that, after restarting, the publisher will start to
> > decode the transaction that aborted due to crash, and when try to
> > stream the first change of that transaction, it will send a stream
> > start message but then it realizes that the transaction was aborted,
> > so it will enter the PG_CATCH block of
> > ReorderBufferProcessTXN() and call ReorderBufferResetTXN() which send
> > the stream stop message. And in this case, there would be a parallel
> > apply worker started on subscriber waiting for stream end message which
> will never come.
> 
> I suspected it but didn't analyze this.
> 
> > I think the same behavior happens for the non-parallel mode which will
> > cause a stream file left on subscriber and will not be cleaned until
> > the apply worker is restarted.
> > To fix it, I think we need to send a stream abort message when we are
> > cleaning up crashed transaction on publisher(e.g., in
> > ReorderBufferAbortOld()). And here is a tiny patch which change the
> > same. I have confirmed that the bug is fixed and all regression tests pass.
> >
> > What do you think ?
> > I will start a new thread and try to write a testcase if possible
> > after reaching a consensus.
> 
> I think your analysis looks correct and we can raise this in a new thread.

Thanks, I have started another thread [1].

Attaching the parallel apply patch set here again. I didn't change the patch
set; I'm attaching it here just to let the CFbot keep testing it.

[1]
https://www.postgresql.org/message-id/OS0PR01MB5716A773F46768A1B75BE24394FB9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hou zj

Attachment
On Fri, Jan 6, 2023 at 9:37 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, January 5, 2023 7:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks, I have started another thread[1]
>
> Attach the parallel apply patch set here again. I didn't change the patch set,
> attach it here just to let the CFbot keep testing it.

I have completed the review and some basic testing and it mostly looks
fine to me.  Here is my last set of comments/suggestions.

1.
    /*
     * Don't start a new parallel worker if user has set skiplsn as it's
     * possible that they want to skip the streaming transaction. For
     * streaming transactions, we need to serialize the transaction to a file
     * so that we can get the last LSN of the transaction to judge whether to
     * skip before starting to apply the change.
     */
    if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
        return false;


I think it is fine to block parallelism in this case, but it is also
possible to make it less restrictive: only if the first LSN of the
transaction is <= skiplsn is it possible that the final_lsn might match
skiplsn; otherwise that is not possible. If we want, we could allow
parallelism in the latter case.

I understand that currently we do not have the first_lsn of the
transaction in the stream start message, but I think that should be easy
to add. Although I am not sure if it is worth it, it's good to at least
make a note of it.

2.

+     * XXX Additionally, we also stop the worker if the leader apply worker
+     * serialize part of the transaction data due to a send timeout. This is
+     * because the message could be partially written to the queue and there
+     * is no way to clean the queue other than resending the message until it
+     * succeeds. Instead of trying to send the data which anyway would have
+     * been serialized and then letting the parallel apply worker deal with
+     * the spurious message, we stop the worker.
+     */
+    if (winfo->serialize_changes ||
+        list_length(ParallelApplyWorkerPool) >
+        (max_parallel_apply_workers_per_subscription / 2))

IMHO this reason (XXX Additionally, we also stop the worker if the
leader apply worker serialize part of the transaction data due to a
send timeout) for stopping the worker looks a bit hackish to me. It
may be a rare case, so I am not talking about performance, but the
reasoning behind stopping is not good. Ideally we should be able to
clean up the message queue and reuse the worker.

3.
+        else if (shmq_res == SHM_MQ_WOULD_BLOCK)
+        {
+            /* Replay the changes from the file, if any. */
+            if (pa_has_spooled_message_pending())
+            {
+                pa_spooled_messages();
+            }

I think we do not need this pa_has_spooled_message_pending() function,
because it just calls pa_get_fileset_state(), which acquires the mutex
and gets the file state; then, if the file state is not FS_EMPTY, we call
pa_spooled_messages(), which will again call pa_get_fileset_state() and
again acquire the mutex. So when the state is FS_SERIALIZE_IN_PROGRESS we
will frequently call pa_get_fileset_state() twice in a row, and I think
we can easily achieve the same behavior with just one call.

4.

+     * leader, or when there there is an error. None of these cases will allow
+     * the code to reach here.

/when there there is an error/when there is an error



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 6, 2023 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jan 6, 2023 at 9:37 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, January 5, 2023 7:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Thanks, I have started another thread[1]
> >
> > Attach the parallel apply patch set here again. I didn't change the patch set,
> > attach it here just to let the CFbot keep testing it.
>
> I have completed the review and some basic testing and it mostly looks
> fine to me.  Here is my last set of comments/suggestions.
>
> 1.
>     /*
>      * Don't start a new parallel worker if user has set skiplsn as it's
>      * possible that they want to skip the streaming transaction. For
>      * streaming transactions, we need to serialize the transaction to a file
>      * so that we can get the last LSN of the transaction to judge whether to
>      * skip before starting to apply the change.
>      */
>     if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
>         return false;
>
>
> I think this is fine to block parallelism in this case, but it is also
> possible to make it less restrictive, basically, only if the first lsn
> of the transaction is <= skiplsn, then only it is possible that the
> final_lsn might match with skiplsn otherwise that is not possible. And
> if we want then we can allow parallelism in that case.
>
> I understand that currently we do not have first_lsn of the
> transaction in stream start message but I think that should be easy to
> do?  Although I am not sure if it is worth it, it's good to make a
> note at least.
>

Yeah, I also don't think sending extra eight bytes with stream_start
message is worth it. But it is fine to mention the same in the
comments.
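
Just to spell out the idea for that comment, the less restrictive check would
look roughly like the below (illustrative only -- it assumes a first_lsn value
that the stream_start message does not currently carry):

    /*
     * Only when the transaction started at or before skiplsn can its
     * final_lsn possibly match skiplsn, so only then do we need to
     * serialize the transaction; otherwise parallel apply is fine.
     */
    if (!XLogRecPtrIsInvalid(MySubscription->skiplsn) &&
        first_lsn <= MySubscription->skiplsn)
        return false;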

> 2.
>
> +     * XXX Additionally, we also stop the worker if the leader apply worker
> +     * serialize part of the transaction data due to a send timeout. This is
> +     * because the message could be partially written to the queue and there
> +     * is no way to clean the queue other than resending the message until it
> +     * succeeds. Instead of trying to send the data which anyway would have
> +     * been serialized and then letting the parallel apply worker deal with
> +     * the spurious message, we stop the worker.
> +     */
> +    if (winfo->serialize_changes ||
> +        list_length(ParallelApplyWorkerPool) >
> +        (max_parallel_apply_workers_per_subscription / 2))
>
> IMHO this reason (XXX Additionally, we also stop the worker if the
> leader apply worker serialize part of the transaction data due to a
> send timeout) for stopping the worker looks a bit hackish to me.  It
> may be a rare case so I am not talking about the performance but the
> reasoning behind stopping is not good. Ideally we should be able to
> clean up the message queue and reuse the worker.
>

TBH, I don't know of a better way to deal with this given the current
infrastructure. I thought we could do this as a separate enhancement in
the future.

> 3.
> +        else if (shmq_res == SHM_MQ_WOULD_BLOCK)
> +        {
> +            /* Replay the changes from the file, if any. */
> +            if (pa_has_spooled_message_pending())
> +            {
> +                pa_spooled_messages();
> +            }
>
> I think we do not need this pa_has_spooled_message_pending() function.
> Because this function is just calling pa_get_fileset_state() which is
> acquiring mutex and getting filestate then if the filestate is not
> FS_EMPTY then we call pa_spooled_messages() that will again call
> pa_get_fileset_state() which will again acquire mutex.  I think when
> the state is FS_SERIALIZE_IN_PROGRESS it will frequently call
> pa_get_fileset_state() consecutively 2 times, and I think we can
> easily achieve the same behavior with just one call.
>

This is just to keep the code easy to follow. As this would be a rare
case, I thought of giving preference to code clarity.

-- 
With Regards,
Amit Kapila.



On Fri, Jan 6, 2023 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

>
> Yeah, I also don't think sending extra eight bytes with stream_start
> message is worth it. But it is fine to mention the same in the
> comments.

Right.

> > 2.
> >
> > +     * XXX Additionally, we also stop the worker if the leader apply worker
> > +     * serialize part of the transaction data due to a send timeout. This is
> > +     * because the message could be partially written to the queue and there
> > +     * is no way to clean the queue other than resending the message until it
> > +     * succeeds. Instead of trying to send the data which anyway would have
> > +     * been serialized and then letting the parallel apply worker deal with
> > +     * the spurious message, we stop the worker.
> > +     */
> > +    if (winfo->serialize_changes ||
> > +        list_length(ParallelApplyWorkerPool) >
> > +        (max_parallel_apply_workers_per_subscription / 2))
> >
> > IMHO this reason (XXX Additionally, we also stop the worker if the
> > leader apply worker serialize part of the transaction data due to a
> > send timeout) for stopping the worker looks a bit hackish to me.  It
> > may be a rare case so I am not talking about the performance but the
> > reasoning behind stopping is not good. Ideally we should be able to
> > clean up the message queue and reuse the worker.
> >
>
> TBH, I don't know what is the better way to deal with this with the
> current infrastructure. I thought we can do this as a separate
> enhancement in the future.

Okay.

> > 3.
> > +        else if (shmq_res == SHM_MQ_WOULD_BLOCK)
> > +        {
> > +            /* Replay the changes from the file, if any. */
> > +            if (pa_has_spooled_message_pending())
> > +            {
> > +                pa_spooled_messages();
> > +            }
> >
> > I think we do not need this pa_has_spooled_message_pending() function.
> > Because this function is just calling pa_get_fileset_state() which is
> > acquiring mutex and getting filestate then if the filestate is not
> > FS_EMPTY then we call pa_spooled_messages() that will again call
> > pa_get_fileset_state() which will again acquire mutex.  I think when
> > the state is FS_SERIALIZE_IN_PROGRESS it will frequently call
> > pa_get_fileset_state() consecutively 2 times, and I think we can
> > easily achieve the same behavior with just one call.
> >
>
> This is just to keep the code easy to follow. As this would be a rare
> case, so thought of giving preference to code clarity.

I think the code will be simpler with just one function, no? I mean,
instead of calling pa_has_spooled_message_pending() in the if condition,
what if we directly call pa_spooled_messages()? It is fetching the
file_state anyway, and if the file state is EMPTY it can return false, in
which case we can execute the code that is currently in the else branch.
We might need to change the name of the function though.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 6, 2023 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > 3.
> > > +        else if (shmq_res == SHM_MQ_WOULD_BLOCK)
> > > +        {
> > > +            /* Replay the changes from the file, if any. */
> > > +            if (pa_has_spooled_message_pending())
> > > +            {
> > > +                pa_spooled_messages();
> > > +            }
> > >
> > > I think we do not need this pa_has_spooled_message_pending() function.
> > > Because this function is just calling pa_get_fileset_state() which is
> > > acquiring mutex and getting filestate then if the filestate is not
> > > FS_EMPTY then we call pa_spooled_messages() that will again call
> > > pa_get_fileset_state() which will again acquire mutex.  I think when
> > > the state is FS_SERIALIZE_IN_PROGRESS it will frequently call
> > > pa_get_fileset_state() consecutively 2 times, and I think we can
> > > easily achieve the same behavior with just one call.
> > >
> >
> > This is just to keep the code easy to follow. As this would be a rare
> > case, so thought of giving preference to code clarity.
>
> I think the code will be simpler with just one function no? I mean
> instead of calling pa_has_spooled_message_pending() in if condition
> what if we directly call pa_spooled_messages();, this is anyway
> fetching the file_state and if the filestate is EMPTY then it can
> return false, and if it returns false we can execute the code which is
> there in else condition.  We might need to change the name of the
> function though.
>
But anyway it is not a performance-critical path so if you think the
current way looks cleaner then I am fine with that too.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, January 6, 2023 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

Thanks for your comments.

> On Fri, Jan 6, 2023 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> 
> >
> > Yeah, I also don't think sending extra eight bytes with stream_start
> > message is worth it. But it is fine to mention the same in the
> > comments.
> 
> Right.

Added some comments.

> 
> > > 2.
> > >
> > > +     * XXX Additionally, we also stop the worker if the leader apply
> worker
> > > +     * serialize part of the transaction data due to a send timeout. This is
> > > +     * because the message could be partially written to the queue and
> there
> > > +     * is no way to clean the queue other than resending the message
> until it
> > > +     * succeeds. Instead of trying to send the data which anyway would
> have
> > > +     * been serialized and then letting the parallel apply worker deal with
> > > +     * the spurious message, we stop the worker.
> > > +     */
> > > +    if (winfo->serialize_changes ||
> > > +        list_length(ParallelApplyWorkerPool) >
> > > +        (max_parallel_apply_workers_per_subscription / 2))
> > >
> > > IMHO this reason (XXX Additionally, we also stop the worker if the
> > > leader apply worker serialize part of the transaction data due to a
> > > send timeout) for stopping the worker looks a bit hackish to me.  It
> > > may be a rare case so I am not talking about the performance but the
> > > reasoning behind stopping is not good. Ideally we should be able to
> > > clean up the message queue and reuse the worker.
> > >
> >
> > TBH, I don't know what is the better way to deal with this with the
> > current infrastructure. I thought we can do this as a separate
> > enhancement in the future.
> 
> Okay.
> 
> > > 3.
> > > +        else if (shmq_res == SHM_MQ_WOULD_BLOCK)
> > > +        {
> > > +            /* Replay the changes from the file, if any. */
> > > +            if (pa_has_spooled_message_pending())
> > > +            {
> > > +                pa_spooled_messages();
> > > +            }
> > >
> > > I think we do not need this pa_has_spooled_message_pending() function.
> > > Because this function is just calling pa_get_fileset_state() which
> > > is acquiring mutex and getting filestate then if the filestate is
> > > not FS_EMPTY then we call pa_spooled_messages() that will again call
> > > pa_get_fileset_state() which will again acquire mutex.  I think when
> > > the state is FS_SERIALIZE_IN_PROGRESS it will frequently call
> > > pa_get_fileset_state() consecutively 2 times, and I think we can
> > > easily achieve the same behavior with just one call.
> > >
> >
> > This is just to keep the code easy to follow. As this would be a rare
> > case, so thought of giving preference to code clarity.
> 
> I think the code will be simpler with just one function no? I mean instead of
> calling pa_has_spooled_message_pending() in if condition what if we directly
> call pa_spooled_messages();, this is anyway fetching the file_state and if the
> filestate is EMPTY then it can return false, and if it returns false we can execute
> the code which is there in else condition.  We might need to change the name
> of the function though.

Changed as suggested.

I have addressed all the comments and here is the new version patch set.
I also added some documentation about the new lock and fixed some typos.

Attach the new version patch set.

Best regards,
Hou zj



Attachment
On Fri, Jan 6, 2023 at 3:38 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

Looks good, but I feel that in the pa_process_spooled_messages_if_required()
function the first check after getting the file state should be
if (filestate == FS_EMPTY) return false.  I mean, why process through
all the states if it is empty when we can exit directly?  It is not a
big deal, so if you prefer the way it is then I have no objection to
it.
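
Just to be concrete, the shape I had in mind is roughly the below (a sketch
only; the type name, signatures, and the elided replay/wait logic are
assumptions, not the actual patch code):

    static bool
    pa_process_spooled_messages_if_required(void)
    {
        PartialFileSetState fileset_state;  /* type name is a guess */

        fileset_state = pa_get_fileset_state();

        /* Nothing was serialized by the leader, so there is nothing to do. */
        if (fileset_state == FS_EMPTY)
            return false;

        /*
         * Otherwise, wait for the leader if it is still serializing
         * (FS_SERIALIZE_IN_PROGRESS) and then replay the spooled messages
         * from the file set -- elided in this sketch.
         */
        return true;
    }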

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Saturday, January 7, 2023 12:50 PM Dilip Kumar <dilipbalaut@gmail.com>
> 
> On Fri, Jan 6, 2023 at 3:38 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> wrote:
> >
> 
> Looks good, but I feel in pa_process_spooled_messages_if_required()
> function after getting the filestate the first check should be if (filestate==
> FS_EMPTY) return false.  I mean why to process through all the states if it is
> empty and we can directly exit.  It is not a big deal so if you prefer the way it is
> then I have no objection to it.

I think your suggestion looks good, I have adjusted the code.
I also rebased the patch set due to the recent commit c6e1f6.
And here is the new version patch set.

Best regards,
Hou zj

Attachment
On Sat, Jan 7, 2023 at 11:13 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, January 7, 2023 12:50 PM Dilip Kumar <dilipbalaut@gmail.com>
> >
> > On Fri, Jan 6, 2023 at 3:38 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> >
> > Looks good, but I feel in pa_process_spooled_messages_if_required()
> > function after getting the filestate the first check should be if (filestate==
> > FS_EMPTY) return false.  I mean why to process through all the states if it is
> > empty and we can directly exit.  It is not a big deal so if you prefer the way it is
> > then I have no objection to it.
>
> I think your suggestion looks good, I have adjusted the code.
> I also rebase the patch set due to the recent commit c6e1f6.
> And here is the new version patch set.
>

LGTM

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sat, Jan 7, 2023 at 2:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

Today, I was analyzing this patch w.r.t. the recent commit c6e1f62e2c and
found that pa_set_xact_state() should set the latch (wake up) for the
leader worker, as the leader could be waiting in
pa_wait_for_xact_state(). What do you think? But otherwise, it should
be okay w.r.t. DDLs because this patch allows the leader worker to
restart logical replication for a subscription parameter change, which
will in turn stop/restart parallel workers if required.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Sunday, January 8, 2023 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Sat, Jan 7, 2023 at 2:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> 
> Today, I was analyzing this patch w.r.t recent commit c6e1f62e2c and found that
> pa_set_xact_state() should set the latch (wake up) for the leader worker as the
> leader could be waiting in pa_wait_for_xact_state(). What do you think? But
> otherwise, it should be okay w.r.t DDLs because this patch allows the leader
> worker to restart logical replication for subscription parameter change which will
> in turn stop/restart parallel workers if required.

Thanks for the analysis. I agree that it would be better to signal the leader
when setting the state to PARALLEL_TRANS_STARTED, otherwise it might slightly
delay catching the state change in pa_wait_for_xact_state(), so I have
updated the patch accordingly. Besides, I also checked commit c6e1f62e2c;
I think a DDL operation doesn't need to wake up the parallel apply worker
directly, as the parallel apply worker doesn't start table sync and only
communicates with the leader, so I didn't find any other places that need
to be changed.
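
To be specific, the change is along these lines (a simplified sketch only;
the struct name and the way the leader's latch is reached here are
assumptions, the real code is in the attached patch):

    static void
    pa_set_xact_state(ParallelApplyWorkerShared *wshared,
                      ParallelTransState xact_state)
    {
        SpinLockAcquire(&wshared->mutex);
        wshared->xact_state = xact_state;
        SpinLockRelease(&wshared->mutex);

        /*
         * Wake the leader, which may be sleeping in pa_wait_for_xact_state(),
         * so that it notices the PARALLEL_TRANS_STARTED state promptly.
         */
        if (xact_state == PARALLEL_TRANS_STARTED)
            SetLatch(&wshared->leader_latch);   /* assumed latch field */
    }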

Attach the updated patch set.

Best regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> On Sunday, January 8, 2023 10:14 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Sat, Jan 7, 2023 at 2:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> >
> > Today, I was analyzing this patch w.r.t recent commit c6e1f62e2c and
> > found that
> > pa_set_xact_state() should set the latch (wake up) for the leader
> > worker as the leader could be waiting in pa_wait_for_xact_state().
> > What do you think? But otherwise, it should be okay w.r.t DDLs because
> > this patch allows the leader worker to restart logical replication for
> > subscription parameter change which will in turn stop/restart parallel workers
> if required.
> 
> Thanks for the analysis. I agree that it would be better to signal the leader when
> setting the state to PARALLEL_TRANS_STARTED, otherwise it might slightly delay
> the timing of catch the state change in pa_wait_for_xact_state(), so I have
> updated the patch for the same. Besides, I also checked commit c6e1f62e2c, I
> think DDL operation doesn't need to wake up the parallel apply worker directly
> as the parallel apply worker doesn't start table sync and only communicate with
> the leader, so I didn't find some other places that need to be changed.
> 
> Attach the updated patch set.

Sorry, the commit message of 0001 was accidentally deleted, just attach
the same patch set again with commit message.

Attachment
On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > Attach the updated patch set.
>
> Sorry, the commit message of 0001 was accidentally deleted, just attach
> the same patch set again with commit message.
>

Pushed the first (0001) patch.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
Hi, Thanks for the great new feature.

The applied patches include adding the wait events LogicalParallelApplyMain and
LogicalParallelApplyStateChange. However, it seems that monitoring.sgml only
contains descriptions for pg_locks. The attached patch adds the relevant wait
event information.

Please update if you have a better description.

Noriyoshi Shinoda
-----Original Message-----
From: Amit Kapila <amit.kapila16@gmail.com> 
Sent: Monday, January 9, 2023 5:51 PM
To: houzj.fnst@fujitsu.com
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; wangw.fnst@fujitsu.com; Peter Smith <smithpb2250@gmail.com>;
shiy.fnst@fujitsu.com; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Dilip Kumar <dilipbalaut@gmail.com>
 
Subject: Re: Perform streaming logical transactions by background workers and parallel apply

On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > Attach the updated patch set.
>
> Sorry, the commit message of 0001 was accidentally deleted, just 
> attach the same patch set again with commit message.
>

Pushed the first (0001) patch.

--
With Regards,
Amit Kapila.



Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 9, 2023 5:32 PM Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> wrote:
> 
> Hi, Thanks for the great new feature.
> 
> Applied patches include adding wait events LogicalParallelApplyMain,
> LogicalParallelApplyStateChange.
> However, it seems that monitoring.sgml only contains descriptions for
> pg_locks. The attached patch adds relevant wait event information.
> Please update if you have a better description.

Thanks for reporting. I think for LogicalParallelApplyStateChange we'd better
document it in a consistent style with LogicalSyncStateChange, so I have
slightly adjusted the patch for the same.

Best regards,
Hou zj


Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
Thanks for the reply.

> Thanks for reporting. I think for LogicalParallelApplyStateChange we'd better
> document it in a consistent style with LogicalSyncStateChange, so I have
> slightly adjusted the patch for the same.

I think the description in the patch you attached is better.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> 
Sent: Monday, January 9, 2023 7:15 PM
To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com>; Amit Kapila <amit.kapila16@gmail.com>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; wangw.fnst@fujitsu.com; Peter Smith <smithpb2250@gmail.com>;
shiy.fnst@fujitsu.com; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Dilip Kumar <dilipbalaut@gmail.com>
 
Subject: RE: Perform streaming logical transactions by background workers and parallel apply

On Monday, January 9, 2023 5:32 PM Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> wrote:
> 
> Hi, Thanks for the great new feature.
> 
> Applied patches include adding wait events LogicalParallelApplyMain, 
> LogicalParallelApplyStateChange.
> However, it seems that monitoring.sgml only contains descriptions for 
> pg_locks. The attached patch adds relevant wait event information.
> Please update if you have a better description.

Thanks for reporting. I think for LogicalParallelApplyStateChange we'd better
document it in a consistent style with LogicalSyncStateChange, so I have
slightly adjusted the patch for the same.
 

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 9, 2023 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > > Attach the updated patch set.
> >
> > Sorry, the commit message of 0001 was accidentally deleted, just
> > attach the same patch set again with commit message.
> >
> 
> Pushed the first (0001) patch.

Thanks for pushing, here are the remaining patches.
I reordered the patch numbers to put the patches that are easier to
commit in front of the others.

Best regards,
Hou zj


Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Kyotaro Horiguchi
Date:
Hello.

At Mon, 9 Jan 2023 14:21:03 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> Pushed the first (0001) patch.

It added the following error message.

+    seg = dsm_attach(handle);
+    if (!seg)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("unable to map dynamic shared memory segment")));

On the other hand we already have the following one in parallel.c
(another in pg_prewarm)

    seg = dsm_attach(DatumGetUInt32(main_arg));
    if (seg == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("could not map dynamic shared memory segment")));

Although I don't see a technical difference between the two, all the
other occurrences, including the one just above (except test_shm_mq), use
"could not". A faint memory in my non-durable memory tells me that we
have a policy of using "can/could not" rather than "unable".

(Mmm. I find ones in StartBackgroundWorker and sepgsql_client_auth.)

Shouldn't we use the latter rather than the former?  If that's true, it seems
to me that test_shm_mq also needs the same amendment to avoid the same
mistake in the future.

=====
index 2e5914d5d9..a2d7474ed4 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -891,7 +891,7 @@ ParallelApplyWorkerMain(Datum main_arg)
        if (!seg)
                ereport(ERROR,
                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-                                errmsg("unable to map dynamic shared memory segment")));
+                                errmsg("could not map dynamic shared memory segment")));
 
        toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg));
        if (!toc)
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 8807727337..005b56023b 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -81,7 +81,7 @@ test_shm_mq_main(Datum main_arg)
        if (seg == NULL)
                ereport(ERROR,
                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-                                errmsg("unable to map dynamic shared memory segment")));
+                                errmsg("could not map dynamic shared memory segment")));
        toc = shm_toc_attach(PG_TEST_SHM_MQ_MAGIC, dsm_segment_address(seg));
        if (toc == NULL)
                ereport(ERROR,
=====

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Tue, Jan 10, 2023 at 11:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 9 Jan 2023 14:21:03 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > Pushed the first (0001) patch.
>
> It added the following error message.
>
> +       seg = dsm_attach(handle);
> +       if (!seg)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                errmsg("unable to map dynamic shared memory segment")));
>
> On the other hand we already have the following one in parallel.c
> (another in pg_prewarm)
>
>         seg = dsm_attach(DatumGetUInt32(main_arg));
>         if (seg == NULL)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                                  errmsg("could not map dynamic shared memory segment")));
>
> Although I don't see a technical difference between the two, all the
> other occurances including the just above (except test_shm_mq) use
> "could not". A faint memory in my non-durable memory tells me that we
> have a policy that we use "can/could not" than "unable".
>

Right, it is mentioned in docs [1] (see section "Tricky Words to Avoid").

> (Mmm. I find ones in StartBackgroundWorker and sepgsql_client_auth.)
>
> Shouldn't we use the latter than the former?  If that's true, it seems
> to me that test_shm_mq also needs the same amendment to avoid the same
> mistake in future.
>
> =====
> index 2e5914d5d9..a2d7474ed4 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -891,7 +891,7 @@ ParallelApplyWorkerMain(Datum main_arg)
>         if (!seg)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> -                                errmsg("unable to map dynamic shared memory segment")));
> +                                errmsg("could not map dynamic shared memory segment")));
>
>         toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg));
>         if (!toc)
> diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
> index 8807727337..005b56023b 100644
> --- a/src/test/modules/test_shm_mq/worker.c
> +++ b/src/test/modules/test_shm_mq/worker.c
> @@ -81,7 +81,7 @@ test_shm_mq_main(Datum main_arg)
>         if (seg == NULL)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> -                                errmsg("unable to map dynamic shared memory segment")));
> +                                errmsg("could not map dynamic shared memory segment")));
>         toc = shm_toc_attach(PG_TEST_SHM_MQ_MAGIC, dsm_segment_address(seg));
>         if (toc == NULL)
>                 ereport(ERROR,
> =====
>

Can you please start a new thread and post these changes, as we are
proposing to change an existing message as well?


[1] - https://www.postgresql.org/docs/devel/error-style-guide.html

-- 
With Regards,
Amit Kapila.



On Tue, Jan 10, 2023 at 10:26 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 9, 2023 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > > > Attach the updated patch set.
> > >
> > > Sorry, the commit message of 0001 was accidentally deleted, just
> > > attach the same patch set again with commit message.
> > >
> >
> > Pushed the first (0001) patch.
>
> Thanks for pushing, here are the remaining patches.
> I reordered the patch number to put patches that are easier to
> commit in the front of others.

I was looking into 0001. IMHO the pid should continue to represent the
main apply worker, so the pid will always show the main apply worker
which is actually receiving all the changes for the subscription (in
short, working as the logical receiver), and if it is applying changes
through a parallel worker then it should put the parallel worker's pid
in a new column called 'parallel_worker_pid' or
'parallel_apply_worker_pid', otherwise NULL.  Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, January 10, 2023 7:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Tue, Jan 10, 2023 at 10:26 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, January 9, 2023 4:51 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > > > Attach the updated patch set.
> > > >
> > > > Sorry, the commit message of 0001 was accidentally deleted, just
> > > > attach the same patch set again with commit message.
> > > >
> > >
> > > Pushed the first (0001) patch.
> >
> > Thanks for pushing, here are the remaining patches.
> > I reordered the patch number to put patches that are easier to commit
> > in the front of others.
> 
> I was looking into 0001, IMHO the pid should continue to represent the main
> apply worker. So the pid will always show the main apply worker which is
> actually receiving all the changes for the subscription (in short working as
> logical receiver) and if it is applying changes through a parallel worker then it
> should put the parallel worker pid in a new column called 'parallel_worker_pid'
> or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?

Thanks for the comment.

IIRC, you mean something like the following, right?
(sorry if I misunderstood)
--
For parallel apply worker:
'pid' column shows the pid of the leader, new column parallel_worker_pid shows its own pid

For leader apply worker:
'pid' column shows its own pid, new column parallel_worker_pid shows 0
--

If so, I am not sure the above is better, because it changes the meaning of
the existing 'pid' column: 'pid' would no longer represent the pid of the
worker itself. Besides, it seems inconsistent with what we have for
parallel query workers in pg_stat_activity. What do you think?

Best regards,
Hou zj



On Wed, Jan 11, 2023 at 9:34 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 10, 2023 7:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I was looking into 0001, IMHO the pid should continue to represent the main
> > apply worker. So the pid will always show the main apply worker which is
> > actually receiving all the changes for the subscription (in short working as
> > logical receiver) and if it is applying changes through a parallel worker then it
> > should put the parallel worker pid in a new column called 'parallel_worker_pid'
> > or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?
>
> Thanks for the comment.
>
> IIRC, you mean something like following, right ?
> (sorry if I misunderstood)
> --
> For parallel apply worker:
> 'pid' column shows the pid of the leader, new column parallel_worker_pid shows its own pid
>
> For leader apply worker:
> 'pid' column shows its own pid, new column parallel_worker_pid shows 0
> --
>
> If so, I am not sure if the above is better, because it is changing the
> existing column's('pid') meaning, the 'pid' will no longer represent the pid of
> the worker itself. Besides, it seems not consistent with what we have for
> parallel query workers in pg_stat_activity. What do you think ?
>

+1. I think it makes sense to keep it similar to pg_stat_activity.

+      <para>
+       Process ID of the leader apply worker, if this process is a apply
+       parallel worker. NULL if this process is a leader apply worker or a
+       synchronization worker.

Can we change the above description to something like: "Process ID of
the leader apply worker, if this process is a parallel apply worker.
NULL if this process is a leader apply worker or does not participate
in parallel apply, or a synchronization worker."?

-- 
With Regards,
Amit Kapila.



On Wed, Jan 11, 2023 at 9:34 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>

> > I was looking into 0001, IMHO the pid should continue to represent the main
> > apply worker. So the pid will always show the main apply worker which is
> > actually receiving all the changes for the subscription (in short working as
> > logical receiver) and if it is applying changes through a parallel worker then it
> > should put the parallel worker pid in a new column called 'parallel_worker_pid'
> > or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?
>
> Thanks for the comment.
>
> IIRC, you mean something like following, right ?
> (sorry if I misunderstood)
> --
> For parallel apply worker:
> 'pid' column shows the pid of the leader, new column parallel_worker_pid shows its own pid
>
> For leader apply worker:
> 'pid' column shows its own pid, new column parallel_worker_pid shows 0
> --
>
> If so, I am not sure if the above is better, because it is changing the
> existing column's('pid') meaning, the 'pid' will no longer represent the pid of
> the worker itself. Besides, it seems not consistent with what we have for
> parallel query workers in pg_stat_activity. What do you think ?

Actually, I always imagined the pid is the process id of the worker
which is actually receiving the changes for the subscriber. Keeping
the pid to represent the leader makes more sense.  But as you said,
that parallel worker for backend is already following the terminology
as you have in your patch to show the pid as the pid of the applying
worker so I am fine with the way you have.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Hi, here are some review comments for patch v78-0001.

======

General

1. (terminology)

AFAIK, until now we've been referring everywhere
(docs/comments/code) to the parent apply worker as the "leader apply
worker". Not the "main apply worker". Not the "apply leader worker".
Not any other variations...

From this POV I think the worker member "apply_leader_pid" would be
better named "leader_apply_pid",  but I see that this was already
committed to HEAD differently.

Maybe it is not possible (or you don't want) to change that internal
member name but IMO at least all the new code and docs should try to
be using consistent terminology (e.g. leader_apply_XXX) where
possible.

======

Commit message

2.

main_worker_pid is Process ID of the leader apply worker, if this process is a
apply parallel worker. NULL if this process is a leader apply worker or a
synchronization worker.

IIUC, this text is just cut/paste from monitoring.sgml. In a
review comment below I suggest some changes to that text, so this
commit message should also be changed to match.

~~

3.

The new column can make it easier to distinguish leader apply worker and apply
parallel worker which is also similar to the 'leader_pid' column in
pg_stat_activity.

SUGGESTION
The new column makes it easier to distinguish parallel apply workers
from other kinds of workers. It is implemented this way to be similar
to the 'leader_pid' column in pg_stat_activity.

======

doc/src/sgml/logical-replication.sgml

4.

+   being synchronized. Moreover, if the streaming transaction is applied in
+   parallel, there will be additional workers.

SUGGESTION
there will be additional workers -> there may be additional parallel
apply workers

======

doc/src/sgml/monitoring.sgml

5. pg_stat_subscription

@@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>apply_leader_pid</structfield> <type>integer</type>
+      </para>
+      <para>
+       Process ID of the leader apply worker, if this process is a apply
+       parallel worker. NULL if this process is a leader apply worker or a
+       synchronization worker.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>relid</structfield> <type>oid</type>
       </para>
       <para>
        OID of the relation that the worker is synchronizing; null for the
-       main apply worker
+       main apply worker and the parallel apply worker
       </para></entry>
      </row>

5a.

(Same as general comment #1 about terminology)

"apply_leader_pid" --> "leader_apply_pid"

~~

5b.

The current text feels awkward. I see it was copied from the similar
text of 'pg_stat_activity' but perhaps it can be simplified a bit.

SUGGESTION
Process ID of the leader apply worker if this process is a parallel
apply worker; otherwise NULL.

~~

5c.
BEFORE
null for the main apply worker and the parallel apply worker

AFTER
null for the leader apply worker and parallel apply workers

~~

5d.

        <structfield>relid</structfield> <type>oid</type>
       </para>
       <para>
        OID of the relation that the worker is synchronizing; null for the
-       main apply worker
+       main apply worker and the parallel apply worker
       </para></entry>


main apply worker -> leader apply worker

~~~

6.

@@ -3212,7 +3223,7 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
       </para>
       <para>
        Last write-ahead log location received, the initial value of
-       this field being 0
+       this field being 0; null for the parallel apply worker
       </para></entry>
      </row>

BEFORE
null for the parallel apply worker

AFTER
null for parallel apply workers

~~~

7.

@@ -3221,7 +3232,8 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
        <structfield>last_msg_send_time</structfield> <type>timestamp
with time zone</type>
       </para>
       <para>
-       Send time of last message received from origin WAL sender
+       Send time of last message received from origin WAL sender; null for the
+       parallel apply worker
       </para></entry>
      </row>

(same as #6)

BEFORE
null for the parallel apply worker

AFTER
null for parallel apply workers

~~~

8.

@@ -3230,7 +3242,8 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
        <structfield>last_msg_receipt_time</structfield>
<type>timestamp with time zone</type>
       </para>
       <para>
-       Receipt time of last message received from origin WAL sender
+       Receipt time of last message received from origin WAL sender; null for
+       the parallel apply worker
       </para></entry>
      </row>

(same as #6)

BEFORE
null for the parallel apply worker

AFTER
null for parallel apply workers

~~~

9.

@@ -3239,7 +3252,8 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
        <structfield>latest_end_lsn</structfield> <type>pg_lsn</type>
       </para>
       <para>
-       Last write-ahead log location reported to origin WAL sender
+       Last write-ahead log location reported to origin WAL sender; null for
+       the parallel apply worker
       </para></entry>
      </row>

(same as #6)

BEFORE
null for the parallel apply worker

AFTER
null for parallel apply workers

~~~

10.

@@ -3249,7 +3263,7 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
       </para>
       <para>
        Time of last write-ahead log location reported to origin WAL
-       sender
+       sender; null for the parallel apply worker
       </para></entry>
      </row>
     </tbody>

(same as #6)

BEFORE
null for the parallel apply worker

AFTER
null for parallel apply workers

======

src/backend/catalog/system_views.sql

11.

@@ -949,6 +949,7 @@ CREATE VIEW pg_stat_subscription AS
             su.oid AS subid,
             su.subname,
             st.pid,
+            st.apply_leader_pid,
             st.relid,
             st.received_lsn,
             st.last_msg_send_time,

(Same as general comment #1 about terminology)

"apply_leader_pid" --> "leader_apply_pid"

======

src/backend/replication/logical/launcher.c

12.

+ if (worker.apply_leader_pid == InvalidPid)
  nulls[3] = true;
  else
- values[3] = LSNGetDatum(worker.last_lsn);
- if (worker.last_send_time == 0)
+ values[3] = Int32GetDatum(worker.apply_leader_pid);
+

12a.

(Same as general comment #1 about terminology)

"apply_leader_pid" --> "leader_apply_pid"

~~

12b.

I wondered if the code here should be using the
isParallelApplyWorker(worker) macro for readability.

e.g.

if (isParallelApplyWorker(worker))
    values[3] = Int32GetDatum(worker.apply_leader_pid);
else
    nulls[3] = true;

======

src/include/catalog/pg_proc.dat

13.

+  proallargtypes =>
'{oid,oid,oid,int4,int4,pg_lsn,timestamptz,timestamptz,pg_lsn,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames =>

'{subid,subid,relid,pid,apply_leader_pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}',

(Same as general comment #1 about terminology)

"apply_leader_pid" --> "leader_apply_pid"

======

src/test/regress/expected/rules.out

14.

@@ -2094,6 +2094,7 @@ pg_stat_ssl| SELECT s.pid,
 pg_stat_subscription| SELECT su.oid AS subid,
     su.subname,
     st.pid,
+    st.apply_leader_pid,
     st.relid,
     st.received_lsn,
     st.last_msg_send_time,
@@ -2101,7 +2102,7 @@ pg_stat_subscription| SELECT su.oid AS subid,
     st.latest_end_lsn,
     st.latest_end_time
    FROM (pg_subscription su
-     LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid, relid,
pid, received_lsn, last_msg_send_time, last_msg_receipt_time,
latest_end_lsn, latest_end_time) ON ((st.subid = su.oid)));
+     LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid, relid,
pid, apply_leader_pid, received_lsn, last_msg_send_time,
last_msg_receipt_time, latest_end_lsn, latest_end_time) ON ((st.subid
= su.oid)));
 pg_stat_subscription_stats| SELECT ss.subid,
     s.subname,
     ss.apply_error_count,

(Same comment as elsewhere)

"apply_leader_pid" --> "leader_apply_pid"

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Jan 12, 2023 at 9:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
>
> doc/src/sgml/monitoring.sgml
>
> 5. pg_stat_subscription
>
> @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>apply_leader_pid</structfield> <type>integer</type>
> +      </para>
> +      <para>
> +       Process ID of the leader apply worker, if this process is a apply
> +       parallel worker. NULL if this process is a leader apply worker or a
> +       synchronization worker.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
>         <structfield>relid</structfield> <type>oid</type>
>        </para>
>        <para>
>         OID of the relation that the worker is synchronizing; null for the
> -       main apply worker
> +       main apply worker and the parallel apply worker
>        </para></entry>
>       </row>
>
> 5a.
>
> (Same as general comment #1 about terminology)
>
> "apply_leader_pid" --> "leader_apply_pid"
>

How about naming this as just leader_pid? I think it could be helpful
in the future if we decide to parallelize initial sync (aka parallel
copy) because then we could use this for the leader PID of parallel
sync workers as well.

-- 
With Regards,
Amit Kapila.



On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 9:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> >
> > doc/src/sgml/monitoring.sgml
> >
> > 5. pg_stat_subscription
> >
> > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM
> > pg_stat_activity WHERE wait_event i
> >
> >       <row>
> >        <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>apply_leader_pid</structfield> <type>integer</type>
> > +      </para>
> > +      <para>
> > +       Process ID of the leader apply worker, if this process is a apply
> > +       parallel worker. NULL if this process is a leader apply worker or a
> > +       synchronization worker.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> >         <structfield>relid</structfield> <type>oid</type>
> >        </para>
> >        <para>
> >         OID of the relation that the worker is synchronizing; null for the
> > -       main apply worker
> > +       main apply worker and the parallel apply worker
> >        </para></entry>
> >       </row>
> >
> > 5a.
> >
> > (Same as general comment #1 about terminology)
> >
> > "apply_leader_pid" --> "leader_apply_pid"
> >
>
> How about naming this as just leader_pid? I think it could be helpful
> in the future if we decide to parallelize initial sync (aka parallel
> copy) because then we could use this for the leader PID of parallel
> sync workers as well.
>
> --

I still prefer leader_apply_pid.
leader_pid does not tell which 'operation' it belongs to. 'apply'
gives the clarity that it is apply related process.

The terms used in the patch look very confusing. I had to read a few lines
multiple times to understand them.

1.
The summary says 'main_worker_pid' is to be added, but I do not see
'main_worker_pid' added in pg_stat_subscription; instead I see
'apply_leader_pid'. Am I missing something? Also, as stated above,
'leader_apply_pid' makes more sense.
It is better to correct it everywhere (apply leader --> leader apply).
Once that is done, it can be reviewed again.

thanks
Shveta



On Thu, Jan 12, 2023 at 4:21 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > >
> > > doc/src/sgml/monitoring.sgml
> > >
> > > 5. pg_stat_subscription
> > >
> > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM
> > > pg_stat_activity WHERE wait_event i
> > >
> > >       <row>
> > >        <entry role="catalog_table_entry"><para role="column_definition">
> > > +       <structfield>apply_leader_pid</structfield> <type>integer</type>
> > > +      </para>
> > > +      <para>
> > > +       Process ID of the leader apply worker, if this process is a apply
> > > +       parallel worker. NULL if this process is a leader apply worker or a
> > > +       synchronization worker.
> > > +      </para></entry>
> > > +     </row>
> > > +
> > > +     <row>
> > > +      <entry role="catalog_table_entry"><para role="column_definition">
> > >         <structfield>relid</structfield> <type>oid</type>
> > >        </para>
> > >        <para>
> > >         OID of the relation that the worker is synchronizing; null for the
> > > -       main apply worker
> > > +       main apply worker and the parallel apply worker
> > >        </para></entry>
> > >       </row>
> > >
> > > 5a.
> > >
> > > (Same as general comment #1 about terminology)
> > >
> > > "apply_leader_pid" --> "leader_apply_pid"
> > >
> >
> > How about naming this as just leader_pid? I think it could be helpful
> > in the future if we decide to parallelize initial sync (aka parallel
> > copy) because then we could use this for the leader PID of parallel
> > sync workers as well.
> >
> > --
>
> I still prefer leader_apply_pid.
> leader_pid does not tell which 'operation' it belongs to. 'apply'
> gives the clarity that it is apply related process.
>

But then do you suggest that tomorrow if we allow parallel sync
workers then we have a separate column leader_sync_pid? I think that
doesn't sound like a good idea and moreover one can refer to docs for
clarification.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Thursday, January 12, 2023 7:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Jan 12, 2023 at 4:21 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > > >
> > > >
> > > > doc/src/sgml/monitoring.sgml
> > > >
> > > > 5. pg_stat_subscription
> > > >
> > > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event
> > > > FROM pg_stat_activity WHERE wait_event i
> > > >
> > > >       <row>
> > > >        <entry role="catalog_table_entry"><para
> > > > role="column_definition">
> > > > +       <structfield>apply_leader_pid</structfield>
> <type>integer</type>
> > > > +      </para>
> > > > +      <para>
> > > > +       Process ID of the leader apply worker, if this process is a apply
> > > > +       parallel worker. NULL if this process is a leader apply worker or a
> > > > +       synchronization worker.
> > > > +      </para></entry>
> > > > +     </row>
> > > > +
> > > > +     <row>
> > > > +      <entry role="catalog_table_entry"><para
> > > > + role="column_definition">
> > > >         <structfield>relid</structfield> <type>oid</type>
> > > >        </para>
> > > >        <para>
> > > >         OID of the relation that the worker is synchronizing; null for the
> > > > -       main apply worker
> > > > +       main apply worker and the parallel apply worker
> > > >        </para></entry>
> > > >       </row>
> > > >
> > > > 5a.
> > > >
> > > > (Same as general comment #1 about terminology)
> > > >
> > > > "apply_leader_pid" --> "leader_apply_pid"
> > > >
> > >
> > > How about naming this as just leader_pid? I think it could be
> > > helpful in the future if we decide to parallelize initial sync (aka
> > > parallel
> > > copy) because then we could use this for the leader PID of parallel
> > > sync workers as well.
> > >
> > > --
> >
> > I still prefer leader_apply_pid.
> > leader_pid does not tell which 'operation' it belongs to. 'apply'
> > gives the clarity that it is apply related process.
> >
> 
> But then do you suggest that tomorrow if we allow parallel sync workers then
> we have a separate column leader_sync_pid? I think that doesn't sound like a
> good idea and moreover one can refer to docs for clarification.

I agree that leader_pid would be better not only for future parallel copy sync feature,
but also it's more consistent with the leader_pid column in pg_stat_activity.

And here is the version patch which addressed Peter's comments and renamed all
the related stuff to leader_pid.

Best Regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Thursday, January 12, 2023 12:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Hi, here are some review comments for patch v78-0001.

Thanks for your comments.

> ======
> 
> General
> 
> 1. (terminology)
> 
> AFAIK everywhere until now we’ve been referring everywhere
> (docs/comments/code) to the parent apply worker as the "leader apply
> worker". Not the "main apply worker". Not the "apply leader worker".
> Not any other variations...
> 
> From this POV I think the worker member "apply_leader_pid" would be better
> named "leader_apply_pid",  but I see that this was already committed to
> HEAD differently.
> 
> Maybe it is not possible (or you don't want) to change that internal member
> name but IMO at least all the new code and docs should try to be using
> consistent terminology (e.g. leader_apply_XXX) where possible.
> 
> ======
> 
> Commit message
> 
> 2.
> 
> main_worker_pid is Process ID of the leader apply worker, if this process is a
> apply parallel worker. NULL if this process is a leader apply worker or a
> synchronization worker.
> 
> IIUC, this text is just cut/paste from the monitoring.sgml. In a review comment
> below I suggest some changes to that text, so then this commit message
> should also change to be the same.

Changed.

> ~~
> 
> 3.
> 
> The new column can make it easier to distinguish leader apply worker and
> apply parallel worker which is also similar to the 'leader_pid' column in
> pg_stat_activity.
> 
> SUGGESTION
> The new column makes it easier to distinguish parallel apply workers from
> other kinds of workers. It is implemented this way to be similar to the
> 'leader_pid' column in pg_stat_activity.

Changed.

> ======
> 
> doc/src/sgml/logical-replication.sgml
> 
> 4.
> 
> +   being synchronized. Moreover, if the streaming transaction is applied in
> +   parallel, there will be additional workers.
> 
> SUGGESTION
> there will be additional workers -> there may be additional parallel apply
> workers

Changed.

> ======
> 
> doc/src/sgml/monitoring.sgml
> 
> 5. pg_stat_subscription
> 
> @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
> 
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>apply_leader_pid</structfield> <type>integer</type>
> +      </para>
> +      <para>
> +       Process ID of the leader apply worker, if this process is a apply
> +       parallel worker. NULL if this process is a leader apply worker or a
> +       synchronization worker.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
>         <structfield>relid</structfield> <type>oid</type>
>        </para>
>        <para>
>         OID of the relation that the worker is synchronizing; null for the
> -       main apply worker
> +       main apply worker and the parallel apply worker
>        </para></entry>
>       </row>
> 
> 5a.
> 
> (Same as general comment #1 about terminology)
> 
> "apply_leader_pid" --> "leader_apply_pid"

I changed this and all related stuff to "leader_pid" as I agree with Amit that
this might be useful for future features and is more consistent with the
leader_pid in pg_stat_activity.

> 
> ~~
> 
> 5b.
> 
> The current text feels awkward. I see it was copied from the similar text of
> 'pg_stat_activity' but perhaps it can be simplified a bit.
> 
> SUGGESTION
> Process ID of the leader apply worker if this process is a parallel apply worker;
> otherwise NULL.

I slightly adjusted this according to Amit's suggestion, which I think provides
more information:

"Process ID of the leader apply worker, if this process is a parallel apply worker.
NULL if this process is a leader apply worker or does not participate in parallel apply, or a synchronization worker."

> ~~
> 
> 5c.
> BEFORE
> null for the main apply worker and the parallel apply worker
> 
> AFTER
> null for the leader apply worker and parallel apply workers

Changed.

> ~~
> 
> 5c.
> 
>         <structfield>relid</structfield> <type>oid</type>
>        </para>
>        <para>
>         OID of the relation that the worker is synchronizing; null for the
> -       main apply worker
> +       main apply worker and the parallel apply worker
>        </para></entry>
> 
> 
> main apply worker -> leader apply worker
> 

Changed.

> ~~~
> 
> 6.
> 
> @@ -3212,7 +3223,7 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>        </para>
>        <para>
>         Last write-ahead log location received, the initial value of
> -       this field being 0
> +       this field being 0; null for the parallel apply worker
>        </para></entry>
>       </row>
> 
> BEFORE
> null for the parallel apply worker
> 
> AFTER
> null for parallel apply workers
> 

Changed.

> ~~~
> 
> 7.
> 
> @@ -3221,7 +3232,8 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>         <structfield>last_msg_send_time</structfield> <type>timestamp
> with time zone</type>
>        </para>
>        <para>
> -       Send time of last message received from origin WAL sender
> +       Send time of last message received from origin WAL sender; null for
> the
> +       parallel apply worker
>        </para></entry>
>       </row>
> 
> (same as #6)
> 
> BEFORE
> null for the parallel apply worker
> 
> AFTER
> null for parallel apply workers
> 

Changed.

> ~~~
> 
> 8.
> 
> @@ -3230,7 +3242,8 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>         <structfield>last_msg_receipt_time</structfield>
> <type>timestamp with time zone</type>
>        </para>
>        <para>
> -       Receipt time of last message received from origin WAL sender
> +       Receipt time of last message received from origin WAL sender; null for
> +       the parallel apply worker
>        </para></entry>
>       </row>
> 
> (same as #6)
> 
> BEFORE
> null for the parallel apply worker
> 
> AFTER
> null for parallel apply workers
> 

Changed.

> ~~~
> 
> 9.
> 
> @@ -3239,7 +3252,8 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>         <structfield>latest_end_lsn</structfield> <type>pg_lsn</type>
>        </para>
>        <para>
> -       Last write-ahead log location reported to origin WAL sender
> +       Last write-ahead log location reported to origin WAL sender; null for
> +       the parallel apply worker
>        </para></entry>
>       </row>
> 
> (same as #6)
> 
> BEFORE
> null for the parallel apply worker
> 
> AFTER
> null for parallel apply workers
> 

Changed.

> ~~~
> 
> 10.
> 
> @@ -3249,7 +3263,7 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>        </para>
>        <para>
>         Time of last write-ahead log location reported to origin WAL
> -       sender
> +       sender; null for the parallel apply worker
>        </para></entry>
>       </row>
>      </tbody>
> 
> (same as #6)
> 
> BEFORE
> null for the parallel apply worker
> 
> AFTER
> null for parallel apply workers
> 

Changed.

> 12b.
> 
> I wondered if here the code should be using the
> isParallelApplyWorker(worker) macro here for readability.
> 
> e.g.
> 
> if (isParallelApplyWorker(worker))
> values[3] = Int32GetDatum(worker.apply_leader_pid);
> else
>   nulls[3] = true;

Changed.

Best Regards,
Hou Zhijie

Here are my review comments for v79-0001.

======

General

1.

When Amit suggested [1] changing the name just to "leader_pid" instead
of "leader_apply_pid" I thought he was only referring to changing the
view column name, not also the internal member names of the worker
structure. Maybe it is OK anyway, but please check if that was the
intention.

======

Commit message

2.

leader_pid is the process ID of the leader apply worker if this process is a
parallel apply worker. If this field is NULL, it indicates that the process is
a leader apply worker or does not participate in parallel apply, or a
synchronization worker.

~

This text is just cut/paste from the monitoring.sgml. In a review
comment below I suggest some changes to that text, so then this commit
message should also change to be the same.

======

doc/src/sgml/monitoring.sgml

3.

       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>leader_pid</structfield> <type>integer</type>
+      </para>
+      <para>
+       Process ID of the leader apply worker if this process is a parallel
+       apply worker; NULL if this process is a leader apply worker or does not
+       participate in parallel apply, or a synchronization worker
+      </para></entry>

I felt this change is giving too many details and ended up just
muddying the water.

E.g. Now this says basically "NULL if AAA or BBB, or CCC", but that
makes it sound like there are 3 other things the process could be
instead of a parallel worker. But that is not really true unless
you are making some distinction between the main "apply worker" which
is a leader versus a main apply worker which is not a leader. IMO we
should not be making any distinction at all - the leader apply worker
and the main (not leader) apply worker are one-and-the-same process.

So, I still prefer my previous suggestion (see [2] #5b).

======

src/backend/catalog/system_views.sql

4.

@@ -949,6 +949,7 @@ CREATE VIEW pg_stat_subscription AS
             su.oid AS subid,
             su.subname,
             st.pid,
+            st.leader_pid,
             st.relid,
             st.received_lsn,
             st.last_msg_send_time,

IMO it would be very useful to have an additional "kind" attribute for
this view. This will save the user from needing to do mental
gymnastics every time just to recognise what kind of process they are
looking at.

For example, I tried this:

CREATE VIEW pg_stat_subscription AS
    SELECT
            su.oid AS subid,
            su.subname,
            CASE
                WHEN st.relid IS NOT NULL THEN 'tablesync'
                WHEN st.leader_pid IS NOT NULL THEN 'parallel apply'
                ELSE 'leader apply'
            END AS kind,
            st.pid,
            st.leader_pid,
            st.relid,
            st.received_lsn,
            st.last_msg_send_time,
            st.last_msg_receipt_time,
            st.latest_end_lsn,
            st.latest_end_time
    FROM pg_subscription su
            LEFT JOIN pg_stat_get_subscription(NULL) st
                      ON (st.subid = su.oid);


and it results in much more readable output IMO:

test_sub=# select * from pg_stat_subscription;
 subid | subname |     kind     | pid  | leader_pid | relid |
received_lsn |      last_msg_send_time       |
last_msg_receipt_time     | lat
est_end_lsn |        latest_end_time

-------+---------+--------------+------+------------+-------+--------------+-------------------------------+-------------------------------+----
------------+-------------------------------
 16388 | sub1    | leader apply | 5281 |            |       |
0/1901378    | 2023-01-13 12:39:03.984249+11 | 2023-01-13
12:39:03.986157+11 | 0/1
901378      | 2023-01-13 12:39:03.984249+11
(1 row)

Thoughts?


------
[1] Amit - https://www.postgresql.org/message-id/CAA4eK1KYUbnthSPyo4VjnhMygB0c1DZtp0XC-V2-GSETQ743ww%40mail.gmail.com
[2] My v78-0001 review -
https://www.postgresql.org/message-id/CAHut%2BPvA10Bp9Jaw9OS2%2BpuKHr7ry_xB3Tf2-bbv5gyxD5E_gw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Jan 12, 2023 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> But then do you suggest that tomorrow if we allow parallel sync
> workers then we have a separate column leader_sync_pid? I think that
> doesn't sound like a good idea and moreover one can refer to docs for
> clarification.
>
> --
okay, leader_pid is fine I think.

thanks
Shveta



On Fri, Jan 13, 2023 at 7:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for v79-0001.
>
> ======
>
> General
>
> 1.
>
> When Amit suggested [1] changing the name just to "leader_pid" instead
> of "leader_apply_pid" I thought he was only referring to changing the
> view column name, not also the internal member names of the worker
> structure. Maybe it is OK anyway, but please check if that was the
> intention.
>

Yes, that was the intention.

>
> 3.
>
>        <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>leader_pid</structfield> <type>integer</type>
> +      </para>
> +      <para>
> +       Process ID of the leader apply worker if this process is a parallel
> +       apply worker; NULL if this process is a leader apply worker or does not
> +       participate in parallel apply, or a synchronization worker
> +      </para></entry>
>
> I felt this change is giving too many details and ended up just
> muddying the water.
>

I see that we give a similar description for other parameters as well.
For example leader_pid in pg_stat_activity, see client_dn,
client_serial in pg_stat_ssl. It is better to be consistent here and
this gives the reader a bit more information when the value is NULL
for the new column.

>
> 4.
>
> @@ -949,6 +949,7 @@ CREATE VIEW pg_stat_subscription AS
>              su.oid AS subid,
>              su.subname,
>              st.pid,
> +            st.leader_pid,
>              st.relid,
>              st.received_lsn,
>              st.last_msg_send_time,
>
> IMO it would be very useful to have an additional "kind" attribute for
> this view. This will save the user from needing to do mental
> gymnastics every time just to recognise what kind of process they are
> looking at.
>

This could be a separate enhancement as the same should be true for
sync workers.

-- 
With Regards,
Amit Kapila.



On Fri, Jan 13, 2023 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 7:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> >
> > 3.
> >
> >        <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>leader_pid</structfield> <type>integer</type>
> > +      </para>
> > +      <para>
> > +       Process ID of the leader apply worker if this process is a parallel
> > +       apply worker; NULL if this process is a leader apply worker or does not
> > +       participate in parallel apply, or a synchronization worker
> > +      </para></entry>
> >
> > I felt this change is giving too many details and ended up just
> > muddying the water.
> >
>
> I see that we give a similar description for other parameters as well.
> For example leader_pid in pg_stat_activity,
>

BTW, shouldn't we update leader_pid column in pg_stat_activity as well
to display apply leader PID for parallel apply workers? It will
currently display for other parallel operations like a parallel
vacuum, so I don't see a reason to not do the same for parallel apply
workers.

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From: Masahiko Sawada
On Fri, Jan 13, 2023 at 1:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 7:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> >
> > >
> > > 3.
> > >
> > >        <entry role="catalog_table_entry"><para role="column_definition">
> > > +       <structfield>leader_pid</structfield> <type>integer</type>
> > > +      </para>
> > > +      <para>
> > > +       Process ID of the leader apply worker if this process is a parallel
> > > +       apply worker; NULL if this process is a leader apply worker or does not
> > > +       participate in parallel apply, or a synchronization worker
> > > +      </para></entry>
> > >
> > > I felt this change is giving too many details and ended up just
> > > muddying the water.
> > >
> >
> > I see that we give a similar description for other parameters as well.
> > For example leader_pid in pg_stat_activity,
> >
>
> BTW, shouldn't we update leader_pid column in pg_stat_activity as well
> to display apply leader PID for parallel apply workers? It will
> currently display for other parallel operations like a parallel
> vacuum, so I don't see a reason to not do the same for parallel apply
> workers.

+1

The parallel apply workers have different properties than the parallel
query workers since they execute different transactions and don't use
group locking but it would be a good hint for users to show the leader
and parallel apply worker processes are related. If users want to
check only parallel query workers they can use the backend_type
column.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 13, 2023 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 3.
> >
> >        <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>leader_pid</structfield> <type>integer</type>
> > +      </para>
> > +      <para>
> > +       Process ID of the leader apply worker if this process is a parallel
> > +       apply worker; NULL if this process is a leader apply worker or does not
> > +       participate in parallel apply, or a synchronization worker
> > +      </para></entry>
> >
> > I felt this change is giving too many details and ended up just
> > muddying the water.
> >
>
> I see that we give a similar description for other parameters as well.
> For example leader_pid in pg_stat_activity, see client_dn,
> client_serial in pg_stat_ssl. It is better to be consistent here and
> this gives the reader a bit more information when the value is NULL
> for the new column.
>

It is OK to give extra details as those other examples do, but my
point -- where I wrote "the leader apply worker and the (not leader)
apply worker are one-and-the-same process" -- was that there are
currently only 3 kinds of workers possible (leader apply, parallel
apply, tablesync). If it is not a "parallel apply" worker then it can
only be one of the other 2. So I think it is sufficient and less
confusing to say:

Process ID of the leader apply worker if this process is a parallel
apply worker; NULL if this process is a leader apply worker or a
synchronization worker.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Perform streaming logical transactions by background workers and parallel apply

From: Masahiko Sawada
On Thu, Jan 12, 2023 at 9:34 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, January 12, 2023 7:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 4:21 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > > > >
> > > > >
> > > > > doc/src/sgml/monitoring.sgml
> > > > >
> > > > > 5. pg_stat_subscription
> > > > >
> > > > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event
> > > > > FROM pg_stat_activity WHERE wait_event i
> > > > >
> > > > >       <row>
> > > > >        <entry role="catalog_table_entry"><para
> > > > > role="column_definition">
> > > > > +       <structfield>apply_leader_pid</structfield>
> > <type>integer</type>
> > > > > +      </para>
> > > > > +      <para>
> > > > > +       Process ID of the leader apply worker, if this process is a apply
> > > > > +       parallel worker. NULL if this process is a leader apply worker or a
> > > > > +       synchronization worker.
> > > > > +      </para></entry>
> > > > > +     </row>
> > > > > +
> > > > > +     <row>
> > > > > +      <entry role="catalog_table_entry"><para
> > > > > + role="column_definition">
> > > > >         <structfield>relid</structfield> <type>oid</type>
> > > > >        </para>
> > > > >        <para>
> > > > >         OID of the relation that the worker is synchronizing; null for the
> > > > > -       main apply worker
> > > > > +       main apply worker and the parallel apply worker
> > > > >        </para></entry>
> > > > >       </row>
> > > > >
> > > > > 5a.
> > > > >
> > > > > (Same as general comment #1 about terminology)
> > > > >
> > > > > "apply_leader_pid" --> "leader_apply_pid"
> > > > >
> > > >
> > > > How about naming this as just leader_pid? I think it could be
> > > > helpful in the future if we decide to parallelize initial sync (aka
> > > > parallel
> > > > copy) because then we could use this for the leader PID of parallel
> > > > sync workers as well.
> > > >
> > > > --
> > >
> > > I still prefer leader_apply_pid.
> > > leader_pid does not tell which 'operation' it belongs to. 'apply'
> > > gives the clarity that it is apply related process.
> > >
> >
> > But then do you suggest that tomorrow if we allow parallel sync workers then
> > we have a separate column leader_sync_pid? I think that doesn't sound like a
> > good idea and moreover one can refer to docs for clarification.
>
> I agree that leader_pid would be better not only for future parallel copy sync feature,
> but also it's more consistent with the leader_pid column in pg_stat_activity.
>
> And here is the version patch which addressed Peter's comments and renamed all
> the related stuff to leader_pid.

Here are two comments on v79-0003 patch.

+        /* Force to serialize messages if stream_serialize_threshold
is reached. */
+        if (stream_serialize_threshold != -1 &&
+                (stream_serialize_threshold == 0 ||
+                 stream_serialize_threshold < parallel_stream_nchunks))
+        {
+                parallel_stream_nchunks = 0;
+                return false;
+        }

I think it would be better if we show the log message "logical
replication apply worker will serialize the remaining changes of
remote transaction %u to a file" even in the stream_serialize_threshold
case.

IIUC parallel_stream_nchunks won't be reset if pa_send_data() failed
due to the timeout.
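
For example, something along these lines (just a rough sketch; the call site
and the s->len/s->data arguments are assumptions, only parallel_stream_nchunks
and pa_send_data() are from the patch):

    if (!pa_send_data(winfo, s->len, s->data))
    {
        /* Reset the chunk counter on the timeout path as well. */
        parallel_stream_nchunks = 0;

        /* ... fall back to serializing the remaining changes ... */
    }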

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Here are some review comments for patch v79-0002.

======

General

1.

I saw that earlier in this thread Hou-san [1] and Amit [2] also seemed
to say there is not much point for this patch.

So I wanted to +1 that same opinion.

I feel this patch just adds more complexity for almost no gain:
- reducing the 'max_parallel_apply_workers_per_subscription' seems not very
common in the first place.
- even when the GUC is reduced, at that point in time all the workers
might be in use so there may be nothing that can be immediately done.
- IIUC the excess workers (for a reduced GUC) are going to get freed
naturally anyway over time as more transactions are completed so the
pool size will reduce accordingly.


~

OTOH some refactoring parts of this patch (e.g. the new pa_stop_worker
function) look better to me. I would keep those ones but remove all
the pa_stop_idle_workers function/call.


*** NOTE: The remainder of these review comments are maybe only
relevant if you are going to keep this pa_stop_idle_workers
behaviour...

======

Commit message

2.

If the max_parallel_apply_workers_per_subscription is changed to a
lower value, try to stop free workers in the pool to keep the number of
workers lower than half of the max_parallel_apply_workers_per_subscription

SUGGESTION

If the GUC max_parallel_apply_workers_per_subscription is changed to a
lower value, try to stop unused workers to keep the pool size lower
than half of max_parallel_apply_workers_per_subscription.


======

.../replication/logical/applyparallelworker.c

3. pa_free_worker

if (winfo->serialize_changes ||
list_length(ParallelApplyWorkerPool) >
(max_parallel_apply_workers_per_subscription / 2))
{
pa_stop_worker(winfo);
return;
}

winfo->in_use = false;
winfo->serialize_changes = false;

~

IMO the above code can be more neatly written using if/else because
then there is only one return point, and there is a place to write the
explanatory comment about the else.

SUGGESTION

if (winfo->serialize_changes ||
list_length(ParallelApplyWorkerPool) >
(max_parallel_apply_workers_per_subscription / 2))
{
pa_stop_worker(winfo);
}
else
{
/* Don't stop the worker. Only mark it available for re-use. */
winfo->in_use = false;
winfo->serialize_changes = false;
}

======

src/backend/replication/logical/worker.c

4. pa_stop_idle_workers

/*
 * Try to stop parallel apply workers that are not in use to keep the number of
 * workers lower than half of the max_parallel_apply_workers_per_subscription.
 */
void
pa_stop_idle_workers(void)
{
List    *active_workers;
ListCell   *lc;
int max_applyworkers = max_parallel_apply_workers_per_subscription / 2;

if (list_length(ParallelApplyWorkerPool) <= max_applyworkers)
return;

active_workers = list_copy(ParallelApplyWorkerPool);

foreach(lc, active_workers)
{
ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

pa_stop_worker(winfo);

/* Recheck the number of workers. */
if (list_length(ParallelApplyWorkerPool) <= max_applyworkers)
break;
}

list_free(active_workers);
}

~

4a. function comment

SUGGESTION

Try to keep the worker pool size lower than half of the
max_parallel_apply_workers_per_subscription.

~

4b. function name

This is not stopping all idle workers, so maybe a more meaningful name
for this function is something more like "pa_reduce_workerpool"

~

4c.

IMO the "max_applyworkers" var is a misleading name. Maybe something
like "goal_poolsize" is better?

~

4d.

Maybe I misunderstand the logic for the pool, but shouldn't this be
checking the winfo->in_use flag before blindly stopping each worker?
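
For example, something like the following (an illustrative sketch only;
'goal_poolsize' is the renamed variable suggested in 4c, the other names are
from the quoted patch):

foreach(lc, active_workers)
{
    ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

    /* Leave busy workers alone; only stop ones sitting idle in the pool. */
    if (winfo->in_use)
        continue;

    pa_stop_worker(winfo);

    /* Recheck the pool size. */
    if (list_length(ParallelApplyWorkerPool) <= goal_poolsize)
        break;
}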


======

src/backend/replication/logical/worker.c

5.

@@ -3630,6 +3630,13 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  {
  ConfigReloadPending = false;
  ProcessConfigFile(PGC_SIGHUP);
+
+ /*
+ * Try to stop free workers in the pool in case the
+ * max_parallel_apply_workers_per_subscription is changed to a
+ * lower value.
+ */
+ pa_stop_idle_workers();
  }
5a.

SUGGESTED COMMENT
If max_parallel_apply_workers_per_subscription is changed to a lower
value, try to reduce the worker pool to match.

~

5b.

Instead of unconditionally calling pa_stop_idle_workers, shouldn't
this code compare the value of
max_parallel_apply_workers_per_subscription before/after the
ProcessConfigFile so it only calls if the GUC was lowered?
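
e.g. something like this sketch (illustration only; 'prev_max' is new, the
other names are from the quoted hunk):

if (ConfigReloadPending)
{
    int prev_max = max_parallel_apply_workers_per_subscription;

    ConfigReloadPending = false;
    ProcessConfigFile(PGC_SIGHUP);

    /* Only try to shrink the worker pool if the GUC was actually lowered. */
    if (max_parallel_apply_workers_per_subscription < prev_max)
        pa_stop_idle_workers();
}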


------
[1] Hou-san -
https://www.postgresql.org/message-id/OS0PR01MB5716E527412A3481F90B4397941A9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] Amit -
https://www.postgresql.org/message-id/CAA4eK1J%3D9m-VNRMHCqeG8jpX0CTn3Ciad2o4H-ogrZMDJ3tn4w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Friday, January 13, 2023 1:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Fri, Jan 13, 2023 at 1:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 7:56 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > > >
> > >
> > > >
> > > > 3.
> > > >
> > > >        <entry role="catalog_table_entry"><para
> > > > role="column_definition">
> > > > +       <structfield>leader_pid</structfield> <type>integer</type>
> > > > +      </para>
> > > > +      <para>
> > > > +       Process ID of the leader apply worker if this process is a parallel
> > > > +       apply worker; NULL if this process is a leader apply worker or
> does not
> > > > +       participate in parallel apply, or a synchronization worker
> > > > +      </para></entry>
> > > >
> > > > I felt this change is giving too many details and ended up just
> > > > muddying the water.
> > > >
> > >
> > > I see that we give a similar description for other parameters as well.
> > > For example leader_pid in pg_stat_activity,
> > >
> >
> > BTW, shouldn't we update leader_pid column in pg_stat_activity as well
> > to display apply leader PID for parallel apply workers? It will
> > currently display for other parallel operations like a parallel
> > vacuum, so I don't see a reason to not do the same for parallel apply
> > workers.
> 
> +1
> 
> The parallel apply workers have different properties than the parallel query
> workers since they execute different transactions and don't use group locking
> but it would be a good hint for users to show the leader and parallel apply
> worker processes are related. If users want to check only parallel query workers
> they can use the backend_type column.

Agreed, and changed as suggested.

Attach the new version patch set which address the comments so far.

Best Regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Friday, January 13, 2023 1:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Jan 12, 2023 at 9:34 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, January 12, 2023 7:08 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 4:21 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> > > >
> > > > On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > wrote:
> > > > >
> > > > > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith
> > > > > <smithpb2250@gmail.com>
> > > wrote:
> > > > > >
> > > > > >
> > > > > > doc/src/sgml/monitoring.sgml
> > > > > >
> > > > > > 5. pg_stat_subscription
> > > > > >
> > > > > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type,
> > > > > > wait_event FROM pg_stat_activity WHERE wait_event i
> > > > > >
> > > > > >       <row>
> > > > > >        <entry role="catalog_table_entry"><para
> > > > > > role="column_definition">
> > > > > > +       <structfield>apply_leader_pid</structfield>
> > > <type>integer</type>
> > > > > > +      </para>
> > > > > > +      <para>
> > > > > > +       Process ID of the leader apply worker, if this process is a
> apply
> > > > > > +       parallel worker. NULL if this process is a leader apply worker
> or a
> > > > > > +       synchronization worker.
> > > > > > +      </para></entry>
> > > > > > +     </row>
> > > > > > +
> > > > > > +     <row>
> > > > > > +      <entry role="catalog_table_entry"><para
> > > > > > + role="column_definition">
> > > > > >         <structfield>relid</structfield> <type>oid</type>
> > > > > >        </para>
> > > > > >        <para>
> > > > > >         OID of the relation that the worker is synchronizing; null for
> the
> > > > > > -       main apply worker
> > > > > > +       main apply worker and the parallel apply worker
> > > > > >        </para></entry>
> > > > > >       </row>
> > > > > >
> > > > > > 5a.
> > > > > >
> > > > > > (Same as general comment #1 about terminology)
> > > > > >
> > > > > > "apply_leader_pid" --> "leader_apply_pid"
> > > > > >
> > > > >
> > > > > How about naming this as just leader_pid? I think it could be
> > > > > helpful in the future if we decide to parallelize initial sync
> > > > > (aka parallel
> > > > > copy) because then we could use this for the leader PID of
> > > > > parallel sync workers as well.
> > > > >
> > > > > --
> > > >
> > > > I still prefer leader_apply_pid.
> > > > leader_pid does not tell which 'operation' it belongs to. 'apply'
> > > > gives the clarity that it is apply related process.
> > > >
> > >
> > > But then do you suggest that tomorrow if we allow parallel sync
> > > workers then we have a separate column leader_sync_pid? I think that
> > > doesn't sound like a good idea and moreover one can refer to docs for
> clarification.
> >
> > I agree that leader_pid would be better not only for future parallel
> > copy sync feature, but also it's more consistent with the leader_pid column in
> pg_stat_activity.
> >
> > And here is the version patch which addressed Peter's comments and
> > renamed all the related stuff to leader_pid.
> 
> Here are two comments on v79-0003 patch.

Thanks for the comments.

> 
> +        /* Force to serialize messages if stream_serialize_threshold
> is reached. */
> +        if (stream_serialize_threshold != -1 &&
> +                (stream_serialize_threshold == 0 ||
> +                 stream_serialize_threshold < parallel_stream_nchunks))
> +        {
> +                parallel_stream_nchunks = 0;
> +                return false;
> +        }
> 
> I think it would be better if we show the log message ""logical replication apply
> worker will serialize the remaining changes of remote transaction %u to a file"
> even in stream_serialize_threshold case.

Agreed and changed.

> 
> IIUC parallel_stream_nchunks won't be reset if pa_send_data() failed due to the
> timeout.

Changed.

Best Regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj.fnst@fujitsu.com"
On Friday, January 13, 2023 2:20 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are some review comments for patch v79-0002.

Thanks for your comments.

> ======
> 
> General
> 
> 1.
> 
> I saw that earlier in this thread Hou-san [1] and Amit [2] also seemed to say
> there is not much point for this patch.
> 
> So I wanted to +1 that same opinion.
> 
> I feel this patch just adds more complexity for almost no gain:
> - reducing the 'max_parallel_apply_workers_per_subscription' seems not very
> common in the first place.
> - even when the GUC is reduced, at that point in time all the workers might be in
> use so there may be nothing that can be immediately done.
> - IIUC the excess workers (for a reduced GUC) are going to get freed naturally
> anyway over time as more transactions are completed so the pool size will
> reduce accordingly.

I need to think it over, and we can have a detailed discussion after committing
the first patch. So I didn't address the comments for 0002 for now.

Best Regards,
Hou zj

On Fri, Jan 13, 2023 at 3:44 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 13, 2023 1:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Jan 12, 2023 at 9:34 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:

In GetLogicalLeaderApplyWorker(), we can use a shared lock instead of
an exclusive one as we are just reading the workers array. Also, the
function name looks a bit odd to me, so I changed it to
GetLeaderApplyWorkerPid(), and it is better to use InvalidPid
instead of 0 when there is no valid value for leader_pid in
GetLeaderApplyWorkerPid(). Apart from that, I have made minor changes
to the comments, docs, and commit message. I am planning to push this
next week by Tuesday unless you or others have any major comments.

-- 
With Regards,
Amit Kapila.

Hi,

I think there's a bug in how get_transaction_apply_action() interacts
with handle_streamed_transaction() to decide whether the transaction is
streamed or not. Originally, the code was simply:

    /* not in streaming mode */
    if (!in_streamed_transaction)
        return false;

But now this decision was moved to get_transaction_apply_action(), which
does this:

    if (am_parallel_apply_worker())
    {
        return TRANS_PARALLEL_APPLY;
    }
    else if (in_remote_transaction)
    {
        return TRANS_LEADER_APPLY;
    }

and handle_streamed_transaction() then uses the result like this:

    /* not in streaming mode */
    if (apply_action == TRANS_LEADER_APPLY)
        return false;

Notice this is not equal to the original behavior, because the two flags
(in_remote_transaction and in_streamed_transaction) are not inverse.
That is,

   in_remote_transaction=false

does not imply we're processing streamed transaction. It's allowed both
flags are false, i.e. a change may be "non-transactional" and not
streamed, though the only example of such thing in the protocol are
logical messages. Which are however ignored in the apply worker, so I'm
not surprised no existing test failed on this.

So I think get_transaction_apply_action() should do this:

    if (am_parallel_apply_worker())
    {
        return TRANS_PARALLEL_APPLY;
    }
    else if (!in_streamed_transaction)
    {
        return TRANS_LEADER_APPLY;
    }

FWIW I've noticed this after rebasing the sequence decoding patch, which
adds another type of protocol message with the transactional vs.
non-transactional behavior, similar to "logical messages" except that in
this case the worker does not ignore that.

Also, I think get_transaction_apply_action() would deserve better
comments explaining how/why it makes the decisions.
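
For instance, something like this (a rough sketch only; the TRANS_* values and
flags are the ones quoted above, and the remaining branches of the function are
elided):

    /* Parallel apply workers always apply the changes they receive. */
    if (am_parallel_apply_worker())
        return TRANS_PARALLEL_APPLY;

    /*
     * Anything that is not part of a streamed transaction (including
     * "non-transactional" changes such as logical messages) is applied
     * directly by the leader.
     */
    if (!in_streamed_transaction)
        return TRANS_LEADER_APPLY;

    /* ... choose among the remaining streaming actions ... */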


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perform streaming logical transactions by background workers and parallel apply

From: Kyotaro Horiguchi
At Tue, 10 Jan 2023 12:01:43 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Tue, Jan 10, 2023 at 11:16 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Although I don't see a technical difference between the two, all the
> > other occurances including the just above (except test_shm_mq) use
> > "could not". A faint memory in my non-durable memory tells me that we
> > have a policy that we use "can/could not" than "unable".
> >
> 
> Right, it is mentioned in docs [1] (see section "Tricky Words to Avoid").

Thanks for confirmation.

> Can you please start a new thread and post these changes as we are
> proposing to change existing message as well?

All right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Here are some review comments for v81-0001.

======

Commit Message

1.

Additionally, update the leader_pid column in pg_stat_activity as well to
display the PID of the leader apply worker for parallel apply workers.

~

Probably it should not say both "Additionally" and "as well" in the
same sentence.

======

src/backend/replication/logical/launcher.c

2.

 /*
+ * Return the pid of the leader apply worker if the given pid is the pid of a
+ * parallel apply worker, otherwise return InvalidPid.
+ */
+pid_t
+GetLeaderApplyWorkerPid(pid_t pid)
+{
+ int leader_pid = InvalidPid;
+ int i;
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+ for (i = 0; i < max_logical_replication_workers; i++)
+ {
+ LogicalRepWorker *w = &LogicalRepCtx->workers[i];
+
+ if (isParallelApplyWorker(w) && w->proc && pid == w->proc->pid)
+ {
+ leader_pid = w->leader_pid;
+ break;
+ }
+ }
+
+ LWLockRelease(LogicalRepWorkerLock);
+
+ return leader_pid;
+}

2a.
IIUC the IsParallelApplyWorker macro does nothing except check that
the leader_pid is not InvalidPid anyway, so AFAIK this algorithm does
not benefit from using this macro because we will want to return
InvalidPid anyway if the given pid matches.

So the inner condition can just say:

if (w->proc && w->proc->pid == pid)
{
leader_pid = w->leader_pid;
break;
}

~

2b.
A possible alternative comment.

BEFORE
Return the pid of the leader apply worker if the given pid is the pid
of a parallel apply worker, otherwise return InvalidPid.


AFTER
If the given pid has a leader apply worker then return the leader pid,
otherwise, return InvalidPid.

======

src/backend/utils/adt/pgstatfuncs.c

3.

@@ -434,6 +435,16 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
  values[28] = Int32GetDatum(leader->pid);
  nulls[28] = false;
  }
+ else
+ {
+ int leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
+
+ if (leader_pid != InvalidPid)
+ {
+ values[28] = Int32GetDatum(leader_pid);
+ nulls[28] = false;
+ }
+

3a.
There is an existing comment preceding this if/else but it refers only
to leaders of parallel groups. Should that comment be updated to
mention the leader apply worker too?

~

3b.
It may be unrelated to this patch, but it seems strange to me that the
nulls[28]/values[28] assignments are done where they are. Every other
nulls/values assignment of this function here is pretty much in the
correct numerical order except this one, so IMO this code ought to be
relocated to later in this same function.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



On Sun, Jan 15, 2023 at 10:39 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I think there's a bug in how get_transaction_apply_action() interacts
> with handle_streamed_transaction() to decide whether the transaction is
> streamed or not. Originally, the code was simply:
>
>     /* not in streaming mode */
>     if (!in_streamed_transaction)
>         return false;
>
> But now this decision was moved to get_transaction_apply_action(), which
> does this:
>
>     if (am_parallel_apply_worker())
>     {
>         return TRANS_PARALLEL_APPLY;
>     }
>     else if (in_remote_transaction)
>     {
>         return TRANS_LEADER_APPLY;
>     }
>
> and handle_streamed_transaction() then uses the result like this:
>
>     /* not in streaming mode */
>     if (apply_action == TRANS_LEADER_APPLY)
>         return false;
>
> Notice this is not equal to the original behavior, because the two flags
> (in_remote_transaction and in_streamed_transaction) are not inverse.
> That is,
>
>    in_remote_transaction=false
>
> does not imply we're processing streamed transaction. It's allowed both
> flags are false, i.e. a change may be "non-transactional" and not
> streamed, though the only example of such thing in the protocol are
> logical messages. Which are however ignored in the apply worker, so I'm
> not surprised no existing test failed on this.
>

Right, this is the reason we didn't catch it in our testing.

> So I think get_transaction_apply_action() should do this:
>
>     if (am_parallel_apply_worker())
>     {
>         return TRANS_PARALLEL_APPLY;
>     }
>     else if (!in_streamed_transaction)
>     {
>         return TRANS_LEADER_APPLY;
>     }
>

Yeah, something like this would work but some of the callers other
than handle_streamed_transaction() also need to be changed. See
attached.

> FWIW I've noticed this after rebasing the sequence decoding patch, which
> adds another type of protocol message with the transactional vs.
> non-transactional behavior, similar to "logical messages" except that in
> this case the worker does not ignore that.
>
> Also, I think get_transaction_apply_action() would deserve better
> comments explaining how/why it makes the decisions.
>

Okay, I have added the comments in get_transaction_apply_action() and
updated the comments to refer to the enum TransApplyAction where all
the actions are explained.

-- 
With Regards,
Amit Kapila.

On Mon, Jan 16, 2023 at 10:24 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 2.
>
>  /*
> + * Return the pid of the leader apply worker if the given pid is the pid of a
> + * parallel apply worker, otherwise return InvalidPid.
> + */
> +pid_t
> +GetLeaderApplyWorkerPid(pid_t pid)
> +{
> + int leader_pid = InvalidPid;
> + int i;
> +
> + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> +
> + for (i = 0; i < max_logical_replication_workers; i++)
> + {
> + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> +
> + if (isParallelApplyWorker(w) && w->proc && pid == w->proc->pid)
> + {
> + leader_pid = w->leader_pid;
> + break;
> + }
> + }
> +
> + LWLockRelease(LogicalRepWorkerLock);
> +
> + return leader_pid;
> +}
>
> 2a.
> IIUC the IsParallelApplyWorker macro does nothing except check that
> the leader_pid is not InvalidPid anyway, so AFAIK this algorithm does
> not benefit from using this macro because we will want to return
> InvalidPid anyway if the given pid matches.
>
> So the inner condition can just say:
>
> if (w->proc && w->proc->pid == pid)
> {
> leader_pid = w->leader_pid;
> break;
> }
>

Yeah, this should also work but I feel the current one is explicit and
more clear.

> ~
>
> 2b.
> A possible alternative comment.
>
> BEFORE
> Return the pid of the leader apply worker if the given pid is the pid
> of a parallel apply worker, otherwise return InvalidPid.
>
>
> AFTER
> If the given pid has a leader apply worker then return the leader pid,
> otherwise, return InvalidPid.
>

I don't think that is an improvement.

> ======
>
> src/backend/utils/adt/pgstatfuncs.c
>
> 3.
>
> @@ -434,6 +435,16 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
>   values[28] = Int32GetDatum(leader->pid);
>   nulls[28] = false;
>   }
> + else
> + {
> + int leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> +
> + if (leader_pid != InvalidPid)
> + {
> + values[28] = Int32GetDatum(leader_pid);
> + nulls[28] = false;
> + }
> +
>
> 3a.
> There is an existing comment preceding this if/else but it refers only
> to leaders of parallel groups. Should that comment be updated to
> mention the leader apply worker too?
>

Yeah, we can slightly adjust the comments. How about something like the below:
index 415e711729..7eb668634a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -410,9 +410,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)

                        /*
                         * If a PGPROC entry was retrieved, display
wait events and lock
-                        * group leader information if any.  To avoid
extra overhead, no
-                        * extra lock is being held, so there is no guarantee of
-                        * consistency across multiple rows.
+                        * group leader or apply leader information if
any.  To avoid extra
+                        * overhead, no extra lock is being held, so
there is no guarantee
+                        * of consistency across multiple rows.
                         */
                        if (proc != NULL)
                        {
@@ -428,7 +428,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
                                /*
                                 * Show the leader only for active
parallel workers.  This
                                 * leaves the field as NULL for the
leader of a parallel
-                                * group.
+                                * group or the leader of a parallel apply.
                                 */
                                if (leader && leader->pid !=
beentry->st_procpid)


> ~
>
> 3b.
> It may be unrelated to this patch, but it seems strange to me that the
> nulls[28]/values[28] assignments are done where they are. Every other
> nulls/values assignment of this function here is pretty much in the
> correct numerical order except this one, so IMO this code ought to be
> relocated to later in this same function.
>

This is not related to the current patch but I see there is merit in
the current coding as it is better to retrieve all the fields of proc
together.

-- 
With Regards,
Amit Kapila.



Hi Amit,

Thanks for the patch, the changes seem reasonable to me and it does fix
the issue in the sequence decoding patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 10:24 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > 2.
> >
> >  /*
> > + * Return the pid of the leader apply worker if the given pid is the pid of a
> > + * parallel apply worker, otherwise return InvalidPid.
> > + */
> > +pid_t
> > +GetLeaderApplyWorkerPid(pid_t pid)
> > +{
> > + int leader_pid = InvalidPid;
> > + int i;
> > +
> > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > +
> > + for (i = 0; i < max_logical_replication_workers; i++)
> > + {
> > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > +
> > + if (isParallelApplyWorker(w) && w->proc && pid == w->proc->pid)
> > + {
> > + leader_pid = w->leader_pid;
> > + break;
> > + }
> > + }
> > +
> > + LWLockRelease(LogicalRepWorkerLock);
> > +
> > + return leader_pid;
> > +}
> >
> > 2a.
> > IIUC the IsParallelApplyWorker macro does nothing except check that
> > the leader_pid is not InvalidPid anyway, so AFAIK this algorithm does
> > not benefit from using this macro because we will want to return
> > InvalidPid anyway if the given pid matches.
> >
> > So the inner condition can just say:
> >
> > if (w->proc && w->proc->pid == pid)
> > {
> > leader_pid = w->leader_pid;
> > break;
> > }
> >
>
> Yeah, this should also work but I feel the current one is explicit and
> more clear.

OK.

But, I have one last comment about this function -- I saw there are
already other functions that iterate max_logical_replication_workers
like this looking for things:
- logicalrep_worker_find
- logicalrep_workers_find
- logicalrep_worker_launch
- logicalrep_sync_worker_count

So I felt this new function (currently called GetLeaderApplyWorkerPid)
ought to be named similarly to those ones. e.g. call it something like
 "logicalrep_worker_find_pa_leader_pid".

>
> > ~
> >
> > 2b.
> > A possible alternative comment.
> >
> > BEFORE
> > Return the pid of the leader apply worker if the given pid is the pid
> > of a parallel apply worker, otherwise return InvalidPid.
> >
> >
> > AFTER
> > If the given pid has a leader apply worker then return the leader pid,
> > otherwise, return InvalidPid.
> >
>
> I don't think that is an improvement.
>
> > ======
> >
> > src/backend/utils/adt/pgstatfuncs.c
> >
> > 3.
> >
> > @@ -434,6 +435,16 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> >   values[28] = Int32GetDatum(leader->pid);
> >   nulls[28] = false;
> >   }
> > + else
> > + {
> > + int leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> > +
> > + if (leader_pid != InvalidPid)
> > + {
> > + values[28] = Int32GetDatum(leader_pid);
> > + nulls[28] = false;
> > + }
> > +
> >
> > 3a.
> > There is an existing comment preceding this if/else but it refers only
> > to leaders of parallel groups. Should that comment be updated to
> > mention the leader apply worker too?
> >
>
> Yeah, we can slightly adjust the comments. How about something like the below:
> index 415e711729..7eb668634a 100644
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -410,9 +410,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
>
>                         /*
>                          * If a PGPROC entry was retrieved, display
> wait events and lock
> -                        * group leader information if any.  To avoid
> extra overhead, no
> -                        * extra lock is being held, so there is no guarantee of
> -                        * consistency across multiple rows.
> +                        * group leader or apply leader information if
> any.  To avoid extra
> +                        * overhead, no extra lock is being held, so
> there is no guarantee
> +                        * of consistency across multiple rows.
>                          */
>                         if (proc != NULL)
>                         {
> @@ -428,7 +428,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
>                                 /*
>                                  * Show the leader only for active
> parallel workers.  This
>                                  * leaves the field as NULL for the
> leader of a parallel
> -                                * group.
> +                                * group or the leader of a parallel apply.
>                                  */
>                                 if (leader && leader->pid !=
> beentry->st_procpid)
>

The updated comment LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



From: "houzj.fnst@fujitsu.com"

On Tuesday, January 17, 2023 5:43 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > >
> > > 2.
> > >
> > >  /*
> > > + * Return the pid of the leader apply worker if the given pid is
> > > +the pid of a
> > > + * parallel apply worker, otherwise return InvalidPid.
> > > + */
> > > +pid_t
> > > +GetLeaderApplyWorkerPid(pid_t pid)
> > > +{
> > > + int leader_pid = InvalidPid;
> > > + int i;
> > > +
> > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > +
> > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > +
> > > + if (isParallelApplyWorker(w) && w->proc && pid == w->proc->pid) {
> > > + leader_pid = w->leader_pid; break; } }
> > > +
> > > + LWLockRelease(LogicalRepWorkerLock);
> > > +
> > > + return leader_pid;
> > > +}
> > >
> > > 2a.
> > > IIUC the IsParallelApplyWorker macro does nothing except check that
> > > the leader_pid is not InvalidPid anyway, so AFAIK this algorithm
> > > does not benefit from using this macro because we will want to
> > > return InvalidPid anyway if the given pid matches.
> > >
> > > So the inner condition can just say:
> > >
> > > if (w->proc && w->proc->pid == pid)
> > > {
> > > leader_pid = w->leader_pid;
> > > break;
> > > }
> > >
> >
> > Yeah, this should also work but I feel the current one is explicit and
> > more clear.
> 
> OK.
> 
> But, I have one last comment about this function -- I saw there are already
> other functions that iterate max_logical_replication_workers like this looking
> for things:
> - logicalrep_worker_find
> - logicalrep_workers_find
> - logicalrep_worker_launch
> - logicalrep_sync_worker_count
> 
> So I felt this new function (currently called GetLeaderApplyWorkerPid) ought
> to be named similarly to those ones. e.g. call it something like
> "logicalrep_worker_find_pa_leader_pid".
> 

I am not sure we can use that name, because currently all the API
names in the launcher that are used by other modules (not related to
subscription) follow the AxxBxx style (see the functions in
logicallauncher.h). The logicalrep_worker_xxx style functions are
currently only declared in worker_internal.h.

Best regards,
Hou zj


From: Masahiko Sawada

On Mon, Jan 16, 2023 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 10:39 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > I think there's a bug in how get_transaction_apply_action() interacts
> > with handle_streamed_transaction() to decide whether the transaction is
> > streamed or not. Originally, the code was simply:
> >
> >     /* not in streaming mode */
> >     if (!in_streamed_transaction)
> >         return false;
> >
> > But now this decision was moved to get_transaction_apply_action(), which
> > does this:
> >
> >     if (am_parallel_apply_worker())
> >     {
> >         return TRANS_PARALLEL_APPLY;
> >     }
> >     else if (in_remote_transaction)
> >     {
> >         return TRANS_LEADER_APPLY;
> >     }
> >
> > and handle_streamed_transaction() then uses the result like this:
> >
> >     /* not in streaming mode */
> >     if (apply_action == TRANS_LEADER_APPLY)
> >         return false;
> >
> > Notice this is not equal to the original behavior, because the two flags
> > (in_remote_transaction and in_streamed_transaction) are not inverse.
> > That is,
> >
> >    in_remote_transaction=false
> >
> > does not imply we're processing streamed transaction. It's allowed both
> > flags are false, i.e. a change may be "non-transactional" and not
> > streamed, though the only example of such thing in the protocol are
> > logical messages. Which are however ignored in the apply worker, so I'm
> > not surprised no existing test failed on this.
> >
>
> Right, this is the reason we didn't catch it in our testing.
>
> > So I think get_transaction_apply_action() should do this:
> >
> >     if (am_parallel_apply_worker())
> >     {
> >         return TRANS_PARALLEL_APPLY;
> >     }
> >     else if (!in_streamed_transaction)
> >     {
> >         return TRANS_LEADER_APPLY;
> >     }
> >
>
> Yeah, something like this would work but some of the callers other
> than handle_streamed_transaction() also need to be changed. See
> attached.
>
> > FWIW I've noticed this after rebasing the sequence decoding patch, which
> > adds another type of protocol message with the transactional vs.
> > non-transactional behavior, similar to "logical messages" except that in
> > this case the worker does not ignore that.
> >
> > Also, I think get_transaction_apply_action() would deserve better
> > comments explaining how/why it makes the decisions.
> >
>
> Okay, I have added the comments in get_transaction_apply_action() and
> updated the comments to refer to the enum TransApplyAction where all
> the actions are explained.

Thank you for the patch.

@@ -1710,6 +1712,7 @@ apply_handle_stream_stop(StringInfo s)
        }

        in_streamed_transaction = false;
+       stream_xid = InvalidTransactionId;

We reset stream_xid also in stream_close_file() but probably it's no
longer necessary?

How about adding an assertion in apply_handle_stream_start() to make
sure the stream_xid is invalid?

---
It's not related to this issue but I realized that if the action
returned by get_transaction_apply_action() is not handled in the
switch statement, we do only Assert(false). Is it better to raise an
error like "unexpected apply action %d" just in case in order to
detect failure cases also in the production environment?
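
To sketch what I mean (this is only illustrative on my part -- I am
assuming the usual shape of the switch over TransApplyAction in the
apply handlers, not quoting the patch):

    switch (apply_action)
    {
        case TRANS_LEADER_APPLY:
            /* ... apply the change directly ... */
            break;

        /* ... other TransApplyAction cases ... */

        default:
            /* report unhandled values even in production builds */
            elog(ERROR, "unexpected apply action %d", (int) apply_action);
            break;
    }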


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Jan 17, 2023 at 8:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Okay, I have added the comments in get_transaction_apply_action() and
> > updated the comments to refer to the enum TransApplyAction where all
> > the actions are explained.
>
> Thank you for the patch.
>
> @@ -1710,6 +1712,7 @@ apply_handle_stream_stop(StringInfo s)
>         }
>
>         in_streamed_transaction = false;
> +       stream_xid = InvalidTransactionId;
>
> We reset stream_xid also in stream_close_file() but probably it's no
> longer necessary?
>

I think so.

> How about adding an assertion in apply_handle_stream_start() to make
> sure the stream_xid is invalid?
>

I think it would be better to add such an assert in
apply_handle_begin/apply_handle_begin_prepare because there won't be a
problem if we receive a stream_start message even when stream_xid is
valid. However, maybe it is better to add it in all three functions
(apply_handle_begin/apply_handle_begin_prepare/apply_handle_stream_start).
What do you think?
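
For example, each of those handlers could start with something like
the following (just a sketch; the exact placement would depend on the
surrounding code):

    /* There must not be an active streamed transaction at this point. */
    Assert(!TransactionIdIsValid(stream_xid));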

> ---
> It's not related to this issue but I realized that if the action
> returned by get_transaction_apply_action() is not handled in the
> switch statement, we do only Assert(false). Is it better to raise an
> error like "unexpected apply action %d" just in case in order to
> detect failure cases also in the production environment?
>

Yeah, that may be better. Shall we do that as part of this patch only
or as a separate patch?

-- 
With Regards,
Amit Kapila.



On Tue, Jan 17, 2023 at 1:21 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 17, 2023 5:43 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > > >
> > > > 2.
> > > >
> > > >  /*
> > > > + * Return the pid of the leader apply worker if the given pid is
> > > > +the pid of a
> > > > + * parallel apply worker, otherwise return InvalidPid.
> > > > + */
> > > > +pid_t
> > > > +GetLeaderApplyWorkerPid(pid_t pid)
> > > > +{
> > > > + int leader_pid = InvalidPid;
> > > > + int i;
> > > > +
> > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > +
> > > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > > +
> > > > + if (isParallelApplyWorker(w) && w->proc && pid == w->proc->pid) {
> > > > + leader_pid = w->leader_pid; break; } }
> > > > +
> > > > + LWLockRelease(LogicalRepWorkerLock);
> > > > +
> > > > + return leader_pid;
> > > > +}
> > > >
> > > > 2a.
> > > > IIUC the IsParallelApplyWorker macro does nothing except check that
> > > > the leader_pid is not InvalidPid anyway, so AFAIK this algorithm
> > > > does not benefit from using this macro because we will want to
> > > > return InvalidPid anyway if the given pid matches.
> > > >
> > > > So the inner condition can just say:
> > > >
> > > > if (w->proc && w->proc->pid == pid)
> > > > {
> > > > leader_pid = w->leader_pid;
> > > > break;
> > > > }
> > > >
> > >
> > > Yeah, this should also work but I feel the current one is explicit and
> > > more clear.
> >
> > OK.
> >
> > But, I have one last comment about this function -- I saw there are already
> > other functions that iterate max_logical_replication_workers like this looking
> > for things:
> > - logicalrep_worker_find
> > - logicalrep_workers_find
> > - logicalrep_worker_launch
> > - logicalrep_sync_worker_count
> >
> > So I felt this new function (currently called GetLeaderApplyWorkerPid) ought
> > to be named similarly to those ones. e.g. call it something like
> > "logicalrep_worker_find_pa_leader_pid".
> >
>
> I am not sure we can use the name, because currently all the API name in launcher that
> used by other module(not related to subscription) are like
> AxxBxx style(see the functions in logicallauncher.h).
> logicalrep_worker_xxx style functions are currently only declared in
> worker_internal.h.
>

OK. I didn't know there was another header convention that you were
following.  In that case, it is fine to leave the name as-is.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



From: "houzj.fnst@fujitsu.com"

On Tuesday, January 17, 2023 11:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> On Tue, Jan 17, 2023 at 1:21 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 17, 2023 5:43 AM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith
> > > > <smithpb2250@gmail.com>
> > > wrote:
> > > > >
> > > > > 2.
> > > > >
> > > > >  /*
> > > > > + * Return the pid of the leader apply worker if the given pid
> > > > > +is the pid of a
> > > > > + * parallel apply worker, otherwise return InvalidPid.
> > > > > + */
> > > > > +pid_t
> > > > > +GetLeaderApplyWorkerPid(pid_t pid) {  int leader_pid =
> > > > > +InvalidPid;  int i;
> > > > > +
> > > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > > +
> > > > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > > > +
> > > > > + if (isParallelApplyWorker(w) && w->proc && pid ==
> > > > > + w->proc->pid) { leader_pid = w->leader_pid; break; } }
> > > > > +
> > > > > + LWLockRelease(LogicalRepWorkerLock);
> > > > > +
> > > > > + return leader_pid;
> > > > > +}
> > > > >
> > > > > 2a.
> > > > > IIUC the IsParallelApplyWorker macro does nothing except check
> > > > > that the leader_pid is not InvalidPid anyway, so AFAIK this
> > > > > algorithm does not benefit from using this macro because we will
> > > > > want to return InvalidPid anyway if the given pid matches.
> > > > >
> > > > > So the inner condition can just say:
> > > > >
> > > > > if (w->proc && w->proc->pid == pid) { leader_pid =
> > > > > w->leader_pid; break; }
> > > > >
> > > >
> > > > Yeah, this should also work but I feel the current one is explicit
> > > > and more clear.
> > >
> > > OK.
> > >
> > > But, I have one last comment about this function -- I saw there are
> > > already other functions that iterate max_logical_replication_workers
> > > like this looking for things:
> > > - logicalrep_worker_find
> > > - logicalrep_workers_find
> > > - logicalrep_worker_launch
> > > - logicalrep_sync_worker_count
> > >
> > > So I felt this new function (currently called
> > > GetLeaderApplyWorkerPid) ought to be named similarly to those ones.
> > > e.g. call it something like "logicalrep_worker_find_pa_leader_pid".
> > >
> >
> > I am not sure we can use the name, because currently all the API name
> > in launcher that used by other module(not related to subscription) are
> > like AxxBxx style(see the functions in logicallauncher.h).
> > logicalrep_worker_xxx style functions are currently only declared in
> > worker_internal.h.
> >
> 
> OK. I didn't know there was another header convention that you were following.
> In that case, it is fine to leave the name as-is.

Thanks for confirming!

Attached is the new version 0001 patch, which addresses all other comments.

Best regards,
Hou zj

On Tue, Jan 17, 2023 at 2:37 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 17, 2023 11:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 1:21 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 17, 2023 5:43 AM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith
> > > > > <smithpb2250@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > >  /*
> > > > > > + * Return the pid of the leader apply worker if the given pid
> > > > > > +is the pid of a
> > > > > > + * parallel apply worker, otherwise return InvalidPid.
> > > > > > + */
> > > > > > +pid_t
> > > > > > +GetLeaderApplyWorkerPid(pid_t pid) {  int leader_pid =
> > > > > > +InvalidPid;  int i;
> > > > > > +
> > > > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > > > +
> > > > > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > > > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > > > > +
> > > > > > + if (isParallelApplyWorker(w) && w->proc && pid ==
> > > > > > + w->proc->pid) { leader_pid = w->leader_pid; break; } }
> > > > > > +
> > > > > > + LWLockRelease(LogicalRepWorkerLock);
> > > > > > +
> > > > > > + return leader_pid;
> > > > > > +}
> > > > > >
> > > > > > 2a.
> > > > > > IIUC the IsParallelApplyWorker macro does nothing except check
> > > > > > that the leader_pid is not InvalidPid anyway, so AFAIK this
> > > > > > algorithm does not benefit from using this macro because we will
> > > > > > want to return InvalidPid anyway if the given pid matches.
> > > > > >
> > > > > > So the inner condition can just say:
> > > > > >
> > > > > > if (w->proc && w->proc->pid == pid) { leader_pid =
> > > > > > w->leader_pid; break; }
> > > > > >
> > > > >
> > > > > Yeah, this should also work but I feel the current one is explicit
> > > > > and more clear.
> > > >
> > > > OK.
> > > >
> > > > But, I have one last comment about this function -- I saw there are
> > > > already other functions that iterate max_logical_replication_workers
> > > > like this looking for things:
> > > > - logicalrep_worker_find
> > > > - logicalrep_workers_find
> > > > - logicalrep_worker_launch
> > > > - logicalrep_sync_worker_count
> > > >
> > > > So I felt this new function (currently called
> > > > GetLeaderApplyWorkerPid) ought to be named similarly to those ones.
> > > > e.g. call it something like "logicalrep_worker_find_pa_leader_pid".
> > > >
> > >
> > > I am not sure we can use the name, because currently all the API name
> > > in launcher that used by other module(not related to subscription) are
> > > like AxxBxx style(see the functions in logicallauncher.h).
> > > logicalrep_worker_xxx style functions are currently only declared in
> > > worker_internal.h.
> > >
> >
> > OK. I didn't know there was another header convention that you were following.
> > In that case, it is fine to leave the name as-is.
>
> Thanks for confirming!
>
> Attach the new version 0001 patch which addressed all other comments.
>

OK. I checked the differences between patches v81-0001/v82-0001 and
found everything I was expecting to see.

I have no more review comments for v82-0001.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Jan 17, 2023 at 9:07 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 17, 2023 11:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 1:21 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 17, 2023 5:43 AM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith
> > > > > <smithpb2250@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > >  /*
> > > > > > + * Return the pid of the leader apply worker if the given pid
> > > > > > +is the pid of a
> > > > > > + * parallel apply worker, otherwise return InvalidPid.
> > > > > > + */
> > > > > > +pid_t
> > > > > > +GetLeaderApplyWorkerPid(pid_t pid) {  int leader_pid =
> > > > > > +InvalidPid;  int i;
> > > > > > +
> > > > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > > > +
> > > > > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > > > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > > > > +
> > > > > > + if (isParallelApplyWorker(w) && w->proc && pid ==
> > > > > > + w->proc->pid) { leader_pid = w->leader_pid; break; } }
> > > > > > +
> > > > > > + LWLockRelease(LogicalRepWorkerLock);
> > > > > > +
> > > > > > + return leader_pid;
> > > > > > +}
> > > > > >
> > > > > > 2a.
> > > > > > IIUC the IsParallelApplyWorker macro does nothing except check
> > > > > > that the leader_pid is not InvalidPid anyway, so AFAIK this
> > > > > > algorithm does not benefit from using this macro because we will
> > > > > > want to return InvalidPid anyway if the given pid matches.
> > > > > >
> > > > > > So the inner condition can just say:
> > > > > >
> > > > > > if (w->proc && w->proc->pid == pid) { leader_pid =
> > > > > > w->leader_pid; break; }
> > > > > >
> > > > >
> > > > > Yeah, this should also work but I feel the current one is explicit
> > > > > and more clear.
> > > >
> > > > OK.
> > > >
> > > > But, I have one last comment about this function -- I saw there are
> > > > already other functions that iterate max_logical_replication_workers
> > > > like this looking for things:
> > > > - logicalrep_worker_find
> > > > - logicalrep_workers_find
> > > > - logicalrep_worker_launch
> > > > - logicalrep_sync_worker_count
> > > >
> > > > So I felt this new function (currently called
> > > > GetLeaderApplyWorkerPid) ought to be named similarly to those ones.
> > > > e.g. call it something like "logicalrep_worker_find_pa_leader_pid".
> > > >
> > >
> > > I am not sure we can use the name, because currently all the API name
> > > in launcher that used by other module(not related to subscription) are
> > > like AxxBxx style(see the functions in logicallauncher.h).
> > > logicalrep_worker_xxx style functions are currently only declared in
> > > worker_internal.h.
> > >
> >
> > OK. I didn't know there was another header convention that you were following.
> > In that case, it is fine to leave the name as-is.
>
> Thanks for confirming!
>
> Attach the new version 0001 patch which addressed all other comments.
>
> Best regards,
> Hou zj

Hello Hou-san,

1. Do we need to extend the test cases to check the leader_pid column
in the pg_stat views?
2. Do we need to follow the naming convention for
'GetLeaderApplyWorkerPid', like the other functions in the same file
which start with 'logicalrep_'?

thanks
Shveta



On Tue, Jan 17, 2023 at 8:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 8:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Okay, I have added the comments in get_transaction_apply_action() and
> > > updated the comments to refer to the enum TransApplyAction where all
> > > the actions are explained.
> >
> > Thank you for the patch.
> >
> > @@ -1710,6 +1712,7 @@ apply_handle_stream_stop(StringInfo s)
> >         }
> >
> >         in_streamed_transaction = false;
> > +       stream_xid = InvalidTransactionId;
> >
> > We reset stream_xid also in stream_close_file() but probably it's no
> > longer necessary?
> >
>
> I think so.
>
> > How about adding an assertion in apply_handle_stream_start() to make
> > sure the stream_xid is invalid?
> >
>
> I think it would be better to add such an assert in
> apply_handle_begin/apply_handle_begin_prepare because there won't be a
> problem if we start_stream message even when stream_xid is valid.
> However, maybe it is better to add in all three functions
> (apply_handle_begin/apply_handle_begin_prepare/apply_handle_stream_start).
> What do you think?
>
> > ---
> > It's not related to this issue but I realized that if the action
> > returned by get_transaction_apply_action() is not handled in the
> > switch statement, we do only Assert(false). Is it better to raise an
> > error like "unexpected apply action %d" just in case in order to
> > detect failure cases also in the production environment?
> >
>
> Yeah, that may be better. Shall we do that as part of this patch only
> or as a separate patch?
>

Please find attached the updated patches to address the above
comments. I think we can combine and commit them as one patch as both
are related.

-- 
With Regards,
Amit Kapila.


From: Masahiko Sawada

On Tue, Jan 17, 2023 at 1:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 8:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 8:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > Okay, I have added the comments in get_transaction_apply_action() and
> > > > updated the comments to refer to the enum TransApplyAction where all
> > > > the actions are explained.
> > >
> > > Thank you for the patch.
> > >
> > > @@ -1710,6 +1712,7 @@ apply_handle_stream_stop(StringInfo s)
> > >         }
> > >
> > >         in_streamed_transaction = false;
> > > +       stream_xid = InvalidTransactionId;
> > >
> > > We reset stream_xid also in stream_close_file() but probably it's no
> > > longer necessary?
> > >
> >
> > I think so.
> >
> > > How about adding an assertion in apply_handle_stream_start() to make
> > > sure the stream_xid is invalid?
> > >
> >
> > I think it would be better to add such an assert in
> > apply_handle_begin/apply_handle_begin_prepare because there won't be a
> > problem if we start_stream message even when stream_xid is valid.
> > However, maybe it is better to add in all three functions
> > (apply_handle_begin/apply_handle_begin_prepare/apply_handle_stream_start).
> > What do you think?
> >
> > > ---
> > > It's not related to this issue but I realized that if the action
> > > returned by get_transaction_apply_action() is not handled in the
> > > switch statement, we do only Assert(false). Is it better to raise an
> > > error like "unexpected apply action %d" just in case in order to
> > > detect failure cases also in the production environment?
> > >
> >
> > Yeah, that may be better. Shall we do that as part of this patch only
> > or as a separate patch?
> >
>
> Please find attached the updated patches to address the above
> comments. I think we can combine and commit them as one patch as both
> are related.

Thank you for the patches! Looks good to me. And +1 to merge them.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



From: "houzj.fnst@fujitsu.com"

On Tuesday, January 17, 2023 12:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Jan 17, 2023 at 8:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 8:35 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > > Okay, I have added the comments in get_transaction_apply_action()
> > > > and updated the comments to refer to the enum TransApplyAction
> > > > where all the actions are explained.
> > >
> > > Thank you for the patch.
> > >
> > > @@ -1710,6 +1712,7 @@ apply_handle_stream_stop(StringInfo s)
> > >         }
> > >
> > >         in_streamed_transaction = false;
> > > +       stream_xid = InvalidTransactionId;
> > >
> > > We reset stream_xid also in stream_close_file() but probably it's no
> > > longer necessary?
> > >
> >
> > I think so.
> >
> > > How about adding an assertion in apply_handle_stream_start() to make
> > > sure the stream_xid is invalid?
> > >
> >
> > I think it would be better to add such an assert in
> > apply_handle_begin/apply_handle_begin_prepare because there won't be a
> > problem if we start_stream message even when stream_xid is valid.
> > However, maybe it is better to add in all three functions
> >
> (apply_handle_begin/apply_handle_begin_prepare/apply_handle_stream_star
> t).
> > What do you think?
> >
> > > ---
> > > It's not related to this issue but I realized that if the action
> > > returned by get_transaction_apply_action() is not handled in the
> > > switch statement, we do only Assert(false). Is it better to raise an
> > > error like "unexpected apply action %d" just in case in order to
> > > detect failure cases also in the production environment?
> > >
> >
> > Yeah, that may be better. Shall we do that as part of this patch only
> > or as a separate patch?
> >
> 
> Please find attached the updated patches to address the above comments. I
> think we can combine and commit them as one patch as both are related.

Thanks for fixing these.
I have confirmed that all regression tests passed after applying the patches.
And the patches look good to me.

Best regards,
Hou zj

From: Masahiko Sawada

On Tue, Jan 17, 2023 at 12:37 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 17, 2023 11:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 1:21 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 17, 2023 5:43 AM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith
> > > > > <smithpb2250@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > >  /*
> > > > > > + * Return the pid of the leader apply worker if the given pid
> > > > > > +is the pid of a
> > > > > > + * parallel apply worker, otherwise return InvalidPid.
> > > > > > + */
> > > > > > +pid_t
> > > > > > +GetLeaderApplyWorkerPid(pid_t pid) {  int leader_pid =
> > > > > > +InvalidPid;  int i;
> > > > > > +
> > > > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> > > > > > +
> > > > > > + for (i = 0; i < max_logical_replication_workers; i++) {
> > > > > > + LogicalRepWorker *w = &LogicalRepCtx->workers[i];
> > > > > > +
> > > > > > + if (isParallelApplyWorker(w) && w->proc && pid ==
> > > > > > + w->proc->pid) { leader_pid = w->leader_pid; break; } }
> > > > > > +
> > > > > > + LWLockRelease(LogicalRepWorkerLock);
> > > > > > +
> > > > > > + return leader_pid;
> > > > > > +}
> > > > > >
> > > > > > 2a.
> > > > > > IIUC the IsParallelApplyWorker macro does nothing except check
> > > > > > that the leader_pid is not InvalidPid anyway, so AFAIK this
> > > > > > algorithm does not benefit from using this macro because we will
> > > > > > want to return InvalidPid anyway if the given pid matches.
> > > > > >
> > > > > > So the inner condition can just say:
> > > > > >
> > > > > > if (w->proc && w->proc->pid == pid) { leader_pid =
> > > > > > w->leader_pid; break; }
> > > > > >
> > > > >
> > > > > Yeah, this should also work but I feel the current one is explicit
> > > > > and more clear.
> > > >
> > > > OK.
> > > >
> > > > But, I have one last comment about this function -- I saw there are
> > > > already other functions that iterate max_logical_replication_workers
> > > > like this looking for things:
> > > > - logicalrep_worker_find
> > > > - logicalrep_workers_find
> > > > - logicalrep_worker_launch
> > > > - logicalrep_sync_worker_count
> > > >
> > > > So I felt this new function (currently called
> > > > GetLeaderApplyWorkerPid) ought to be named similarly to those ones.
> > > > e.g. call it something like "logicalrep_worker_find_pa_leader_pid".
> > > >
> > >
> > > I am not sure we can use the name, because currently all the API name
> > > in launcher that used by other module(not related to subscription) are
> > > like AxxBxx style(see the functions in logicallauncher.h).
> > > logicalrep_worker_xxx style functions are currently only declared in
> > > worker_internal.h.
> > >
> >
> > OK. I didn't know there was another header convention that you were following.
> > In that case, it is fine to leave the name as-is.
>
> Thanks for confirming!
>
> Attach the new version 0001 patch which addressed all other comments.
>

Thank you for updating the patch. Here is one comment:

@@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)

                                /*
                                 * Show the leader only for active parallel workers.  This
-                                * leaves the field as NULL for the leader of a parallel
-                                * group.
+                                * leaves the field as NULL for the leader of a parallel group
+                                * or the leader of parallel apply workers.
                                 */
                                if (leader && leader->pid != beentry->st_procpid)
                                {
                                        values[28] = Int32GetDatum(leader->pid);
                                        nulls[28] = false;
                                }
+                               else
+                               {
+                                       int         leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
+
+                                       if (leader_pid != InvalidPid)
+                                       {
+                                               values[28] = Int32GetDatum(leader_pid);
+                                               nulls[28] = false;
+                                       }
+                               }
                        }

I'm slightly concerned that there could be overhead from executing
GetLeaderApplyWorkerPid() for every backend process except parallel
query workers. The number of such backends could be large, and
GetLeaderApplyWorkerPid() acquires an lwlock. For example, does it
make sense to check (st_backendType == B_BG_WORKER) before calling
GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
LogicalRepWorkerLock, which is not likely to be contended.
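
To illustrate the idea, the call could be guarded roughly like this (a
sketch against the code shown above, assuming st_backendType is usable
at this point):

    else if (beentry->st_backendType == B_BG_WORKER)
    {
        int         leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);

        if (leader_pid != InvalidPid)
        {
            values[28] = Int32GetDatum(leader_pid);
            nulls[28] = false;
        }
    }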

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



From: "houzj.fnst@fujitsu.com"

On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Tue, Jan 17, 2023 at 12:37 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > Attach the new version 0001 patch which addressed all other comments.
> >
> 
> Thank you for updating the patch. Here is one comment:
> 
> @@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> 
>                                 /*
>                                  * Show the leader only for active parallel
> workers.  This
> -                                * leaves the field as NULL for the
> leader of a parallel
> -                                * group.
> +                                * leaves the field as NULL for the
> leader of a parallel group
> +                                * or the leader of parallel apply workers.
>                                  */
>                                 if (leader && leader->pid !=
> beentry->st_procpid)
>                                 {
>                                         values[28] =
> Int32GetDatum(leader->pid);
>                                         nulls[28] = false;
>                                 }
> +                               else
> +                               {
> +                                       int
> leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> +
> +                                       if (leader_pid != InvalidPid)
> +                                       {
> +                                               values[28] =
> Int32GetDatum(leader_pid);
> +                                               nulls[28] = false;
> +                                       }
> +                               }
>                         }
> 
> I'm slightly concerned that there could be overhead of executing
> GetLeaderApplyWorkerPid () for every backend process except for parallel
> query workers. The number of such backends could be large and
> GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> sense to check (st_backendType == B_BG_WORKER) before calling
> GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> LogicalRepWorkerLock which is not likely to be contended.

Thanks for the comment; I think your suggestion makes sense.
I have added the check before getting the leader pid. Here is the new version of the patch.

Best regards,
Hou zj


From: "houzj.fnst@fujitsu.com"

On Tuesday, January 17, 2023 12:34 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Jan 17, 2023 at 9:07 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 17, 2023 11:32 AM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > > OK. I didn't know there was another header convention that you were
> > > following.
> > > In that case, it is fine to leave the name as-is.
> >
> > Thanks for confirming!
> >
> > Attach the new version 0001 patch which addressed all other comments.
> >
> > Best regards,
> > Hou zj
> 
> Hello Hou-san,
> 
> 1. Do we need to extend test-cases to review the leader_pid column in pg_stats
> tables?

Thanks for the comments.

We currently don't have any tests for the view, so I feel we can add
such tests later as a separate patch.

> 2. Do we need to follow the naming convention for
> 'GetLeaderApplyWorkerPid' like other functions in the same file which starts
> with 'logicalrep_'

We have agreed [1] to follow the naming convention of the functions in
logicallauncher.h, which are mainly used by other modules.

[1] https://www.postgresql.org/message-id/CAHut%2BPtgj%3DDY8F1cMBRUxsZtq2-faW%3D%3D5-dSuHSPJGx1a_vBFQ%40mail.gmail.com

Best regards,
Hou zj

From: Masahiko Sawada

On Tue, Jan 17, 2023 at 6:14 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:37 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > > Attach the new version 0001 patch which addressed all other comments.
> > >
> >
> > Thank you for updating the patch. Here is one comment:
> >
> > @@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> >
> >                                 /*
> >                                  * Show the leader only for active parallel
> > workers.  This
> > -                                * leaves the field as NULL for the
> > leader of a parallel
> > -                                * group.
> > +                                * leaves the field as NULL for the
> > leader of a parallel group
> > +                                * or the leader of parallel apply workers.
> >                                  */
> >                                 if (leader && leader->pid !=
> > beentry->st_procpid)
> >                                 {
> >                                         values[28] =
> > Int32GetDatum(leader->pid);
> >                                         nulls[28] = false;
> >                                 }
> > +                               else
> > +                               {
> > +                                       int
> > leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> > +
> > +                                       if (leader_pid != InvalidPid)
> > +                                       {
> > +                                               values[28] =
> > Int32GetDatum(leader_pid);
> > +                                               nulls[28] = false;
> > +                                       }
> > +                               }
> >                         }
> >
> > I'm slightly concerned that there could be overhead of executing
> > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > query workers. The number of such backends could be large and
> > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > sense to check (st_backendType == B_BG_WORKER) before calling
> > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > LogicalRepWorkerLock which is not likely to be contended.
>
> Thanks for the comment and I think your suggestion makes sense.
> I have added the check before getting the leader pid. Here is the new version patch.

Thank you for updating the patch. Looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Jan 17, 2023 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 6:14 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 12:37 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > I'm slightly concerned that there could be overhead of executing
> > > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > > query workers. The number of such backends could be large and
> > > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > > sense to check (st_backendType == B_BG_WORKER) before calling
> > > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > > LogicalRepWorkerLock which is not likely to be contended.
> >
> > Thanks for the comment and I think your suggestion makes sense.
> > I have added the check before getting the leader pid. Here is the new version patch.
>
> Thank you for updating the patch. Looks good to me.
>

Pushed.

-- 
With Regards,
Amit Kapila.



From: "wangw.fnst@fujitsu.com"

On Wed, Jan 18, 2023 12:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 17, 2023 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Tue, Jan 17, 2023 at 6:14 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 12:37 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > I'm slightly concerned that there could be overhead of executing
> > > > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > > > query workers. The number of such backends could be large and
> > > > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it
> make
> > > > sense to check (st_backendType == B_BG_WORKER) before calling
> > > > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > > > LogicalRepWorkerLock which is not likely to be contended.
> > >
> > > Thanks for the comment and I think your suggestion makes sense.
> > > I have added the check before getting the leader pid. Here is the new
> version patch.
> >
> > Thank you for updating the patch. Looks good to me.
> >
> 
> Pushed.

Rebased and attached the remaining patches for review.

Regards,
Wang Wei

On Fri, Jan 13, 2023 at 11:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for patch v79-0002.
>

So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.

> ======
>
> General
>
> 1.
>
> I saw that earlier in this thread Hou-san [1] and Amit [2] also seemed
> to say there is not much point for this patch.
>
> So I wanted to +1 that same opinion.
>
> I feel this patch just adds more complexity for almost no gain:
> - reducing the 'max_apply_workers_per_suibscription' seems not very
> common in the first place.
> - even when the GUC is reduced, at that point in time all the workers
> might be in use so there may be nothing that can be immediately done.
> - IIUC the excess workers (for a reduced GUC) are going to get freed
> naturally anyway over time as more transactions are completed so the
> pool size will reduce accordingly.
>

I am still not sure if it is worth pursuing this patch because of the
above reasons. I don't think it would be difficult to add this even at
a later point in time if we really see a use case for this.
Sawada-San, IIRC, you raised this point. What do you think?

The other point I am wondering is whether we can have a different way
to test partial serialization apart from introducing another developer
GUC (stream_serialize_threshold). One possibility could be that we can
have a subscription option (parallel_send_timeout or something like
that) with some default value (current_timeout used in the patch)
which will be used only when streaming = parallel. Users may want to
wait for more time before serialization starts depending on the
workload (say when resource usage is high on a subscriber-side
machine, or there are concurrent long-running transactions that can
block parallel apply for a bit longer time). I know with this as well
it may not be straightforward to test the functionality because we
can't be sure how many changes would be required for a timeout to
occur. This is just for brainstorming other options to test the
partial serialization functionality.

Thoughts?

-- 
With Regards,
Amit Kapila.



On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 11:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are some review comments for patch v79-0002.
> >
>
> So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.
>
> >
> > I feel this patch just adds more complexity for almost no gain:
> > - reducing the 'max_apply_workers_per_suibscription' seems not very
> > common in the first place.
> > - even when the GUC is reduced, at that point in time all the workers
> > might be in use so there may be nothing that can be immediately done.
> > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > naturally anyway over time as more transactions are completed so the
> > pool size will reduce accordingly.
> >
>
> I am still not sure if it is worth pursuing this patch because of the
> above reasons. I don't think it would be difficult to add this even at
> a later point in time if we really see a use case for this.
> Sawada-San, IIRC, you raised this point. What do you think?
>
> The other point I am wondering is whether we can have a different way
> to test partial serialization apart from introducing another developer
> GUC (stream_serialize_threshold). One possibility could be that we can
> have a subscription option (parallel_send_timeout or something like
> that) with some default value (current_timeout used in the patch)
> which will be used only when streaming = parallel. Users may want to
> wait for more time before serialization starts depending on the
> workload (say when resource usage is high on a subscriber-side
> machine, or there are concurrent long-running transactions that can
> block parallel apply for a bit longer time). I know with this as well
> it may not be straightforward to test the functionality because we
> can't be sure how many changes would be required for a timeout to
> occur. This is just for brainstorming other options to test the
> partial serialization functionality.
>

Apart from the above, we can also have a subscription option to
specify parallel_shm_queue_size (queue_size used to determine the
queue between the leader and parallel worker) and that can be used for
this purpose. Basically, configuring it to a smaller value can help in
reducing the test time, but it will still not eliminate the dependency
on how long we have to wait before switching to partial serialize
mode. I think this can be used in production as well to tune
the performance depending on workload.

Yet another way is to use the existing parameter logical_decoding_mode
[1]. If the value of logical_decoding_mode is 'immediate', then we can
immediately switch to partial serialize mode. This will eliminate the
dependency on timing. The one argument against using this is that it
won't be as clear as a separate parameter like
'stream_serialize_threshold' proposed by the patch but OTOH we already
have a few parameters that serve a different purpose when used on the
subscriber. For example, 'max_replication_slots' is used to define the
maximum number of replication slots on the publisher and the maximum
number of origins on subscribers. Similarly,
'wal_retrieve_retry_interval' is used for different purposes on
subscriber and standby nodes.

[1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html
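
As a rough sketch of that last idea (the GUC enum value and the
serialize helper named below are assumptions on my part, not code from
the patch), the leader could do something like this before pushing
changes into the shared-memory queue:

    /*
     * Hypothetical: if logical_decoding_mode is 'immediate' on the
     * subscriber, skip the shared-memory queue and switch straight to
     * partial serialization instead of waiting for a send timeout.
     */
    if (logical_decoding_mode == LOGICAL_DECODING_MODE_IMMEDIATE)
        pa_switch_to_partial_serialize(winfo, true);    /* assumed helper */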

-- 
With Regards,
Amit Kapila.



On Thu, Jan 19, 2023 at 11:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 11:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Here are some review comments for patch v79-0002.
> > >
> >
> > So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.
> >
> > >
> > > I feel this patch just adds more complexity for almost no gain:
> > > - reducing the 'max_apply_workers_per_suibscription' seems not very
> > > common in the first place.
> > > - even when the GUC is reduced, at that point in time all the workers
> > > might be in use so there may be nothing that can be immediately done.
> > > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > > naturally anyway over time as more transactions are completed so the
> > > pool size will reduce accordingly.
> > >
> >
> > I am still not sure if it is worth pursuing this patch because of the
> > above reasons. I don't think it would be difficult to add this even at
> > a later point in time if we really see a use case for this.
> > Sawada-San, IIRC, you raised this point. What do you think?
> >
> > The other point I am wondering is whether we can have a different way
> > to test partial serialization apart from introducing another developer
> > GUC (stream_serialize_threshold). One possibility could be that we can
> > have a subscription option (parallel_send_timeout or something like
> > that) with some default value (current_timeout used in the patch)
> > which will be used only when streaming = parallel. Users may want to
> > wait for more time before serialization starts depending on the
> > workload (say when resource usage is high on a subscriber-side
> > machine, or there are concurrent long-running transactions that can
> > block parallel apply for a bit longer time). I know with this as well
> > it may not be straightforward to test the functionality because we
> > can't be sure how many changes would be required for a timeout to
> > occur. This is just for brainstorming other options to test the
> > partial serialization functionality.
> >
>
> Apart from the above, we can also have a subscription option to
> specify parallel_shm_queue_size (queue_size used to determine the
> queue between the leader and parallel worker) and that can be used for
> this purpose. Basically, configuring it to a smaller value can help in
> reducing the test time but still, it will not eliminate the need for
> dependency on timing we have to wait before switching to partial
> serialize mode. I think this can be used in production as well to tune
> the performance depending on workload.
>
> Yet another way is to use the existing parameter logical_decode_mode
> [1]. If the value of logical_decoding_mode is 'immediate', then we can
> immediately switch to partial serialize mode. This will eliminate the
> dependency on timing. The one argument against using this is that it
> won't be as clear as a separate parameter like
> 'stream_serialize_threshold' proposed by the patch but OTOH we already
> have a few parameters that serve a different purpose when used on the
> subscriber. For example, 'max_replication_slots' is used to define the
> maximum number of replication slots on the publisher and the maximum
> number of origins on subscribers. Similarly,
> wal_retrieve_retry_interval' is used for different purposes on
> subscriber and standby nodes.
>
> [1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html
>
> --
> With Regards,
> Amit Kapila.

Hi Amit,

On rethinking the complete model, I feel that the name
logical_decoding_mode does not really describe modes of logical
decoding; I think we picked it based on logical_decoding_work_mem.
As per the current implementation, the parameter 'logical_decoding_mode'
tells what happens when the work memory used by logical decoding
reaches its limit. So it is in fact a 'logicalrep_workmem_vacate_mode'
or 'logicalrep_trans_eviction_mode': if it is set to immediate, that
means vacate the work memory (i.e. evict transactions) immediately,
and buffered means the reverse (i.e. keep buffering transactions until
we reach the limit). Now coming to subscribers, we can reuse the same
parameter. On the subscriber as well, the shared-memory queue could be
considered its workmem, and thus the name
'logicalrep_workmem_vacate_mode' might look more relevant.

On publisher:
logicalrep_workmem_vacate_mode=immediate, buffered.

On subscriber:
logicalrep_workmem_vacate_mode=stream_serialize  (or if we want to
keep immediate here too, that will also be fine).

Thoughts?
And I am assuming it is possible to change the GUC name before the
coming release. If not, please let me know and we can brainstorm other
ideas.

thanks
Shveta



On Thu, Jan 19, 2023 at 3:44 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 11:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 11:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > Here are some review comments for patch v79-0002.
> > > >
> > >
> > > So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.
> > >
> > > >
> > > > I feel this patch just adds more complexity for almost no gain:
> > > > - reducing the 'max_apply_workers_per_suibscription' seems not very
> > > > common in the first place.
> > > > - even when the GUC is reduced, at that point in time all the workers
> > > > might be in use so there may be nothing that can be immediately done.
> > > > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > > > naturally anyway over time as more transactions are completed so the
> > > > pool size will reduce accordingly.
> > > >
> > >
> > > I am still not sure if it is worth pursuing this patch because of the
> > > above reasons. I don't think it would be difficult to add this even at
> > > a later point in time if we really see a use case for this.
> > > Sawada-San, IIRC, you raised this point. What do you think?
> > >
> > > The other point I am wondering is whether we can have a different way
> > > to test partial serialization apart from introducing another developer
> > > GUC (stream_serialize_threshold). One possibility could be that we can
> > > have a subscription option (parallel_send_timeout or something like
> > > that) with some default value (current_timeout used in the patch)
> > > which will be used only when streaming = parallel. Users may want to
> > > wait for more time before serialization starts depending on the
> > > workload (say when resource usage is high on a subscriber-side
> > > machine, or there are concurrent long-running transactions that can
> > > block parallel apply for a bit longer time). I know with this as well
> > > it may not be straightforward to test the functionality because we
> > > can't be sure how many changes would be required for a timeout to
> > > occur. This is just for brainstorming other options to test the
> > > partial serialization functionality.
> > >
> >
> > Apart from the above, we can also have a subscription option to
> > specify parallel_shm_queue_size (queue_size used to determine the
> > queue between the leader and parallel worker) and that can be used for
> > this purpose. Basically, configuring it to a smaller value can help in
> > reducing the test time but still, it will not eliminate the need for
> > dependency on timing we have to wait before switching to partial
> > serialize mode. I think this can be used in production as well to tune
> > the performance depending on workload.
> >
> > Yet another way is to use the existing parameter logical_decoding_mode
> > [1]. If the value of logical_decoding_mode is 'immediate', then we can
> > immediately switch to partial serialize mode. This will eliminate the
> > dependency on timing. The one argument against using this is that it
> > won't be as clear as a separate parameter like
> > 'stream_serialize_threshold' proposed by the patch but OTOH we already
> > have a few parameters that serve a different purpose when used on the
> > subscriber. For example, 'max_replication_slots' is used to define the
> > maximum number of replication slots on the publisher and the maximum
> > number of origins on subscribers. Similarly,
> > wal_retrieve_retry_interval' is used for different purposes on
> > subscriber and standby nodes.
> >
> > [1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html
> >
> > --
> > With Regards,
> > Amit Kapila.
>
> Hi Amit,
>
> On rethinking the complete model, what I feel is that the name
> logical_decoding_mode is not something which defines modes of logical
> decoding. We, I think, picked it based on logical_decoding_work_mem.
> As per current implementation, the parameter 'logical_decoding_mode'
> tells what happens when work-memory used by logical decoding reaches
> its limit. So it is in fact 'logicalrep_workmem_vacate_mode' or

Minor correction in what I said earlier:
As per current implementation, the parameter 'logical_decoding_mode'
more or less tells how to deal with workmem, i.e. whether to keep
buffering txns in it until it reaches its limit or to vacate it immediately.

> 'logicalrep_trans_eviction_mode'. So if it is set to immediate, it
> means vacate the work-memory immediately or evict transactions
> immediately. And buffered means the reverse (i.e. keep on buffering
> transactions until we reach a limit). Now coming to subscribers, we
> can reuse the same parameter. On subscriber as well, shared-memory
> queue could be considered as its workmem and thus the name
> 'logicalrep_workmem_vacate_mode' might look more relevant.
>
> On publisher:
> logicalrep_workmem_vacate_mode=immediate, buffered.
>
> On subscriber:
> logicalrep_workmem_vacate_mode=stream_serialize  (or if we want to
> keep immediate here too, that will also be fine).
>
> Thoughts?
> And I am assuming it is possible to change the GUC name before the
> coming release. If not, please let me know, we can brainstorm other
> ideas.
>
> thanks
> Shveta

thanks
Shveta



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Thu, Jan 19, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 11:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Here are some review comments for patch v79-0002.
> > >
> >
> > So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed.
> >
> > >
> > > I feel this patch just adds more complexity for almost no gain:
> > > - reducing the 'max_apply_workers_per_subscription' seems not very
> > > common in the first place.
> > > - even when the GUC is reduced, at that point in time all the workers
> > > might be in use so there may be nothing that can be immediately done.
> > > - IIUC the excess workers (for a reduced GUC) are going to get freed
> > > naturally anyway over time as more transactions are completed so the
> > > pool size will reduce accordingly.
> > >
> >
> > I am still not sure if it is worth pursuing this patch because of the
> > above reasons. I don't think it would be difficult to add this even at
> > a later point in time if we really see a use case for this.
> > Sawada-San, IIRC, you raised this point. What do you think?
> >
> > The other point I am wondering is whether we can have a different way
> > to test partial serialization apart from introducing another developer
> > GUC (stream_serialize_threshold). One possibility could be that we can
> > have a subscription option (parallel_send_timeout or something like
> > that) with some default value (current_timeout used in the patch)
> > which will be used only when streaming = parallel. Users may want to
> > wait for more time before serialization starts depending on the
> > workload (say when resource usage is high on a subscriber-side
> > machine, or there are concurrent long-running transactions that can
> > block parallel apply for a bit longer time). I know with this as well
> > it may not be straightforward to test the functionality because we
> > can't be sure how many changes would be required for a timeout to
> > occur. This is just for brainstorming other options to test the
> > partial serialization functionality.

I can see the parallel_send_timeout idea could be somewhat useful, but I'm
concerned about whether users can tune this value properly. It's likely to indicate
something abnormal or locking issues if LA waits to write data to the
queue for more than 10s. Also, I think it doesn't make sense to allow
users to set this timeout to a very low value. If switching to partial
serialization mode early is useful in some cases, I think it's better
to provide it as a new mode, such as streaming = 'parallel-file' etc.
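
Just to illustrate the shape of that idea, such a hypothetical mode could
be requested like this (the 'parallel-file' value is only a suggestion in
this email and does not exist; the connection string and names are made up):

```
-- Hypothetical: 'parallel-file' is not an existing streaming value; it is
-- shown only to illustrate a dedicated mode for early serialization.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = 'parallel-file');
```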

>
> Apart from the above, we can also have a subscription option to
> specify parallel_shm_queue_size (queue_size used to determine the
> queue between the leader and parallel worker) and that can be used for
> this purpose. Basically, configuring it to a smaller value can help in
> reducing the test time but still, it will not eliminate the need for
> dependency on timing we have to wait before switching to partial
> serialize mode. I think this can be used in production as well to tune
> the performance depending on workload.

A parameter for the queue size is interesting but I agree it will not
eliminate the need for dependency on timing.
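
To make the queue-size idea concrete, it might look something like this
(the option name and unit are purely illustrative; no such subscription
option exists today):

```
-- Hypothetical subscription option: a small leader/worker queue makes it
-- easier to hit the switch to partial serialization while testing.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel, parallel_shm_queue_size = '64kB');
```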

>
> Yet another way is to use the existing parameter logical_decoding_mode
> [1]. If the value of logical_decoding_mode is 'immediate', then we can
> immediately switch to partial serialize mode. This will eliminate the
> dependency on timing. The one argument against using this is that it
> won't be as clear as a separate parameter like
> 'stream_serialize_threshold' proposed by the patch but OTOH we already
> have a few parameters that serve a different purpose when used on the
> subscriber. For example, 'max_replication_slots' is used to define the
> maximum number of replication slots on the publisher and the maximum
> number of origins on subscribers. Similarly,
> wal_retrieve_retry_interval' is used for different purposes on
> subscriber and standby nodes.

Using the existing parameter makes sense to me. But if we use
logical_decoding_mode also on the subscriber, as Shveta Malik also
suggested, probably it's better to rename it so as not to confuse. For
example, logical_replication_mode or something.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 20, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > Yet another way is to use the existing parameter logical_decoding_mode
> > [1]. If the value of logical_decoding_mode is 'immediate', then we can
> > immediately switch to partial serialize mode. This will eliminate the
> > dependency on timing. The one argument against using this is that it
> > won't be as clear as a separate parameter like
> > 'stream_serialize_threshold' proposed by the patch but OTOH we already
> > have a few parameters that serve a different purpose when used on the
> > subscriber. For example, 'max_replication_slots' is used to define the
> > maximum number of replication slots on the publisher and the maximum
> > number of origins on subscribers. Similarly,
> > wal_retrieve_retry_interval' is used for different purposes on
> > subscriber and standby nodes.
>
> Using the existing parameter makes sense to me. But if we use
> logical_decoding_mode also on the subscriber, as Shveta Malik also
> suggested, probably it's better to rename it so as not to confuse. For
> example, logical_replication_mode or something.
>

+1. Among the options discussed, this sounds better.

-- 
With Regards,
Amit Kapila.



On Mon, Jan 23, 2023 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > Yet another way is to use the existing parameter logical_decoding_mode
> > > [1]. If the value of logical_decoding_mode is 'immediate', then we can
> > > immediately switch to partial serialize mode. This will eliminate the
> > > dependency on timing. The one argument against using this is that it
> > > won't be as clear as a separate parameter like
> > > 'stream_serialize_threshold' proposed by the patch but OTOH we already
> > > have a few parameters that serve a different purpose when used on the
> > > subscriber. For example, 'max_replication_slots' is used to define the
> > > maximum number of replication slots on the publisher and the maximum
> > > number of origins on subscribers. Similarly,
> > > wal_retrieve_retry_interval' is used for different purposes on
> > > subscriber and standby nodes.
> >
> > Using the existing parameter makes sense to me. But if we use
> > logical_decoding_mode also on the subscriber, as Shveta Malik also
> > suggested, probably it's better to rename it so as not to confuse. For
> > example, logical_replication_mode or something.
> >
>
> +1. Among the options discussed, this sounds better.

Yeah, this looks like the better option, with the parameter name
'logical_replication_mode'.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 23, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Jan 20, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > >
> > > Yet another way is to use the existing parameter logical_decoding_mode
> > > [1]. If the value of logical_decoding_mode is 'immediate', then we
> > > can immediately switch to partial serialize mode. This will
> > > eliminate the dependency on timing. The one argument against using
> > > this is that it won't be as clear as a separate parameter like
> > > 'stream_serialize_threshold' proposed by the patch but OTOH we
> > > already have a few parameters that serve a different purpose when
> > > used on the subscriber. For example, 'max_replication_slots' is used
> > > to define the maximum number of replication slots on the publisher
> > > and the maximum number of origins on subscribers. Similarly,
> > > wal_retrieve_retry_interval' is used for different purposes on
> > > subscriber and standby nodes.
> >
> > Using the existing parameter makes sense to me. But if we use
> > logical_decoding_mode also on the subscriber, as Shveta Malik also
> > suggested, probably it's better to rename it so as not to confuse. For
> > example, logical_replication_mode or something.
> >
> 
> +1. Among the options discussed, this sounds better.

OK, here is a patch set which does the same.
The first patch only renames the GUC, and the second patch uses
the GUC to test the partial serialization.
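
For reference, the renamed option is used the same way as the old one; for
example, on the publisher (a sketch using the proposed name and the
existing buffered/immediate values):

```
-- On the publisher: decode each change immediately (stream or serialize
-- it) instead of waiting until logical_decoding_work_mem is reached.
ALTER SYSTEM SET logical_replication_mode = immediate;
SELECT pg_reload_conf();
```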

Best Regards,
Hou zj


Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thank you for updating the patch! Followings are my comments.

1. guc_tables.c

```
 static const struct config_enum_entry logical_decoding_mode_options[] = {
-       {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
-       {"immediate", LOGICAL_DECODING_MODE_IMMEDIATE, false},
+       {"buffered", LOGICAL_REP_MODE_BUFFERED, false},
+       {"immediate", LOGICAL_REP_MODE_IMMEDIATE, false},
        {NULL, 0, false}
 };
```

This struct should be also modified.

2. guc_tables.c


```
-               {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
+               {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
                        gettext_noop("Allows streaming or serializing each change in logical decoding."),
                        NULL,
```

I felt the description seems not to be suitable for current behavior.
A short description should be like "Sets a behavior of logical replication", and
further descriptions can be added in the long description.

3. config.sgml

```
       <para>
        This parameter is intended to be used to test logical decoding and
        replication of large transactions for which otherwise we need to
        generate the changes till <varname>logical_decoding_work_mem</varname>
        is reached.
       </para>
```

I understood that this part described the usage of the parameter. How about adding
a statement like:

" Moreover, this can be also used to test the message passing between the leader
and parallel apply workers."

4. 015_stream.pl

```
+# Ensure that the messages are serialized.
```

In other parts "changes" are used instead of "messages". Can you change the word?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Here are my review comments for patch v86-0001.

======
General

1.

IIUC the GUC name was made generic 'logical_replication_mode' so that
multiple developer GUCs are not needed later.

But IMO those current option values (buffered/immediate) for that GUC
are maybe a bit too generic. Perhaps in future, we might want more
granular control than that allows. e.g. I can imagine there might be
multiple different meanings for what "buffered" means. If there is any
chance of the generic values being problematic later then maybe they
should be made more specific up-front.

e.g. maybe like:
logical_replication_mode = buffered_decoding
logical_replication_mode = immediate_decoding

Thoughts?

======
Commit message

2.
Since we may extend the developer option logical_decoding_mode to to test the
parallel apply of large transaction on subscriber, rename this option to
logical_replication_mode to make it easier to understand.

~

2a
typo "to to"

typo "large transaction on subscriber" --> "large transactions on the
subscriber"

~

2b.
IMO better to rephrase the whole paragraph like shown below.

SUGGESTION

Rename the developer option 'logical_decoding_mode' to the more
flexible name 'logical_replication_mode' because doing so will make it
easier to extend this option in future to help test other areas of
logical replication.

======
doc/src/sgml/config.sgml

3.
Allows streaming or serializing changes immediately in logical
decoding. The allowed values of logical_replication_mode are buffered
and immediate. When set to immediate, stream each change if streaming
option (see optional parameters set by CREATE SUBSCRIPTION) is
enabled, otherwise, serialize each change. When set to buffered, which
is the default, decoding will stream or serialize changes when
logical_decoding_work_mem is reached.

~

IMO it's more clear to say the default when the options are first
mentioned. So I suggested removing the "which is the default" part,
and instead saying:

BEFORE
The allowed values of logical_replication_mode are buffered and immediate.

AFTER
The allowed values of logical_replication_mode are buffered and
immediate. The default is buffered.

======
src/backend/utils/misc/guc_tables.c

4.
@@ -396,8 +396,8 @@ static const struct config_enum_entry
ssl_protocol_versions_info[] = {
 };

 static const struct config_enum_entry logical_decoding_mode_options[] = {
- {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
- {"immediate", LOGICAL_DECODING_MODE_IMMEDIATE, false},
+ {"buffered", LOGICAL_REP_MODE_BUFFERED, false},
+ {"immediate", LOGICAL_REP_MODE_IMMEDIATE, false},
  {NULL, 0, false}
 };

I noticed this array is still called "logical_decoding_mode_options".
Was that deliberate?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Jan 24, 2023 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 1.
>
> IIUC the GUC name was made generic 'logical_replication_mode' so that
> multiple developer GUCs are not needed later.
>
> But IMO those current option values (buffered/immediate) for that GUC
> are maybe a bit too generic. Perhaps in future, we might want more
> granular control than that allows. e.g. I can imagine there might be
> multiple different meanings for what "buffered" means. If there is any
> chance of the generic values being problematic later then maybe they
> should be made more specific up-front.
>
> e.g. maybe like:
> logical_replication_mode = buffered_decoding
> logical_replication_mode = immediate_decoding
>

For now, it seems the meaning of buffered/immediate suits our
debugging and test needs for publisher/subscriber. This is somewhat
explained in Shveta's email [1]. I also think in the future this
parameter could be extended for a different purpose but maybe it would
be better to invent some new values at that time as things would be
more clear. We could do what you are suggesting or in fact even use
different values for publisher and subscriber but not really sure
whether that will simplify the usage.

[1] - https://www.postgresql.org/message-id/CAJpy0uDzddK_ZUsB2qBJUbW_ZODYGoUHTaS5pVcYE2tzATCVXQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Here are some review comments for v86-0002

======
Commit message

1.
Use the use the existing developer option logical_replication_mode to test the
parallel apply of large transaction on subscriber.

~

Typo “Use the use the”

SUGGESTION (rewritten)
Give additional functionality to the existing developer option
'logical_replication_mode' to help test parallel apply of large
transactions on the subscriber.

~~~

2.
Maybe that commit message should also mention the extra TAP tests that
have been added to exercise the serialization part of the parallel apply?

BTW – I can see the TAP tests are testing full serialization (when the
GUC is 'immediate') but I am not sure how "partial" serialization
(when it has to switch halfway from shmem to files) is being tested.

======
doc/src/sgml/config.sgml

3.
Allows streaming or serializing changes immediately in logical
decoding. The allowed values of logical_replication_mode are buffered
and immediate. When set to immediate, stream each change if streaming
option (see optional parameters set by CREATE SUBSCRIPTION) is
enabled, otherwise, serialize each change. When set to buffered, which
is the default, decoding will stream or serialize changes when
logical_decoding_work_mem is reached.
On the subscriber side, if streaming option is set to parallel, this
parameter also allows the leader apply worker to send changes to the
shared memory queue or to serialize changes. When set to buffered, the
leader sends changes to parallel apply workers via shared memory
queue. When set to immediate, the leader serializes all changes to
files and notifies the parallel apply workers to read and apply them
at the end of the transaction.

~

Because now this same developer GUC affects both the publisher side
and the subscriber side differently IMO this whole description should
be re-structured accordingly.

SUGGESTION (something like)

The allowed values of logical_replication_mode are buffered and
immediate. The default is buffered.

On the publisher side, ...

On the subscriber side, ...

~~~

4.
This parameter is intended to be used to test logical decoding and
replication of large transactions for which otherwise we need to
generate the changes till logical_decoding_work_mem is reached.

~

Maybe this paragraph needs rewording or moving. e.g. Isn't that
misleading now? Although this might be an explanation for the
publisher side, it does not seem relevant to the subscriber side's
behaviour.

======
.../replication/logical/applyparallelworker.c

5.
@ -1149,6 +1149,9 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size
nbytes, const void *data)
  Assert(!IsTransactionState());
  Assert(!winfo->serialize_changes);

+ if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE)
+ return false;
+

I felt that code should have some comment, even if it is just
something quite basic like "/* For developer testing */"

======
.../t/018_stream_subxact_abort.pl

6.
+# Clean up test data from the environment.
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
+$node_publisher->wait_for_catchup($appname);

Is it necessary to TRUNCATE the table here? If everything is working
shouldn't the data be rolled back anyway?

~~~

7.
+$node_publisher->safe_psql(
+ 'postgres', q{
+ BEGIN;
+ INSERT INTO test_tab_2 values(1);
+ SAVEPOINT sp;
+ INSERT INTO test_tab_2 values(1);
+ ROLLBACK TO sp;
+ COMMIT;
+ });

Perhaps this should insert 2 different values so that the verification
code can check that the correct value remains instead of just checking
COUNT(*)?
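
For example (the table contents, values, and expected result here are
illustrative only), something like:

```
-- Insert two distinct values so the surviving row can be verified,
-- not just the row count.
BEGIN;
INSERT INTO test_tab_2 VALUES (1);
SAVEPOINT sp;
INSERT INTO test_tab_2 VALUES (2);
ROLLBACK TO sp;
COMMIT;

-- On the subscriber, only the pre-savepoint value should remain.
SELECT * FROM test_tab_2;   -- expect the row with value 1
```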

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, January 24, 2023 3:19 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are some review comments for v86-0002
> 
> ======
> Commit message
> 
> 1.
> Use the use the existing developer option logical_replication_mode to test the
> parallel apply of large transaction on subscriber.
> 
> ~
> 
> Typo “Use the use the”
> 
> SUGGESTION (rewritten)
> Give additional functionality to the existing developer option
> 'logical_replication_mode' to help test parallel apply of large transactions on the
> subscriber.

Changed.

> ~~~
> 
> 2.
> Maybe that commit message should also mention the extra TAP tests that have
> been added to exercise the serialization part of the parallel apply?

Added.

> BTW – I can see the TAP tests are testing full serialization (when the GUC is
> 'immediate') but I am not sure how "partial" serialization (when it has to switch
> halfway from shmem to files) is being tested.

The new tests are intended to test most of the new code for partial
serialization by doing it from the beginning. Later, if required, we can add
different tests for it.
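
For context, the way these tests reach the partial serialize path is
roughly the following setup on the subscriber node (a sketch, not the
exact test script; the connection string and object names are made up):

```
-- On the subscriber node: make the leader apply worker serialize streamed
-- changes to files instead of sending them through the shared memory
-- queue (developer option from this patch set).
ALTER SYSTEM SET logical_replication_mode = immediate;
SELECT pg_reload_conf();

-- Parallel apply must be requested for this path to be reachable.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel);
```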

> 
> ======
> doc/src/sgml/config.sgml
> 
> 3.
> Allows streaming or serializing changes immediately in logical decoding. The
> allowed values of logical_replication_mode are buffered and immediate. When
> set to immediate, stream each change if streaming option (see optional
> parameters set by CREATE SUBSCRIPTION) is enabled, otherwise, serialize each
> change. When set to buffered, which is the default, decoding will stream or
> serialize changes when logical_decoding_work_mem is reached.
> On the subscriber side, if streaming option is set to parallel, this parameter also
> allows the leader apply worker to send changes to the shared memory queue or
> to serialize changes. When set to buffered, the leader sends changes to parallel
> apply workers via shared memory queue. When set to immediate, the leader
> serializes all changes to files and notifies the parallel apply workers to read and
> apply them at the end of the transaction.
> 
> ~
> 
> Because now this same developer GUC affects both the publisher side and the
> subscriber side differently IMO this whole description should be re-structured
> accordingly.
> 
> SUGGESTION (something like)
> 
> The allowed values of logical_replication_mode are buffered and immediate. The
> default is buffered.
> 
> On the publisher side, ...
> 
> On the subscriber side, ...

Changed.

> 
> ~~~
> 
> 4.
> This parameter is intended to be used to test logical decoding and replication of
> large transactions for which otherwise we need to generate the changes till
> logical_decoding_work_mem is reached.
> 
> ~
> 
> Maybe this paragraph needs rewording or moving. e.g. Isn't that misleading
> now? Although this might be an explanation for the publisher side, it does not
> seem relevant to the subscriber side's behaviour.

Adjusted the description here.

> 
> ======
> .../replication/logical/applyparallelworker.c
> 
> 5.
> @ -1149,6 +1149,9 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size
> nbytes, const void *data)
>   Assert(!IsTransactionState());
>   Assert(!winfo->serialize_changes);
> 
> + if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE) return
> + false;
> +
> 
> I felt that code should have some comment, even if it is just something quite
> basic like "/* For developer testing */"

Added.

> 
> ======
> .../t/018_stream_subxact_abort.pl
> 
> 6.
> +# Clean up test data from the environment.
> +$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
> +$node_publisher->wait_for_catchup($appname);
> 
> Is it necessary to TRUNCATE the table here? If everything is working shouldn't
> the data be rolled back anyway?

I think it's unnecessary, so removed.

> 
> ~~~
> 
> 7.
> +$node_publisher->safe_psql(
> + 'postgres', q{
> + BEGIN;
> + INSERT INTO test_tab_2 values(1);
> + SAVEPOINT sp;
> + INSERT INTO test_tab_2 values(1);
> + ROLLBACK TO sp;
> + COMMIT;
> + });
> 
> Perhaps this should insert 2 different values so that the verification code can
> check that the correct value remains instead of just checking COUNT(*)?

I think testing the count should be ok as the nearby testcases are
also checking the count.

Best regards,
Hou zj


Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, January 24, 2023 11:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

> 
> Here are my review comments for patch v86-0001.

Thanks for your comments.

> 
> 
> ======
> Commit message
> 
> 2.
> Since we may extend the developer option logical_decoding_mode to to test the
> parallel apply of large transaction on subscriber, rename this option to
> logical_replication_mode to make it easier to understand.
> 
> ~
> 
> 2a
> typo "to to"
> 
> typo "large transaction on subscriber" --> "large transactions on the subscriber"
> 
> ~
> 
> 2b.
> IMO better to rephrase the whole paragraph like shown below.
> 
> SUGGESTION
> 
> Rename the developer option 'logical_decoding_mode' to the more flexible
> name 'logical_replication_mode' because doing so will make it easier to extend
> this option in future to help test other areas of logical replication.

Changed.

> ======
> doc/src/sgml/config.sgml
> 
> 3.
> Allows streaming or serializing changes immediately in logical decoding. The
> allowed values of logical_replication_mode are buffered and immediate. When
> set to immediate, stream each change if streaming option (see optional
> parameters set by CREATE SUBSCRIPTION) is enabled, otherwise, serialize each
> change. When set to buffered, which is the default, decoding will stream or
> serialize changes when logical_decoding_work_mem is reached.
> 
> ~
> 
> IMO it's more clear to say the default when the options are first mentioned. So I
> suggested removing the "which is the default" part, and instead saying:
> 
> BEFORE
> The allowed values of logical_replication_mode are buffered and immediate.
> 
> AFTER
> The allowed values of logical_replication_mode are buffered and immediate. The
> default is buffered.

I included this change in the 0002 patch.

> ======
> src/backend/utils/misc/guc_tables.c
> 
> 4.
> @@ -396,8 +396,8 @@ static const struct config_enum_entry
> ssl_protocol_versions_info[] = {  };
> 
>  static const struct config_enum_entry logical_decoding_mode_options[] = {
> - {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
> - {"immediate", LOGICAL_DECODING_MODE_IMMEDIATE, false},
> + {"buffered", LOGICAL_REP_MODE_BUFFERED, false}, {"immediate",
> + LOGICAL_REP_MODE_IMMEDIATE, false},
>   {NULL, 0, false}
>  };
> 
> I noticed this array is still called "logical_decoding_mode_options".
> Was that deliberate?

No, I didn't notice this one. Changed.

Best Regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 23, 2023 8:34 PM Kuroda, Hayato wrote:
> 
> Followings are my comments.

Thanks for your comments.

> 
> 1. guc_tables.c
> 
> ```
>  static const struct config_enum_entry logical_decoding_mode_options[] = {
> -       {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
> -       {"immediate", LOGICAL_DECODING_MODE_IMMEDIATE, false},
> +       {"buffered", LOGICAL_REP_MODE_BUFFERED, false},
> +       {"immediate", LOGICAL_REP_MODE_IMMEDIATE, false},
>         {NULL, 0, false}
>  };
> ```
> 
> This struct should be also modified.

Modified.

> 
> 2. guc_tables.c
> 
> 
> ```
> -               {"logical_decoding_mode", PGC_USERSET,
> DEVELOPER_OPTIONS,
> +               {"logical_replication_mode", PGC_USERSET,
> + DEVELOPER_OPTIONS,
>                         gettext_noop("Allows streaming or serializing each
> change in logical decoding."),
>                         NULL,
> ```
> 
> I felt the description seems not to be suitable for current behavior.
> A short description should be like "Sets a behavior of logical replication", and
> further descriptions can be added in the long description.

I adjusted the description here.

> 3. config.sgml
> 
> ```
>        <para>
>         This parameter is intended to be used to test logical decoding and
>         replication of large transactions for which otherwise we need to
>         generate the changes till
> <varname>logical_decoding_work_mem</varname>
>         is reached.
>        </para>
> ```
> 
> I understood that this part described the usage of the parameter. How about
> adding a statement like:
> 
> " Moreover, this can be also used to test the message passing between the
> leader and parallel apply workers."

Added.

> 4. 015_stream.pl
> 
> ```
> +# Ensure that the messages are serialized.
> ```
> 
> In other parts "changes" are used instead of "messages". Can you change the
> word?

Changed.

Best Regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, January 24, 2023 8:47 PM Hou, Zhijie wrote:
> 
> On Tuesday, January 24, 2023 3:19 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > Here are some review comments for v86-0002
> >

Sorry, the patch set was somehow attached twice. Here is the correct new version
patch set which addressed all comments so far.

Best Regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

> Sorry, the patch set was somehow attached twice. Here is the correct new version
> patch set which addressed all comments so far.

Thank you for updating the patch! I confirmed that all of my comments
are addressed.

One comment:
In this test the rollback-prepared seems not to be executed.
This is because the serialization is finished while handling the PREPARE
message, and the final state of the transaction does not affect that, right?
I think it may be helpful to add a one-line comment.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Tue, Jan 24, 2023 at 11:49 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
...
>
> Sorry, the patch set was somehow attached twice. Here is the correct new version
> patch set which addressed all comments so far.
>

Here are my review comments for patch v87-0001.

======
src/backend/replication/logical/reorderbuffer.c

1.
@@ -210,7 +210,7 @@ int logical_decoding_work_mem;
 static const Size max_changes_in_memory = 4096; /* XXX for restore only */

 /* GUC variable */
-int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED;
+int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED;


I noticed that the comment /* GUC variable */ is currently only above
the logical_replication_mode, but actually logical_decoding_work_mem
is a GUC variable too. Maybe this should be rearranged somehow then
change the comment "GUC variable" -> "GUC variables"?

======
src/backend/utils/misc/guc_tables.c

@@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] =
  },

  {
- {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
+ {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
  gettext_noop("Allows streaming or serializing each change in logical
decoding."),
  NULL,
  GUC_NOT_IN_SAMPLE
  },
- &logical_decoding_mode,
- LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
+ &logical_replication_mode,
+ LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options,
  NULL, NULL, NULL
  },

That gettext_noop string seems incorrect. I think Kuroda-san
previously reported the same, but then you replied it has been fixed
already [1]

> I felt the description seems not to be suitable for current behavior.
> A short description should be like "Sets a behavior of logical replication", and
> further descriptions can be added in the long description.
I adjusted the description here.

But this doesn't look fixed to me. (??)

======
src/include/replication/reorderbuffer.h

3.
@@ -18,14 +18,14 @@
 #include "utils/timestamp.h"

 extern PGDLLIMPORT int logical_decoding_work_mem;
-extern PGDLLIMPORT int logical_decoding_mode;
+extern PGDLLIMPORT int logical_replication_mode;

Probably here should also be a comment to say "/* GUC variables */"

------
[1]
https://www.postgresql.org/message-id/OS0PR01MB5716AE9F095F9E7888987BC794C99%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Here are my review comments for patch v87-0002.

======
doc/src/sgml/config.sgml

1.
        <para>
-        Allows streaming or serializing changes immediately in
logical decoding.
         The allowed values of <varname>logical_replication_mode</varname> are
-        <literal>buffered</literal> and <literal>immediate</literal>. When set
-        to <literal>immediate</literal>, stream each change if
+        <literal>buffered</literal> and <literal>immediate</literal>.
The default
+        is <literal>buffered</literal>.
+       </para>

I didn't think it was necessary to say “of logical_replication_mode”.
IMO that much is already obvious because this is the first sentence of
the description for logical_replication_mode.

(see also review comment #4)

~~~

2.
+       <para>
+        On the publisher side, it allows streaming or serializing changes
+        immediately in logical decoding.  When set to
+        <literal>immediate</literal>, stream each change if
         <literal>streaming</literal> option (see optional parameters set by
         <link linkend="sql-createsubscription"><command>CREATE
SUBSCRIPTION</command></link>)
         is enabled, otherwise, serialize each change.  When set to
-        <literal>buffered</literal>, which is the default, decoding will stream
-        or serialize changes when <varname>logical_decoding_work_mem</varname>
-        is reached.
+        <literal>buffered</literal>, decoding will stream or serialize changes
+        when <varname>logical_decoding_work_mem</varname> is reached.
        </para>

2a.
"it allows" --> "logical_replication_mode allows"

2b.
"decoding" --> "the decoding"

~~~

3.
+       <para>
+        On the subscriber side, if <literal>streaming</literal> option is set
+        to <literal>parallel</literal>, this parameter also allows the leader
+        apply worker to send changes to the shared memory queue or to serialize
+        changes.  When set to <literal>buffered</literal>, the leader sends
+        changes to parallel apply workers via shared memory queue.  When set to
+        <literal>immediate</literal>, the leader serializes all changes to
+        files and notifies the parallel apply workers to read and apply them at
+        the end of the transaction.
+       </para>

"this parameter also allows" --> "logical_replication_mode also allows"

~~~

4.
        <para>
         This parameter is intended to be used to test logical decoding and
         replication of large transactions for which otherwise we need to
         generate the changes till <varname>logical_decoding_work_mem</varname>
-        is reached.
+        is reached. Moreover, this can also be used to test the transmission of
+        changes between the leader and parallel apply workers.
        </para>

"Moreover, this can also" --> "It can also"

I am wondering would this sentence be better put at the top of the GUC
description. So then the first paragraph becomes like this:


SUGGESTION (I've also added another sentence "The effect of...")

The allowed values are buffered and immediate. The default is
buffered. This parameter is intended to be used to test logical
decoding and replication of large transactions for which otherwise we
need to generate the changes till logical_decoding_work_mem is
reached. It can also be used to test the transmission of changes
between the leader and parallel apply workers. The effect of
logical_replication_mode is different for the publisher and
subscriber:

On the publisher side...

On the subscriber side...

======
.../replication/logical/applyparallelworker.c

5.
+ /*
+ * In immeidate mode, directly return false so that we can switch to
+ * PARTIAL_SERIALIZE mode and serialize remaining changes to files.
+ */
+ if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE)
+ return false;

Typo "immediate"

Also, I felt "directly" is not needed. "return false" and "directly
return false" is the same.

SUGGESTION
Using ‘immediate’ mode returns false to cause a switch to
PARTIAL_SERIALIZE mode so that the remaining changes will be
serialized.

======
src/backend/utils/misc/guc_tables.c

6.
  {
  {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
- gettext_noop("Allows streaming or serializing each change in logical
decoding."),
- NULL,
+ gettext_noop("Controls the behavior of logical replication publisher
and subscriber"),
+ gettext_noop("If set to immediate, on the publisher side, it "
+ "allows streaming or serializing each change in "
+ "logical decoding. On the subscriber side, in "
+ "parallel streaming mode, it allows the leader apply "
+ "worker to serialize changes to files and notifies "
+ "the parallel apply workers to read and apply them at "
+ "the end of the transaction."),
  GUC_NOT_IN_SAMPLE
  },

6a. short description

User PoV behaviour should be the same. Instead, maybe say "controls
the internal behavior" or something like that?

~

6b. long description

IMO the long description shouldn’t mention ‘immediate’ mode first as it does.

BEFORE
If set to immediate, on the publisher side, ...

AFTER
On the publisher side, ...

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Wed, Jan 25, 2023 at 3:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 1.
> @@ -210,7 +210,7 @@ int logical_decoding_work_mem;
>  static const Size max_changes_in_memory = 4096; /* XXX for restore only */
>
>  /* GUC variable */
> -int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED;
> +int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED;
>
>
> I noticed that the comment /* GUC variable */ is currently only above
> the logical_replication_mode, but actually logical_decoding_work_mem
> is a GUC variable too. Maybe this should be rearranged somehow then
> change the comment "GUC variable" -> "GUC variables"?
>

I think moving these variables together doesn't sound like a good idea
because the logical_decoding_work_mem variable is defined with other
related variables. Also, if we are doing the last comment, I think that
will obviate the need for this.

> ======
> src/backend/utils/misc/guc_tables.c
>
> @@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] =
>   },
>
>   {
> - {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> + {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
>   gettext_noop("Allows streaming or serializing each change in logical
> decoding."),
>   NULL,
>   GUC_NOT_IN_SAMPLE
>   },
> - &logical_decoding_mode,
> - LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
> + &logical_replication_mode,
> + LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options,
>   NULL, NULL, NULL
>   },
>
> That gettext_noop string seems incorrect. I think Kuroda-san
> previously reported the same, but then you replied it has been fixed
> already [1]
>
> > I felt the description seems not to be suitable for current behavior.
> > A short description should be like "Sets a behavior of logical replication", and
> > > further descriptions can be added in the long description.
> I adjusted the description here.
>
> But this doesn't look fixed to me. (??)
>

Okay, so, how about the following for the 0001 patch:
short desc: Controls when to replicate each change.
long desc: On the publisher, it allows streaming or serializing each
change in logical decoding.

Then we can extend it as follows for the 0002 patch:
Controls when to replicate or apply each change
On the publisher, it allows streaming or serializing each change in
logical decoding. On the subscriber, it allows serialization of all
changes to files and notifies the parallel apply workers to read and
apply them at the end of the transaction.

> ======
> src/include/replication/reorderbuffer.h
>
> 3.
> @@ -18,14 +18,14 @@
>  #include "utils/timestamp.h"
>
>  extern PGDLLIMPORT int logical_decoding_work_mem;
> -extern PGDLLIMPORT int logical_decoding_mode;
> +extern PGDLLIMPORT int logical_replication_mode;
>
> Probably here should also be a comment to say "/* GUC variables */"
>

Okay, we can do this.

-- 
With Regards,
Amit Kapila.



On Wed, Jan 25, 2023 at 10:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 25, 2023 at 3:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > 1.
> > @@ -210,7 +210,7 @@ int logical_decoding_work_mem;
> >  static const Size max_changes_in_memory = 4096; /* XXX for restore only */
> >
> >  /* GUC variable */
> > -int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED;
> > +int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED;
> >
> >
> > I noticed that the comment /* GUC variable */ is currently only above
> > the logical_replication_mode, but actually logical_decoding_work_mem
> > is a GUC variable too. Maybe this should be rearranged somehow then
> > change the comment "GUC variable" -> "GUC variables"?
> >
>
> I think moving these variables together doesn't sound like a good idea
> because the logical_decoding_work_mem variable is defined with other
> related variables. Also, if we are doing the last comment, I think that
> will obviate the need for this.
>
> > ======
> > src/backend/utils/misc/guc_tables.c
> >
> > @@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] =
> >   },
> >
> >   {
> > - {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > + {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> >   gettext_noop("Allows streaming or serializing each change in logical
> > decoding."),
> >   NULL,
> >   GUC_NOT_IN_SAMPLE
> >   },
> > - &logical_decoding_mode,
> > - LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
> > + &logical_replication_mode,
> > + LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options,
> >   NULL, NULL, NULL
> >   },
> >
> > That gettext_noop string seems incorrect. I think Kuroda-san
> > previously reported the same, but then you replied it has been fixed
> > already [1]
> >
> > > I felt the description seems not to be suitable for current behavior.
> > > A short description should be like "Sets a behavior of logical replication", and
> > > further descriptions can be added in lond description.
> > I adjusted the description here.
> >
> > But this doesn't look fixed to me. (??)
> >
>
> Okay, so, how about the following for the 0001 patch:
> short desc: Controls when to replicate each change.
> long desc: On the publisher, it allows streaming or serializing each
> change in logical decoding.
>

I have updated the patch accordingly and it looks good to me. I'll
push this first patch early next week (Monday) unless there are more
comments.

-- 
With Regards,
Amit Kapila.

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit,

> 
> I have updated the patch accordingly and it looks good to me. I'll
> push this first patch early next week (Monday) unless there are more
> comments.

Thanks for updating. I checked v88-0001 and I have no objection. LGTM.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, January 25, 2023 7:30 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for patch v87-0002.

Thanks for your comments.

> ======
> doc/src/sgml/config.sgml
> 
> 1.
>         <para>
> -        Allows streaming or serializing changes immediately in
> logical decoding.
>          The allowed values of
> <varname>logical_replication_mode</varname> are
> -        <literal>buffered</literal> and <literal>immediate</literal>. When
> set
> -        to <literal>immediate</literal>, stream each change if
> +        <literal>buffered</literal> and <literal>immediate</literal>.
> The default
> +        is <literal>buffered</literal>.
> +       </para>
> 
> I didn't think it was necessary to say “of logical_replication_mode”.
> IMO that much is already obvious because this is the first sentence of the
> description for logical_replication_mode.
> 

Changed.

> ~~~
> 
> 2.
> +       <para>
> +        On the publisher side, it allows streaming or serializing changes
> +        immediately in logical decoding.  When set to
> +        <literal>immediate</literal>, stream each change if
>          <literal>streaming</literal> option (see optional parameters set by
>          <link linkend="sql-createsubscription"><command>CREATE
> SUBSCRIPTION</command></link>)
>          is enabled, otherwise, serialize each change.  When set to
> -        <literal>buffered</literal>, which is the default, decoding will stream
> -        or serialize changes when
> <varname>logical_decoding_work_mem</varname>
> -        is reached.
> +        <literal>buffered</literal>, decoding will stream or serialize changes
> +        when <varname>logical_decoding_work_mem</varname> is
> reached.
>         </para>
> 
> 2a.
> "it allows" --> "logical_replication_mode allows"
> 
> 2b.
> "decoding" --> "the decoding"

Changed.

> ~~~
> 
> 3.
> +       <para>
> +        On the subscriber side, if <literal>streaming</literal> option is set
> +        to <literal>parallel</literal>, this parameter also allows the leader
> +        apply worker to send changes to the shared memory queue or to
> serialize
> +        changes.  When set to <literal>buffered</literal>, the leader sends
> +        changes to parallel apply workers via shared memory queue.  When
> set to
> +        <literal>immediate</literal>, the leader serializes all changes to
> +        files and notifies the parallel apply workers to read and apply them at
> +        the end of the transaction.
> +       </para>
> 
> "this parameter also allows" --> "logical_replication_mode also allows"

Changed.

> ~~~
> 
> 4.
>         <para>
>          This parameter is intended to be used to test logical decoding and
>          replication of large transactions for which otherwise we need to
>          generate the changes till
> <varname>logical_decoding_work_mem</varname>
> -        is reached.
> +        is reached. Moreover, this can also be used to test the transmission of
> +        changes between the leader and parallel apply workers.
>         </para>
> 
> "Moreover, this can also" --> "It can also"
> 
> I am wondering would this sentence be better put at the top of the GUC
> description. So then the first paragraph becomes like this:
> 
> 
> SUGGESTION (I've also added another sentence "The effect of...")
> 
> The allowed values are buffered and immediate. The default is buffered. This
> parameter is intended to be used to test logical decoding and replication of large
> transactions for which otherwise we need to generate the changes till
> logical_decoding_work_mem is reached. It can also be used to test the
> transmission of changes between the leader and parallel apply workers. The
> effect of logical_replication_mode is different for the publisher and
> subscriber:
> 
> On the publisher side...
> 
> On the subscriber side...

I think your suggestion makes sense, so changed as suggested.

> ======
> .../replication/logical/applyparallelworker.c
> 
> 5.
> + /*
> + * In immeidate mode, directly return false so that we can switch to
> + * PARTIAL_SERIALIZE mode and serialize remaining changes to files.
> + */
> + if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE) return
> + false;
> 
> Typo "immediate"
> 
> Also, I felt "directly" is not needed. "return false" and "directly return false" is the
> same.
> 
> SUGGESTION
> Using ‘immediate’ mode returns false to cause a switch to PARTIAL_SERIALIZE
> mode so that the remaining changes will be serialized.

Changed.

> ======
> src/backend/utils/misc/guc_tables.c
> 
> 6.
>   {
>   {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> - gettext_noop("Allows streaming or serializing each change in logical
> decoding."),
> - NULL,
> + gettext_noop("Controls the behavior of logical replication publisher
> and subscriber"),
> + gettext_noop("If set to immediate, on the publisher side, it "
> + "allows streaming or serializing each change in "
> + "logical decoding. On the subscriber side, in "
> + "parallel streaming mode, it allows the leader apply "
> + "worker to serialize changes to files and notifies "
> + "the parallel apply workers to read and apply them at "
> + "the end of the transaction."),
>   GUC_NOT_IN_SAMPLE
>   },
> 
> 6a. short description
> 
> User PoV behaviour should be the same. Instead, maybe say "controls the
> internal behavior" or something like that?

Changed to "internal behavior xxx"

> ~
> 
> 6b. long description
> 
> IMO the long description shouldn’t mention ‘immediate’ mode first as it does.
> 
> BEFORE
> If set to immediate, on the publisher side, ...
> 
> AFTER
> On the publisher side, ...

Changed.

Attached is the new version patch set.
The 0001 patch is the same as the v88-0001 posted by Amit [1];
it is attached here to make cfbot happy.

[1] https://www.postgresql.org/message-id/CAA4eK1JpWoaB63YULpQa1KDw_zBW-QoRMuNxuiP1KafPJzuVuw%40mail.gmail.com

Best Regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thank you for updating the patch! Followings are comments.

1. config.sgml

```
+        the changes till logical_decoding_work_mem is reached. It can also be
```

I think it should be sandwiched by <varname>.

2. config.sgml

```
+        On the publisher side, <varname>logical_replication_mode</varname> allows
+        allows streaming or serializing changes immediately in logical decoding.
```

Typo "allows allows" -> "allows"

3. test general

You confirmed that the leader started to serialize changes, but did not verify the end point.
IIUC the parallel apply worker exits after applying the serialized changes, and that is not tested yet.
Can we add polling of the log somewhere to confirm it?


4. 015_stream.pl

```
+is($result, qq(15000), 'all changes are replayed from file')
```

The statement may be unclear because changes can be also replicated when streaming = on.
How about: "parallel apply worker replayed all changes from file"?


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Wed, Jan 25, 2023 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 25, 2023 at 10:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 25, 2023 at 3:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > 1.
> > > @@ -210,7 +210,7 @@ int logical_decoding_work_mem;
> > >  static const Size max_changes_in_memory = 4096; /* XXX for restore only */
> > >
> > >  /* GUC variable */
> > > -int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED;
> > > +int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED;
> > >
> > >
> > > I noticed that the comment /* GUC variable */ is currently only above
> > > the logical_replication_mode, but actually logical_decoding_work_mem
> > > is a GUC variable too. Maybe this should be rearranged somehow then
> > > change the comment "GUC variable" -> "GUC variables"?
> > >
> >
> > I think moving these variables together doesn't sound like a good idea
> > because the logical_decoding_work_mem variable is defined with other
> > related variables. Also, if we are doing the last comment, I think that
> > will obviate the need for this.
> >
> > > ======
> > > src/backend/utils/misc/guc_tables.c
> > >
> > > @@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] =
> > >   },
> > >
> > >   {
> > > - {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > > + {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > >   gettext_noop("Allows streaming or serializing each change in logical
> > > decoding."),
> > >   NULL,
> > >   GUC_NOT_IN_SAMPLE
> > >   },
> > > - &logical_decoding_mode,
> > > - LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
> > > + &logical_replication_mode,
> > > + LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options,
> > >   NULL, NULL, NULL
> > >   },
> > >
> > > That gettext_noop string seems incorrect. I think Kuroda-san
> > > previously reported the same, but then you replied it has been fixed
> > > already [1]
> > >
> > > > I felt the description seems not to be suitable for current behavior.
> > > > A short description should be like "Sets a behavior of logical replication", and
> > > > further descriptions can be added in the long description.
> > > I adjusted the description here.
> > >
> > > But this doesn't look fixed to me. (??)
> > >
> >
> > Okay, so, how about the following for the 0001 patch:
> > short desc: Controls when to replicate each change.
> > long desc: On the publisher, it allows streaming or serializing each
> > change in logical decoding.
> >
>
> I have updated the patch accordingly and it looks good to me. I'll
> push this first patch early next week (Monday) unless there are more
> comments.

The patch looks good to me too. Thank you for the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Patch v88-0001 LGTM.

Below are just some minor review comments about the commit message.

======
Commit message

1.
We have discussed having this parameter as a subscription option but
exposing a parameter that is primarily used for testing/debugging to users
didn't seem advisable and there is no other such parameter. The other
option we have discussed is to have a separate GUC for subscriber-side
testing but it appears that for the current testing existing parameter is
sufficient and avoids adding another GUC.

SUGGESTION
We discussed exposing this parameter as a subscription option, but it
did not seem advisable since it is primarily used for
testing/debugging and there is no other such developer option.

We also discussed having separate GUCs for publisher/subscriber-side,
but for current testing/debugging requirements, one GUC is sufficient.

~~

2.
Reviewed-by: Pater Smith, Kuroda Hayato, Amit Kapila

"Pater" --> "Peter"

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Here are my review comments for v88-0002.

======
General

1.
The test cases are checking the log content but they are not checking
for debug logs or untranslated elogs -- they are expecting a normal
ereport LOG that might be translated. I’m not sure if that is OK, or
if it is a potential problem.

======
doc/src/sgml/config.sgml

2.
On the publisher side, logical_replication_mode allows allows
streaming or serializing changes immediately in logical decoding. When
set to immediate, stream each change if streaming option (see optional
parameters set by CREATE SUBSCRIPTION) is enabled, otherwise,
serialize each change. When set to buffered, the decoding will stream
or serialize changes when logical_decoding_work_mem is reached.

2a.
typo "allows allows"  (Kuroda-san reported same)

2b.
"if streaming option" --> "if the streaming option"

~~~

3.
On the subscriber side, if streaming option is set to parallel,
logical_replication_mode also allows the leader apply worker to send
changes to the shared memory queue or to serialize changes.

SUGGESTION
On the subscriber side, if the streaming option is set to parallel,
logical_replication_mode can be used to direct the leader apply worker
to send changes to the shared memory queue or to serialize changes.

======
src/backend/utils/misc/guc_tables.c

4.
  {
  {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
- gettext_noop("Controls when to replicate each change."),
- gettext_noop("On the publisher, it allows streaming or serializing
each change in logical decoding."),
+ gettext_noop("Controls the internal behavior of logical replication
publisher and subscriber"),
+ gettext_noop("On the publisher, it allows streaming or "
+ "serializing each change in logical decoding. On the "
+ "subscriber, in parallel streaming mode, it allows "
+ "the leader apply worker to serialize changes to "
+ "files and notifies the parallel apply workers to "
+ "read and apply them at the end of the transaction."),
  GUC_NOT_IN_SAMPLE
  },
Suggest re-wording the long description (subscriber part) to be more
like the documentation text.

BEFORE
On the subscriber, in parallel streaming mode, it allows the leader
apply worker to serialize changes to files and notifies the parallel
apply workers to read and apply them at the end of the transaction.

SUGGESTION
On the subscriber, if the streaming option is set to parallel, it
directs the leader apply worker to send changes to the shared memory
queue or to serialize changes and apply them at the end of the
transaction.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, Jan 30, 2023 at 5:40 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Patch v88-0001 LGTM.
>

Pushed.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 30, 2023 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for v88-0002.

Thanks for your comments.

> 
> ======
> General
> 
> 1.
> The test cases are checking the log content but they are not checking for
> debug logs or untranslated elogs -- they are expecting a normal ereport LOG
> that might be translated. I’m not sure if that is OK, or if it is a potential problem.

We have tests that check ereport ERROR and ereport WARNING messages (by
searching for the ERROR or WARNING keywords across all the TAP tests), so I
think checking the LOG should be fine.

> ======
> doc/src/sgml/config.sgml
> 
> 2.
> On the publisher side, logical_replication_mode allows allows streaming or
> serializing changes immediately in logical decoding. When set to immediate,
> stream each change if streaming option (see optional parameters set by
> CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. When set
> to buffered, the decoding will stream or serialize changes when
> logical_decoding_work_mem is reached.
> 
> 2a.
> typo "allows allows"  (Kuroda-san reported same)
> 
> 2b.
> "if streaming option" --> "if the streaming option"

Changed.

> ~~~
> 
> 3.
> On the subscriber side, if streaming option is set to parallel,
> logical_replication_mode also allows the leader apply worker to send changes
> to the shared memory queue or to serialize changes.
> 
> SUGGESTION
> On the subscriber side, if the streaming option is set to parallel,
> logical_replication_mode can be used to direct the leader apply worker to
> send changes to the shared memory queue or to serialize changes.

Changed.

> ======
> src/backend/utils/misc/guc_tables.c
> 
> 4.
>   {
>   {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> - gettext_noop("Controls when to replicate each change."),
> - gettext_noop("On the publisher, it allows streaming or serializing each
> change in logical decoding."),
> + gettext_noop("Controls the internal behavior of logical replication
> publisher and subscriber"),
> + gettext_noop("On the publisher, it allows streaming or "
> + "serializing each change in logical decoding. On the "
> + "subscriber, in parallel streaming mode, it allows "
> + "the leader apply worker to serialize changes to "
> + "files and notifies the parallel apply workers to "
> + "read and apply them at the end of the transaction."),
>   GUC_NOT_IN_SAMPLE
>   },
> Suggest re-wording the long description (subscriber part) to be more like the
> documentation text.
> 
> BEFORE
> On the subscriber, in parallel streaming mode, it allows the leader apply worker
> to serialize changes to files and notifies the parallel apply workers to read and
> apply them at the end of the transaction.
> 
> SUGGESTION
> On the subscriber, if the streaming option is set to parallel, it directs the leader
> apply worker to send changes to the shared memory queue or to serialize
> changes and apply them at the end of the transaction.
> 

Changed.

Attached is the new version of the patch, which addresses all comments so far
(v88-0001 has been committed, so we only have one remaining patch this time).

Best Regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, January 26, 2023 11:37 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Followings are comments.

Thanks for the comments.

> In this test the rollback-prepared seems not to be executed. This is because
> serializations are finished while handling PREPARE message and the final
> state of transaction does not affect that, right? I think it may be helpful
> to add a one line comment.

Yes, but I am slightly unsure whether it would be helpful to add this, as we only
test basic cases (mainly for code coverage) for partial serialization.

> 
> 1. config.sgml
> 
> ```
> +        the changes till logical_decoding_work_mem is reached. It can also
> be
> ```
> 
> I think it should be sandwiched by <varname>.

Added.

> 
> 2. config.sgml
> 
> ```
> +        On the publisher side,
> <varname>logical_replication_mode</varname> allows
> +        allows streaming or serializing changes immediately in logical
> decoding.
> ```
> 
> Typo "allows allows" -> "allows"

Fixed.

> 3. test general
> 
> You confirmed that the leader started to serialize changes, but did not ensure
> the endpoint.
> IIUC the parallel apply worker exits after applying serialized changes, and it is
> not tested yet.
> Can we add polling the log somewhere?

I checked other tests and didn't find any examples where we test the exit of an
apply worker or table sync worker. And if the parallel apply worker doesn't stop in
this case, we will fail anyway when reusing this worker to handle the next
transaction because the queue is broken. So, I prefer to keep the tests short.

> 4. 015_stream.pl
> 
> ```
> +is($result, qq(15000), 'all changes are replayed from file')
> ```
> 
> The statement may be unclear because changes can be also replicated when
> streaming = on.
> How about: "parallel apply worker replayed all changes from file"?

Changed.

Best regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thank you for updating the patch!
I checked your replies and new patch, and it seems good.
Currently I have no comments.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Mon, Jan 30, 2023 at 3:23 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for v88-0002.
>
> Thanks for your comments.
>
> >
> > ======
> > General
> >
> > 1.
> > The test cases are checking the log content but they are not checking for
> > debug logs or untranslated elogs -- they are expecting a normal ereport LOG
> > that might be translated. I’m not sure if that is OK, or if it is a potential problem.
>
> We have tests that check the ereport ERROR and ereport WARNING message(by
> search for the ERROR or WARNING keyword for all the tap tests), so I think
> checking the LOG should be fine.
>
> > ======
> > doc/src/sgml/config.sgml
> >
> > 2.
> > On the publisher side, logical_replication_mode allows allows streaming or
> > serializing changes immediately in logical decoding. When set to immediate,
> > stream each change if streaming option (see optional parameters set by
> > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. When set
> > to buffered, the decoding will stream or serialize changes when
> > logical_decoding_work_mem is reached.
> >
> > 2a.
> > typo "allows allows"  (Kuroda-san reported same)
> >
> > 2b.
> > "if streaming option" --> "if the streaming option"
>
> Changed.
>
> > ~~~
> >
> > 3.
> > On the subscriber side, if streaming option is set to parallel,
> > logical_replication_mode also allows the leader apply worker to send changes
> > to the shared memory queue or to serialize changes.
> >
> > SUGGESTION
> > On the subscriber side, if the streaming option is set to parallel,
> > logical_replication_mode can be used to direct the leader apply worker to
> > send changes to the shared memory queue or to serialize changes.
>
> Changed.
>
> > ======
> > src/backend/utils/misc/guc_tables.c
> >
> > 4.
> >   {
> >   {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > - gettext_noop("Controls when to replicate each change."),
> > - gettext_noop("On the publisher, it allows streaming or serializing each
> > change in logical decoding."),
> > + gettext_noop("Controls the internal behavior of logical replication
> > publisher and subscriber"),
> > + gettext_noop("On the publisher, it allows streaming or "
> > + "serializing each change in logical decoding. On the "
> > + "subscriber, in parallel streaming mode, it allows "
> > + "the leader apply worker to serialize changes to "
> > + "files and notifies the parallel apply workers to "
> > + "read and apply them at the end of the transaction."),
> >   GUC_NOT_IN_SAMPLE
> >   },
> > Suggest re-wording the long description (subscriber part) to be more like the
> > documentation text.
> >
> > BEFORE
> > On the subscriber, in parallel streaming mode, it allows the leader apply worker
> > to serialize changes to files and notifies the parallel apply workers to read and
> > apply them at the end of the transaction.
> >
> > SUGGESTION
> > On the subscriber, if the streaming option is set to parallel, it directs the leader
> > apply worker to send changes to the shared memory queue or to serialize
> > changes and apply them at the end of the transaction.
> >
>
> Changed.
>
> Attach the new version patch which addressed all comments so far (the v88-0001
> has been committed, so we only have one remaining patch this time).
>

I have one comment on v89 patch:

+       /*
+        * Using 'immediate' mode returns false to cause a switch to
+        * PARTIAL_SERIALIZE mode so that the remaining changes will
be serialized.
+        */
+       if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE)
+               return false;
+

Probably we might want to add unlikely() here since we could pass
through this path very frequently?
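
As a sketch (not the actual patch; just to illustrate the suggestion), the
check could be wrapped with the unlikely() macro from c.h:

    /*
     * Developer option: 'immediate' forces a switch to PARTIAL_SERIALIZE
     * mode.  This test runs for every change, so hint to the compiler that
     * the branch is normally not taken.
     */
    if (unlikely(logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE))
        return false;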

The rest looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Jan 30, 2023 at 5:23 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my review comments for v88-0002.
>
> Thanks for your comments.
>
> >
> > ======
> > General
> >
> > 1.
> > The test cases are checking the log content but they are not checking for
> > debug logs or untranslated elogs -- they are expecting a normal ereport LOG
> > that might be translated. I’m not sure if that is OK, or if it is a potential problem.
>
> We have tests that check the ereport ERROR and ereport WARNING message(by
> search for the ERROR or WARNING keyword for all the tap tests), so I think
> checking the LOG should be fine.
>
> > ======
> > doc/src/sgml/config.sgml
> >
> > 2.
> > On the publisher side, logical_replication_mode allows allows streaming or
> > serializing changes immediately in logical decoding. When set to immediate,
> > stream each change if streaming option (see optional parameters set by
> > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. When set
> > to buffered, the decoding will stream or serialize changes when
> > logical_decoding_work_mem is reached.
> >
> > 2a.
> > typo "allows allows"  (Kuroda-san reported same)
> >
> > 2b.
> > "if streaming option" --> "if the streaming option"
>
> Changed.

Although you replied "Changed" for the above, AFAICT my review comment
#2b. was accidentally missed.

Otherwise, the patch LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, January 30, 2023 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> 
> I have one comment on v89 patch:
> 
> +       /*
> +        * Using 'immediate' mode returns false to cause a switch to
> +        * PARTIAL_SERIALIZE mode so that the remaining changes will
> be serialized.
> +        */
> +       if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE)
> +               return false;
> +
> 
> Probably we might want to add unlikely() here since we could pass through this
> path very frequently?

I think your comment makes sense, thanks.
I updated the patch for the same.

Best Regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, January 31, 2023 8:23 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> On Mon, Jan 30, 2023 at 5:23 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, January 30, 2023 12:13 PM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > >
> > > Here are my review comments for v88-0002.
> >
> > Thanks for your comments.
> >
> > >
> > > ======
> > > General
> > >
> > > 1.
> > > The test cases are checking the log content but they are not
> > > checking for debug logs or untranslated elogs -- they are expecting
> > > a normal ereport LOG that might be translated. I’m not sure if that is OK, or
> if it is a potential problem.
> >
> > We have tests that check the ereport ERROR and ereport WARNING
> > message(by search for the ERROR or WARNING keyword for all the tap
> > tests), so I think checking the LOG should be fine.
> >
> > > ======
> > > doc/src/sgml/config.sgml
> > >
> > > 2.
> > > On the publisher side, logical_replication_mode allows allows
> > > streaming or serializing changes immediately in logical decoding.
> > > When set to immediate, stream each change if streaming option (see
> > > optional parameters set by CREATE SUBSCRIPTION) is enabled,
> > > otherwise, serialize each change. When set to buffered, the decoding
> > > will stream or serialize changes when logical_decoding_work_mem is
> reached.
> > >
> > > 2a.
> > > typo "allows allows"  (Kuroda-san reported same)
> > >
> > > 2b.
> > > "if streaming option" --> "if the streaming option"
> >
> > Changed.
> 
> Although you replied "Changed" for the above, AFAICT my review comment
> #2b. was accidentally missed.

Fixed.

Best Regards,
Hou zj

Thanks for the updates to address all of my previous review comments.

Patch v90-0001 LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Jan 31, 2023 at 9:04 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> I think your comment makes sense, thanks.
> I updated the patch for the same.
>

The patch looks mostly good to me. I have made a few changes in the
comments and docs, see attached.

-- 
With Regards,
Amit Kapila.

Attachment
Some minor review comments for v91-0001

======
doc/src/sgml/config.sgml

1.
        <para>
-        Allows streaming or serializing changes immediately in
logical decoding.
-        The allowed values of <varname>logical_replication_mode</varname> are
-        <literal>buffered</literal> and <literal>immediate</literal>. When set
-        to <literal>immediate</literal>, stream each change if
+        The allowed values are <literal>buffered</literal> and
+        <literal>immediate</literal>. The default is
<literal>buffered</literal>.
+        This parameter is intended to be used to test logical decoding and
+        replication of large transactions for which otherwise we need
to generate
+        the changes till <varname>logical_decoding_work_mem</varname> is
+        reached.  The effect of <varname>logical_replication_mode</varname> is
+        different for the publisher and subscriber:
+       </para>

The "for which otherwise..." part is only relevant for the
publisher-side. So it seemed slightly strange to give the reason why
to use the GUC for one side but not the other side.

Maybe we can just to remove that "for which otherwise..." part, since
the logical_decoding_work_mem gets mentioned later in the "On the
publisher side,..." paragraph anyway.

~~~

2.
        <para>
-        This parameter is intended to be used to test logical decoding and
-        replication of large transactions for which otherwise we need to
-        generate the changes till <varname>logical_decoding_work_mem</varname>
-        is reached.
+        On the subscriber side, if the <literal>streaming</literal>
option is set to
+        <literal>parallel</literal>,
<varname>logical_replication_mode</varname>
+        can be used to direct the leader apply worker to send changes to the
+        shared memory queue or to serialize changes to the file.  When set to
+        <literal>buffered</literal>, the leader sends changes to parallel apply
+        workers via a shared memory queue.  When set to
+        <literal>immediate</literal>, the leader serializes all
changes to files
+        and notifies the parallel apply workers to read and apply them at the
+        end of the transaction.
        </para>

"or serialize changes to the file." --> "or serialize all changes to
files." (just to use same wording as later in this same paragraph, and
also same wording as the GUC hint text).

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Feb 2, 2023 at 4:52 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Some minor review comments for v91-0001
>

Pushed this yesterday after addressing your comments!

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, February 3, 2023 11:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Feb 2, 2023 at 4:52 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > Some minor review comments for v91-0001
> >
> 
> Pushed this yesterday after addressing your comments!

Thanks for pushing.

Currently, we have two remaining patches which we are not sure are worth
committing for now. I am sharing them here for reference.

0001:

Based on our discussion[1] on -hackers, it's not clear whether it's necessary
to add the sub-feature to stop extra workers when
max_parallel_apply_workers_per_subscription is reduced, because:

- it's not clear whether reducing 'max_parallel_apply_workers_per_subscription' is
  very common.
- even when the GUC is reduced, at that point in time all the workers might be
  in use so there may be nothing that can be immediately done.
- IIUC the excess workers (for a reduced GUC) are going to get freed naturally
  anyway over time as more transactions are completed so the pool size will
  reduce accordingly.

And given that the logic of this patch is simple, it would be easy to add this
at a later point if we really see a use case for this.

0002:

All the deadlock errors and other errors caused by parallel streaming will be
logged, so the user can check this kind of ERROR and disable parallel streaming
mode to resolve it. Besides, for this retry feature, it would be hard to
distinguish whether an ERROR was caused by parallel streaming, and we might need
to retry in serialize mode for all kinds of ERROR. So, it's not very clear
whether automatically retrying in serialize mode in case of any ERROR during
parallel streaming is necessary or not. And we can also add this when we see a
use case.

[1] https://www.postgresql.org/message-id/CAA4eK1LotEuPsteuJMNpixxTj6R4B8k93q-6ruRmDzCxKzMNpA%40mail.gmail.com

Best Regards,
Hou zj

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Feb 3, 2023 at 12:29 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, February 3, 2023 11:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 2, 2023 at 4:52 AM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > >
> > > Some minor review comments for v91-0001
> > >
> >
> > Pushed this yesterday after addressing your comments!
>
> Thanks for pushing.
>
> Currently, we have two remaining patches which we are not sure whether it's worth
> committing for now. Just share them here for reference.
>
> 0001:
>
> Based on our discussion[1] on -hackers, it's not clear that if it's necessary
> to add the sub-feature to stop extra worker when
> max_apply_workers_per_suibscription is reduced. Because:
>
> - it's not clear whether reducing the 'max_apply_workers_per_suibscription' is very
>   common.

A use case I'm concerned about is a temporarily intensive data load,
for example, a data loading batch job in a maintenance window. In this
case, the user might want to temporarily increase
max_parallel_workers_per_subscription in order to avoid a large
replication lag, and revert the change back to normal after the job.
If it's unlikely to stream the changes in the regular workload as
logical_decoding_work_mem is big enough to handle the regular
transaction data, the excess parallel workers won't exit. Another
subscription might want to use parallel workers but there might not be
free workers. That's why I thought we need to free the excess workers
at some point.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Feb 3, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Feb 3, 2023 at 12:29 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, February 3, 2023 11:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Feb 2, 2023 at 4:52 AM Peter Smith <smithpb2250@gmail.com>
> > > wrote:
> > > >
> > > > Some minor review comments for v91-0001
> > > >
> > >
> > > Pushed this yesterday after addressing your comments!
> >
> > Thanks for pushing.
> >
> > Currently, we have two remaining patches which we are not sure whether it's worth
> > committing for now. Just share them here for reference.
> >
> > 0001:
> >
> > Based on our discussion[1] on -hackers, it's not clear that if it's necessary
> > to add the sub-feature to stop extra worker when
> > max_apply_workers_per_suibscription is reduced. Because:
> >
> > - it's not clear whether reducing the 'max_apply_workers_per_suibscription' is very
> >   common.
>
> A use case I'm concerned about is a temporarily intensive data load,
> for example, a data loading batch job in a maintenance window. In this
> case, the user might want to temporarily increase
> max_parallel_workers_per_subscription in order to avoid a large
> replication lag, and revert the change back to normal after the job.
> If it's unlikely to stream the changes in the regular workload as
> logical_decoding_work_mem is big enough to handle the regular
> transaction data, the excess parallel workers won't exit.
>

Won't in such a case, it would be better to just switch off the
parallel option for a subscription? We need to think of a predictable
way to test this path which may not be difficult. But I guess it would
be better to wait for some feedback from the field about this feature
before adding more to it and anyway it shouldn't be a big deal to add
this later as well.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
Hi, 

while reading the code, I noticed that in pa_send_data() we set the wait event
to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while sending the
message to the queue. Because this state is used in multiple places, users might
not be able to distinguish what they are waiting for. So it seems better to use
WAIT_EVENT_MQ_SEND here, which will be easier to distinguish and understand.
Here is a tiny patch for that.

Best Regards,
Hou zj

Attachment

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

> while reading the code, I noticed that in pa_send_data() we set wait event
> to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while sending
> the
> message to the queue. Because this state is used in multiple places, user might
> not be able to distinguish what they are waiting for. So It seems we'd better
> to use WAIT_EVENT_MQ_SEND here which will be eaier to distinguish and
> understand. Here is a tiny patch for that.

In LogicalParallelApplyLoop(), we introduced the new wait event
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_MAIN, whereas it practically waits on a shared
message queue and seems to be the same as WAIT_EVENT_MQ_RECEIVE.
Do you have a policy of reusing an existing event instead of adding a new one?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Monday, February 6, 2023 6:34 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
> > while reading the code, I noticed that in pa_send_data() we set wait
> > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> sending
> > the message to the queue. Because this state is used in multiple
> > places, user might not be able to distinguish what they are waiting
> > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > be eaier to distinguish and understand. Here is a tiny patch for that.
> 
> In LogicalParallelApplyLoop(), we introduced the new wait event
> WAIT_EVENT_LOGICAL_PARALLEL_APPLY_MAIN whereas it is practically waits a
> shared message queue and it seems to be same as WAIT_EVENT_MQ_RECEIVE.
> Do you have a policy to reuse the event instead of adding a new event?

I think PARALLEL_APPLY_MAIN waits for two kinds of events: 1) waiting for a new
message from the queue, and 2) waiting for the partial file state to be set. So, I
think introducing a new general event for them is better, and it is also
consistent with the WAIT_EVENT_LOGICAL_APPLY_MAIN which is used in the main
loop of the leader apply worker (LogicalRepApplyLoop). But the event in
pa_send_data() is only for the message send, so it seems fine to use
WAIT_EVENT_MQ_SEND; besides, MQ_SEND is also unique in the parallel apply worker,
so users can distinguish it without adding a new event.
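
To illustrate the distinction, here is a rough sketch (not the committed code;
winfo->mq_handle and SHM_SEND_RETRY_INTERVAL_MS are assumed names) of what the
send loop conceptually looks like: shm_mq_send() is called in non-blocking
mode, and a wait event is only reported while waiting for the queue to drain.

    for (;;)
    {
        shm_mq_result result;

        /* nowait = true, so shm_mq_send() itself never blocks */
        result = shm_mq_send(winfo->mq_handle, nbytes, data, true, true);

        if (result == SHM_MQ_SUCCESS)
            break;
        else if (result == SHM_MQ_DETACHED)
            ereport(ERROR,
                    (errcode(ERRCODE_CONNECTION_FAILURE),
                     errmsg("could not send data to shared-memory queue")));

        /* The queue is full; wait a bit and report why we are waiting. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         SHM_SEND_RETRY_INTERVAL_MS,
                         WAIT_EVENT_MQ_SEND);
        ResetLatch(MyLatch);
    }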

Best Regards,
Hou zj

RE: Perform streaming logical transactions by background workers and parallel apply

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

> I think PARALLEL_APPLY_MAIN waits for two kinds of event: 1) wait for new
> message from the queue 2) wait for the partial file state to be set. So, I
> think introducing a new general event for them is better and it is also
> consistent with the WAIT_EVENT_LOGICAL_APPLY_MAIN which is used in the
> main
> loop of leader apply worker(LogicalRepApplyLoop). But the event in
> pg_send_data() is only for message send, so it seems fine to use
> WAIT_EVENT_MQ_SEND, besides MQ_SEND is also unique in parallel apply
> worker and
> user can distinglish without adding new event.

Thank you for your explanation. I think what both of you said is reasonable.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> while reading the code, I noticed that in pa_send_data() we set wait event
> to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while sending the
> message to the queue. Because this state is used in multiple places, user might
> not be able to distinguish what they are waiting for. So It seems we'd better
> to use WAIT_EVENT_MQ_SEND here which will be eaier to distinguish and
> understand. Here is a tiny patch for that.
>

Thanks for noticing this. The patch LGTM. I'll push this in some time.

-- 
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Fri, Feb 3, 2023 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 3, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Feb 3, 2023 at 12:29 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Friday, February 3, 2023 11:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Feb 2, 2023 at 4:52 AM Peter Smith <smithpb2250@gmail.com>
> > > > wrote:
> > > > >
> > > > > Some minor review comments for v91-0001
> > > > >
> > > >
> > > > Pushed this yesterday after addressing your comments!
> > >
> > > Thanks for pushing.
> > >
> > > Currently, we have two remaining patches which we are not sure whether it's worth
> > > committing for now. Just share them here for reference.
> > >
> > > 0001:
> > >
> > > Based on our discussion[1] on -hackers, it's not clear that if it's necessary
> > > to add the sub-feature to stop extra worker when
> > > max_apply_workers_per_suibscription is reduced. Because:
> > >
> > > - it's not clear whether reducing the 'max_apply_workers_per_suibscription' is very
> > >   common.
> >
> > A use case I'm concerned about is a temporarily intensive data load,
> > for example, a data loading batch job in a maintenance window. In this
> > case, the user might want to temporarily increase
> > max_parallel_workers_per_subscription in order to avoid a large
> > replication lag, and revert the change back to normal after the job.
> > If it's unlikely to stream the changes in the regular workload as
> > logical_decoding_work_mem is big enough to handle the regular
> > transaction data, the excess parallel workers won't exit.
> >
>
> Won't in such a case, it would be better to just switch off the
> parallel option for a subscription?

Not sure. Changing the parameter would be easier since it doesn't
require restarts.

> We need to think of a predictable
> way to test this path which may not be difficult. But I guess it would
> be better to wait for some feedback from the field about this feature
> before adding more to it and anyway it shouldn't be a big deal to add
> this later as well.

Agreed to hear some feedback before adding it. It's not an urgent feature.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Feb 7, 2023 at 12:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Feb 3, 2023 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > We need to think of a predictable
> > way to test this path which may not be difficult. But I guess it would
> > be better to wait for some feedback from the field about this feature
> > before adding more to it and anyway it shouldn't be a big deal to add
> > this later as well.
>
> Agreed to hear some feedback before adding it. It's not an urgent feature.
>

Okay, Thanks! AFAIK, there is no pending patch left in this proposal.
If so, I think it is better to close the corresponding CF entry.

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"wangw.fnst@fujitsu.com"
Date:
On Tue, Feb 7, 2023 15:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 7, 2023 at 12:41 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Fri, Feb 3, 2023 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > We need to think of a predictable
> > > way to test this path which may not be difficult. But I guess it would
> > > be better to wait for some feedback from the field about this feature
> > > before adding more to it and anyway it shouldn't be a big deal to add
> > > this later as well.
> >
> > Agreed to hear some feedback before adding it. It's not an urgent feature.
> >
> 
> Okay, Thanks! AFAIK, there is no pending patch left in this proposal.
> If so, I think it is better to close the corresponding CF entry.

Yes, I think so.
Closed this CF entry.

Regards,
Wang Wei

RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, February 7, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > while reading the code, I noticed that in pa_send_data() we set wait
> > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> sending
> > the message to the queue. Because this state is used in multiple
> > places, user might not be able to distinguish what they are waiting
> > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > be eaier to distinguish and understand. Here is a tiny patch for that.
> >

As discussed[1], we'd better invent a new state for this purpose, so here is the patch
that does the same.

[1] https://www.postgresql.org/message-id/CAA4eK1LTud4FLRbS0QqdZ-pjSxwfFLHC1Dx%3D6Q7nyROCvvPSfw%40mail.gmail.com

Best Regards,
Hou zj

Attachment
On Fri, Feb 10, 2023 at 1:32 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, February 7, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > while reading the code, I noticed that in pa_send_data() we set wait
> > > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> > sending
> > > the message to the queue. Because this state is used in multiple
> > > places, user might not be able to distinguish what they are waiting
> > > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > > be eaier to distinguish and understand. Here is a tiny patch for that.
> > >
>
> As discussed[1], we'd better invent a new state for this purpose, so here is the patch
> that does the same.
>
> [1] https://www.postgresql.org/message-id/CAA4eK1LTud4FLRbS0QqdZ-pjSxwfFLHC1Dx%3D6Q7nyROCvvPSfw%40mail.gmail.com
>

My first impression was the
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed misleading
because that makes it sound like the parallel apply worker is doing
the sending, but IIUC it's really the opposite.

And since WAIT_EVENT_LOGICAL_PARALLEL_APPLY_LEADER_SEND_DATA seems too
verbose, how about shortening the prefix for both events? E.g.

BEFORE
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA,
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE,

AFTER
WAIT_EVENT_LOGICAL_PA_LEADER_SEND_DATA,
WAIT_EVENT_LOGICAL_PA_STATE_CHANGE,

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, Feb 10, 2023 at 8:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Feb 10, 2023 at 1:32 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, February 7, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > while reading the code, I noticed that in pa_send_data() we set wait
> > > > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> > > sending
> > > > the message to the queue. Because this state is used in multiple
> > > > places, user might not be able to distinguish what they are waiting
> > > > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > > > be eaier to distinguish and understand. Here is a tiny patch for that.
> > > >
> >
> > As discussed[1], we'd better invent a new state for this purpose, so here is the patch
> > that does the same.
> >
> > [1] https://www.postgresql.org/message-id/CAA4eK1LTud4FLRbS0QqdZ-pjSxwfFLHC1Dx%3D6Q7nyROCvvPSfw%40mail.gmail.com
> >
>
> My first impression was the
> WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed misleading
> because that makes it sound like the parallel apply worker is doing
> the sending, but IIUC it's really the opposite.
>

So, how about WAIT_EVENT_LOGICAL_APPLY_SEND_DATA?

> And since WAIT_EVENT_LOGICAL_PARALLEL_APPLY_LEADER_SEND_DATA seems too
> verbose, how about shortening the prefix for both events? E.g.
>
> BEFORE
> WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA,
> WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE,
>
> AFTER
> WAIT_EVENT_LOGICAL_PA_LEADER_SEND_DATA,
> WAIT_EVENT_LOGICAL_PA_STATE_CHANGE,
>

I am not sure *_PA_LEADER_* is any better than what Hou-San has proposed.

-- 
With Regards,
Amit Kapila.



On Tue, Feb 14, 2023 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 10, 2023 at 8:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Feb 10, 2023 at 1:32 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, February 7, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > while reading the code, I noticed that in pa_send_data() we set wait
> > > > > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> > > > sending
> > > > > the message to the queue. Because this state is used in multiple
> > > > > places, user might not be able to distinguish what they are waiting
> > > > > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > > > > be eaier to distinguish and understand. Here is a tiny patch for that.
> > > > >
> > >
> > > As discussed[1], we'd better invent a new state for this purpose, so here is the patch
> > > that does the same.
> > >
> > > [1] https://www.postgresql.org/message-id/CAA4eK1LTud4FLRbS0QqdZ-pjSxwfFLHC1Dx%3D6Q7nyROCvvPSfw%40mail.gmail.com
> > >
> >
> > My first impression was the
> > WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed misleading
> > because that makes it sound like the parallel apply worker is doing
> > the sending, but IIUC it's really the opposite.
> >
>
> So, how about WAIT_EVENT_LOGICAL_APPLY_SEND_DATA?
>

Yes, IIUC all the LR events are named WAIT_EVENT_LOGICAL_xxx.

So names like the below seem correct format:

a) WAIT_EVENT_LOGICAL_APPLY_SEND_DATA
b) WAIT_EVENT_LOGICAL_LEADER_SEND_DATA
c) WAIT_EVENT_LOGICAL_LEADER_APPLY_SEND_DATA

Of those, I prefer option c) because saying LEADER_APPLY_xxx matches
the name format of the existing
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Perform streaming logical transactions by background workers and parallel apply

From
Masahiko Sawada
Date:
On Tue, Feb 14, 2023 at 3:58 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, Feb 14, 2023 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 10, 2023 at 8:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Fri, Feb 10, 2023 at 1:32 PM houzj.fnst@fujitsu.com
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, February 7, 2023 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Feb 6, 2023 at 3:43 PM houzj.fnst@fujitsu.com
> > > > > <houzj.fnst@fujitsu.com> wrote:
> > > > > >
> > > > > > while reading the code, I noticed that in pa_send_data() we set wait
> > > > > > event to WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE while
> > > > > sending
> > > > > > the message to the queue. Because this state is used in multiple
> > > > > > places, user might not be able to distinguish what they are waiting
> > > > > > for. So It seems we'd better to use WAIT_EVENT_MQ_SEND here which will
> > > > > > be eaier to distinguish and understand. Here is a tiny patch for that.
> > > > > >
> > > >
> > > > As discussed[1], we'd better invent a new state for this purpose, so here is the patch
> > > > that does the same.
> > > >
> > > > [1]
https://www.postgresql.org/message-id/CAA4eK1LTud4FLRbS0QqdZ-pjSxwfFLHC1Dx%3D6Q7nyROCvvPSfw%40mail.gmail.com
> > > >
> > >
> > > My first impression was the
> > > WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed misleading
> > > because that makes it sound like the parallel apply worker is doing
> > > the sending, but IIUC it's really the opposite.
> > >
> >
> > So, how about WAIT_EVENT_LOGICAL_APPLY_SEND_DATA?
> >
>
> Yes, IIUC all the LR events are named WAIT_EVENT_LOGICAL_xxx.
>
> So names like the below seem correct format:
>
> a) WAIT_EVENT_LOGICAL_APPLY_SEND_DATA
> b) WAIT_EVENT_LOGICAL_LEADER_SEND_DATA
> c) WAIT_EVENT_LOGICAL_LEADER_APPLY_SEND_DATA

Personally I'm fine even without "LEADER" in the wait event name since
we don't have "who is waiting" in it. IIUC a row of pg_stat_activity
shows who, and the wait event name shows "what the process is
waiting". So I prefer (a).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Feb 14, 2023 at 7:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 14, 2023 at 3:58 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Tue, Feb 14, 2023 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 10, 2023 at 8:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > My first impression was the
> > > > WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed misleading
> > > > because that makes it sound like the parallel apply worker is doing
> > > > the sending, but IIUC it's really the opposite.
> > > >
> > >
> > > So, how about WAIT_EVENT_LOGICAL_APPLY_SEND_DATA?
> > >
> >
> > Yes, IIUC all the LR events are named WAIT_EVENT_LOGICAL_xxx.
> >
> > So names like the below seem correct format:
> >
> > a) WAIT_EVENT_LOGICAL_APPLY_SEND_DATA
> > b) WAIT_EVENT_LOGICAL_LEADER_SEND_DATA
> > c) WAIT_EVENT_LOGICAL_LEADER_APPLY_SEND_DATA
>
> Personally I'm fine even without "LEADER" in the wait event name since
> we don't have "who is waiting" in it. IIUC a row of pg_stat_activity
> shows who, and the wait event name shows "what the process is
> waiting". So I prefer (a).
>

This logic makes sense to me. So, let's go with (a).

-- 
With Regards,
Amit Kapila.



RE: Perform streaming logical transactions by background workers and parallel apply

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, February 15, 2023 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Feb 14, 2023 at 7:45 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Tue, Feb 14, 2023 at 3:58 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > > On Fri, Feb 10, 2023 at 8:56 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > > > >
> > > > > My first impression was the
> > > > > WAIT_EVENT_LOGICAL_PARALLEL_APPLY_SEND_DATA name seemed
> > > > > misleading because that makes it sound like the parallel apply
> > > > > worker is doing the sending, but IIUC it's really the opposite.
> > > > >
> > > >
> > > > So, how about WAIT_EVENT_LOGICAL_APPLY_SEND_DATA?
> > > >
> > >
> > > Yes, IIUC all the LR events are named WAIT_EVENT_LOGICAL_xxx.
> > >
> > > So names like the below seem correct format:
> > >
> > > a) WAIT_EVENT_LOGICAL_APPLY_SEND_DATA
> > > b) WAIT_EVENT_LOGICAL_LEADER_SEND_DATA
> > > c) WAIT_EVENT_LOGICAL_LEADER_APPLY_SEND_DATA
> >
> > Personally I'm fine even without "LEADER" in the wait event name since
> > we don't have "who is waiting" in it. IIUC a row of pg_stat_activity
> > shows who, and the wait event name shows "what the process is
> > waiting". So I prefer (a).
> >
> 
> This logic makes sense to me. So, let's go with (a).

OK, here is a patch that changes the event name to WAIT_EVENT_LOGICAL_APPLY_SEND_DATA.

Best Regards,
Hou zj

Attachment
On Wed, Feb 15, 2023 at 8:55 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, February 15, 2023 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > So names like the below seem correct format:
> > > >
> > > > a) WAIT_EVENT_LOGICAL_APPLY_SEND_DATA
> > > > b) WAIT_EVENT_LOGICAL_LEADER_SEND_DATA
> > > > c) WAIT_EVENT_LOGICAL_LEADER_APPLY_SEND_DATA
> > >
> > > Personally I'm fine even without "LEADER" in the wait event name since
> > > we don't have "who is waiting" in it. IIUC a row of pg_stat_activity
> > > shows who, and the wait event name shows "what the process is
> > > waiting". So I prefer (a).
> > >
> >
> > This logic makes sense to me. So, let's go with (a).
>
> OK, here is patch that change the event name to WAIT_EVENT_LOGICAL_APPLY_SEND_DATA.
>

LGTM.

-- 
With Regards,
Amit Kapila.



LGTM. My only comment is about the commit message.

======
Commit message

d9d7fe6 reuse existing wait event when sending data in apply worker. But we
should have invent a new wait state if we are waiting at a new place, so fix
this.

~

SUGGESTION
d9d7fe6 made use of an existing wait event when sending data from the apply
worker, but we should have invented a new wait state since the code was
waiting at a new place.

This patch corrects the mistake by using a new wait state
"LogicalApplySendData".

------
Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, Jan 9, 2023 at 5:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jan 8, 2023 at 11:32 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Sunday, January 8, 2023 11:59 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > > Attach the updated patch set.
> >
> > Sorry, the commit message of 0001 was accidentally deleted, just attach
> > the same patch set again with commit message.
> >
>
> Pushed the first (0001) patch.

While looking at the worker.c, I realized that we have the following
code in handle_streamed_transaction():

        default:
            Assert(false);
            return false;       /* silence compiler warning */

I think it's better to do elog(ERROR) instead of Assert() as it ends
up returning false in non-assertion builds, which might cause a
problem. And it's more consistent with other codes in worker.c. Please
find an attached patch.
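
For reference, a minimal sketch of the proposed shape of that default case
(assuming the switch variable is named "action" as elsewhere in worker.c; the
exact message text in the attached patch may differ):

        default:
            elog(ERROR, "unexpected streamed message type: %d", action);

Since elog(ERROR) does not return, the bogus "return false" (and its effect in
non-assertion builds) goes away.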

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Perform streaming logical transactions by background workers and parallel apply

From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 10:55:44 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> While looking at the worker.c, I realized that we have the following
> code in handle_streamed_transaction():
> 
>         default:
>             Assert(false);
>             return false;       /* silence compiler warning */
> 
> I think it's better to do elog(ERROR) instead of Assert() as it ends
> up returning false in non-assertion builds, which might cause a
> problem. And it's more consistent with other codes in worker.c. Please
> find an attached patch.

I concur that returning false is problematic.

For assertion builds, Assert typically provides more detailed
information than elog. However, in this case, it wouldn't matter much
since the worker would repeatedly restart even after a server-restart
for the same reason unless cosmic rays are involved. Moreover, the
situation doesn't justify server-restaring, as it would unnecessarily
involve other backends.

In my opinion, it is fine to replace the Assert with an ERROR.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Perform streaming logical transactions by background workers and parallel apply

From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 11:50:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I concur that returning false is problematic.
> 
> For assertion builds, Assert typically provides more detailed
> information than elog. However, in this case, it wouldn't matter much
> since the worker would repeatedly restart even after a server-restart
> for the same reason unless cosmic rays are involved. Moreover, the

> situation doesn't justify server-restaring, as it would unnecessarily
> involve other backends.

Please disregard this part, as it's not relevant to non-assertion builds.

> In my opinion, it is fine to replace the Assert with an ERROR.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Perform streaming logical transactions by background workers and parallel apply

From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 11:50:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> In my opinion, it is fine to replace the Assert with an ERROR.

Sorry for posting multiple times in a row, but I'm a bit uncertain
whether we should use FATAL or ERROR for this situation. The stream is
not provided by the user, and the session or process cannot continue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Mon, Apr 24, 2023 at 8:40 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 24 Apr 2023 11:50:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > In my opinion, it is fine to replace the Assert with an ERROR.
>
> Sorry for posting multiple times in a row, but I'm a bit unceratin
> whether we should use FATAL or ERROR for this situation. The stream is
> not provided by user, and the session or process cannot continue.
>

I think ERROR should be fine here similar to other cases in worker.c.

--
With Regards,
Amit Kapila.



Re: Perform streaming logical transactions by background workers and parallel apply

From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 08:59:07 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> > Sorry for posting multiple times in a row, but I'm a bit unceratin
> > whether we should use FATAL or ERROR for this situation. The stream is
> > not provided by user, and the session or process cannot continue.
> >
> 
> I think ERROR should be fine here similar to other cases in worker.c.

Sure, I don't have any issues with it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Mon, Apr 24, 2023 at 7:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> While looking at the worker.c, I realized that we have the following
> code in handle_streamed_transaction():
>
>         default:
>             Assert(false);
>             return false;       /* silence compiler warning */
>
> I think it's better to do elog(ERROR) instead of Assert() as it ends
> up returning false in non-assertion builds, which might cause a
> problem. And it's more consistent with other codes in worker.c. Please
> find an attached patch.
>

I haven't tested it but otherwise, the changes look good to me.

--
With Regards,
Amit Kapila.



On Mon, Apr 24, 2023 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 24, 2023 at 7:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > While looking at the worker.c, I realized that we have the following
> > code in handle_streamed_transaction():
> >
> >         default:
> >             Assert(false);
> >             return false;       /* silence compiler warning */
> >
> > I think it's better to do elog(ERROR) instead of Assert() as it ends
> > up returning false in non-assertion builds, which might cause a
> > problem. And it's more consistent with other codes in worker.c. Please
> > find an attached patch.
> >
>
> I haven't tested it but otherwise, the changes look good to me.

Thanks for checking! Pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Hello hackers,

Please look at a new anomaly that can be observed starting from 216a7848.

The following script:
echo "CREATE SUBSCRIPTION testsub CONNECTION 'dbname=nodb' PUBLICATION testpub WITH (connect = false);
ALTER SUBSCRIPTION testsub ENABLE;" | psql

sleep 1
rm $PGINST/lib/libpqwalreceiver.so
sleep 15
pg_ctl -D "$PGDB" stop -m immediate
grep 'TRAP:' server.log

Leads to multiple assertion failures:
CREATE SUBSCRIPTION
ALTER SUBSCRIPTION
waiting for server to shut down.... done
server stopped
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899323
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899416
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899427
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899439
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899538
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899547

server.log contains:
2023-04-26 11:00:58.797 MSK [2899300] LOG:  database system is ready to accept connections
2023-04-26 11:00:58.821 MSK [2899416] ERROR:  could not access file "libpqwalreceiver": No such file or directory
TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c", Line: 4439, PID: 2899416
postgres: logical replication apply worker for subscription 16385 (ExceptionalCondition+0x69)[0x558b2ac06d41]
postgres: logical replication apply worker for subscription 16385 (VirtualXactLockTableCleanup+0xa4)[0x558b2aa9fd74]
postgres: logical replication apply worker for subscription 16385 (LockReleaseAll+0xbb)[0x558b2aa9fe7d]
postgres: logical replication apply worker for subscription 16385 (+0x4588c6)[0x558b2aa2a8c6]
postgres: logical replication apply worker for subscription 16385 (shmem_exit+0x6c)[0x558b2aa87eb1]
postgres: logical replication apply worker for subscription 16385 (+0x4b5faa)[0x558b2aa87faa]
postgres: logical replication apply worker for subscription 16385 (proc_exit+0xc)[0x558b2aa88031]
postgres: logical replication apply worker for subscription 16385 (StartBackgroundWorker+0x147)[0x558b2aa0b4d9]
postgres: logical replication apply worker for subscription 16385 (+0x43fdc1)[0x558b2aa11dc1]
postgres: logical replication apply worker for subscription 16385 (+0x43ff3d)[0x558b2aa11f3d]
postgres: logical replication apply worker for subscription 16385 (+0x440866)[0x558b2aa12866]
postgres: logical replication apply worker for subscription 16385 (+0x440e12)[0x558b2aa12e12]
postgres: logical replication apply worker for subscription 16385
(BackgroundWorkerInitializeConnection+0x0)[0x558b2aa14396]
postgres: logical replication apply worker for subscription 16385 (main+0x21a)[0x558b2a932e21]

I understand that removing libpqwalreceiver.so (or the whole pginst/) is not
something that happens in a production environment every day, but nonetheless it's
a new failure mode and it can produce many core dumps when testing.

IIUC, that assert will fail in case of any error raised between
ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeConnectionByOid()->InitPostgres().
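
Roughly, the ordering looks like this (an abridged sketch of the call sequence
described above, not the actual source):

    /* ApplyWorkerMain(), heavily abridged */
    logicalrep_worker_attach(worker_slot);  /* registers logicalrep_worker_onexit()
                                             * via before_shmem_exit() */
    InitializeApplyWorker();                /* eventually reaches InitPostgres(),
                                             * which sets MyProc->backendId */

    /*
     * Any ERROR raised between these two calls (such as failing to load
     * libpqwalreceiver) runs the onexit callback, whose LockReleaseAll()
     * reaches VirtualXactLockTableCleanup() while MyProc->backendId is
     * still InvalidBackendId, hence the Assert failure.
     */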

Best regards,
Alexander



RE: Perform streaming logical transactions by background workers and parallel apply

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at a new anomaly that can be observed starting from 216a7848.
> 
> The following script:
> echo "CREATE SUBSCRIPTION testsub CONNECTION 'dbname=nodb'
> PUBLICATION testpub WITH (connect = false);
> ALTER SUBSCRIPTION testsub ENABLE;" | psql
> 
> sleep 1
> rm $PGINST/lib/libpqwalreceiver.so
> sleep 15
> pg_ctl -D "$PGDB" stop -m immediate
> grep 'TRAP:' server.log
> 
> Leads to multiple assertion failures:
> CREATE SUBSCRIPTION
> ALTER SUBSCRIPTION
> waiting for server to shut down.... done
> server stopped
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899323
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899416
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899427
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899439
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899538
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899547
> 
> server.log contains:
> 2023-04-26 11:00:58.797 MSK [2899300] LOG:  database system is ready to
> accept connections
> 2023-04-26 11:00:58.821 MSK [2899416] ERROR:  could not access file
> "libpqwalreceiver": No such file or directory
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899416
> postgres: logical replication apply worker for subscription 16385
> (ExceptionalCondition+0x69)[0x558b2ac06d41]
> postgres: logical replication apply worker for subscription 16385
> (VirtualXactLockTableCleanup+0xa4)[0x558b2aa9fd74]
> postgres: logical replication apply worker for subscription 16385
> (LockReleaseAll+0xbb)[0x558b2aa9fe7d]
> postgres: logical replication apply worker for subscription 16385
> (+0x4588c6)[0x558b2aa2a8c6]
> postgres: logical replication apply worker for subscription 16385
> (shmem_exit+0x6c)[0x558b2aa87eb1]
> postgres: logical replication apply worker for subscription 16385
> (+0x4b5faa)[0x558b2aa87faa]
> postgres: logical replication apply worker for subscription 16385
> (proc_exit+0xc)[0x558b2aa88031]
> postgres: logical replication apply worker for subscription 16385
> (StartBackgroundWorker+0x147)[0x558b2aa0b4d9]
> postgres: logical replication apply worker for subscription 16385
> (+0x43fdc1)[0x558b2aa11dc1]
> postgres: logical replication apply worker for subscription 16385
> (+0x43ff3d)[0x558b2aa11f3d]
> postgres: logical replication apply worker for subscription 16385
> (+0x440866)[0x558b2aa12866]
> postgres: logical replication apply worker for subscription 16385
> (+0x440e12)[0x558b2aa12e12]
> postgres: logical replication apply worker for subscription 16385
> (BackgroundWorkerInitializeConnection+0x0)[0x558b2aa14396]
> postgres: logical replication apply worker for subscription 16385
> (main+0x21a)[0x558b2a932e21]
> 
> I understand, that removing libpqwalreceiver.so (or whole pginst/) is not
> what happens in a production environment every day, but nonetheless it's a
> new failure mode and it can produce many coredumps when testing.
> 
> IIUC, that assert will fail in case of any error raised between
> ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> onnectionByOid()->InitPostgres().

Thanks for reporting the issue.

I think the problem is that it tried to release locks in
logicalrep_worker_onexit() before the initialization of the process was
complete, because this callback function is registered before the init phase.
So I think we can add a conditional statement before releasing the locks.
Please find the attached patch.
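
To be more specific, the idea is roughly the following in
logicalrep_worker_onexit() (a simplified sketch; the attached patch has the
actual change):

    /*
     * Release session-level locks.  Skip this if process initialization has
     * not completed, since no such locks can have been acquired yet and the
     * lock code asserts that we have a valid backend id.
     */
    if (IsNormalProcessingMode())
        LockReleaseAll(DEFAULT_LOCKMETHOD, true);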

Best Regards,
Hou zj


On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Thanks for reporting the issue.
>
> I think the problem is that it tried to release locks in
> logicalrep_worker_onexit() before the initialization of the process is complete
> because this callback function was registered before the init phase. So I think we
> can add a conditional statement before releasing locks. Please find an attached
> patch.
>

Yeah, this should work. Yet another possibility is to introduce a new
variable 'InitializingApplyWorker' similar to
'InitializingParallelWorker' and use that to prevent releasing locks.
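
Something along these lines (an untested sketch; the variable name and exact
placement are just an idea at this point):

    /* worker.c */
    bool        InitializingApplyWorker = false;

    /* in ApplyWorkerMain(), around process initialization */
    InitializingApplyWorker = true;
    InitializeApplyWorker();
    InitializingApplyWorker = false;

    /* in logicalrep_worker_onexit() */
    if (!InitializingApplyWorker)
        LockReleaseAll(DEFAULT_LOCKMETHOD, true);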

--
With Regards,
Amit Kapila.



On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > IIUC, that assert will fail in case of any error raised between
> > ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> > ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> > onnectionByOid()->InitPostgres().
>
> Thanks for reporting the issue.
>
> I think the problem is that it tried to release locks in
> logicalrep_worker_onexit() before the initialization of the process is complete
> because this callback function was registered before the init phase. So I think we
> can add a conditional statement before releasing locks. Please find an attached
> patch.
>

Alexander, does the proposed patch fix the problem you are facing?
Sawada-San, and others, do you see any better way to fix it than what
has been proposed?

--
With Regards,
Amit Kapila.



Hello Amit and Zhijie,

28.04.2023 05:51, Amit Kapila wrote:
> On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>> I think the problem is that it tried to release locks in
>> logicalrep_worker_onexit() before the initialization of the process is complete
>> because this callback function was registered before the init phase. So I think we
>> can add a conditional statement before releasing locks. Please find an attached
>> patch.
> Alexander, does the proposed patch fix the problem you are facing?
> Sawada-San, and others, do you see any better way to fix it than what
> has been proposed?

Yes, the patch definitely fixes it.
Maybe some other onexit actions could also be skipped in the non-normal mode,
but the assert-triggering LockReleaseAll() is not called now.

Thank you!

Best regards,
Alexander



On Fri, Apr 28, 2023 at 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> > >
> > > IIUC, that assert will fail in case of any error raised between
> > > ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> > > ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> > > onnectionByOid()->InitPostgres().
> >
> > Thanks for reporting the issue.
> >
> > I think the problem is that it tried to release locks in
> > logicalrep_worker_onexit() before the initialization of the process is complete
> > because this callback function was registered before the init phase. So I think we
> > can add a conditional statement before releasing locks. Please find an attached
> > patch.
> >
>
> Alexander, does the proposed patch fix the problem you are facing?
> Sawada-San, and others, do you see any better way to fix it than what
> has been proposed?

I'm concerned that the idea of relying on IsNormalProcessingMode()
might not be robust since if we change the meaning of
IsNormalProcessingMode() some day it would silently break again. So I
prefer using something like InitializingApplyWorker, or another idea
would be to do cleanup work (e.g., fileset deletion and lock release)
in a separate callback that is registered after connecting to the
database.
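
For the latter, I mean something like the following (just a sketch; the
callback name here is made up):

    /* after the worker has connected to the database */
    BackgroundWorkerInitializeConnectionByOid(MyLogicalRepWorker->dbid,
                                              MyLogicalRepWorker->userid, 0);

    /* only now register the cleanup that needs a fully initialized process */
    before_shmem_exit(logicalrep_worker_cleanup, (Datum) 0);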


While investigating this issue, I've reviewed the code around
callbacks and worker termination etc and I found a problem.

A parallel apply worker calls the before_shmem_exit callbacks in the
following order:

1. ShutdownPostgres()
2. logicalrep_worker_onexit()
3. pa_shutdown()

Since the worker is detached during logicalrep_worker_onexit(),
MyLogicalRepWorker->leader_pid is invalid when we call
pa_shutdown():

static void
pa_shutdown(int code, Datum arg)
{
    Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
    SendProcSignal(MyLogicalRepWorker->leader_pid,
                   PROCSIG_PARALLEL_APPLY_MESSAGE,
                   InvalidBackendId);

Also, if the parallel apply worker fails in shm_toc_lookup() during
initialization, it raises an error (because of noError = false) but ends up
with a SEGV as MyLogicalRepWorker is still NULL.

I think that we should not use MyLogicalRepWorker->leader_pid in
pa_shutdown() but instead store the leader's pid in a static variable
before registering the pa_shutdown() callback. And we could probably also
remember the backend id of the leader apply worker to speed up
SendProcSignal().
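
In code, that would look roughly like this (untested; the variable name is
just for illustration):

    static pid_t ParallelApplyLeaderPid = InvalidPid;

    /* the callback registration can stay where it is today */
    before_shmem_exit(pa_shutdown, (Datum) 0);
    ...
    /* once we have attached to the worker slot */
    ParallelApplyLeaderPid = MyLogicalRepWorker->leader_pid;

    static void
    pa_shutdown(int code, Datum arg)
    {
        /* only signal the leader if we got far enough to know who it is */
        if (ParallelApplyLeaderPid != InvalidPid)
            SendProcSignal(ParallelApplyLeaderPid,
                           PROCSIG_PARALLEL_APPLY_MESSAGE,
                           InvalidBackendId);
    }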

FWIW, we might need to be careful about the timing when we call
logicalrep_worker_detach() in the worker's termination process. Since
we rely on IsLogicalParallelApplyWorker() for the parallel apply
worker to send ERROR messages to the leader apply worker, if an ERROR
happens after logicalrep_worker_detach(), we will end up with the
assertion failure.

            if (IsLogicalParallelApplyWorker())
                SendProcSignal(pq_mq_parallel_leader_pid,
                               PROCSIG_PARALLEL_APPLY_MESSAGE,
                               pq_mq_parallel_leader_backend_id);
            else
            {
                Assert(IsParallelWorker());

It would normally be a should-not-happen case, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 28, 2023 at 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > >
> > > > IIUC, that assert will fail in case of any error raised between
> > > > ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> > > > ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> > > > onnectionByOid()->InitPostgres().
> > >
> > > Thanks for reporting the issue.
> > >
> > > I think the problem is that it tried to release locks in
> > > logicalrep_worker_onexit() before the initialization of the process is complete
> > > because this callback function was registered before the init phase. So I think we
> > > can add a conditional statement before releasing locks. Please find an attached
> > > patch.
> > >
> >
> > Alexander, does the proposed patch fix the problem you are facing?
> > Sawada-San, and others, do you see any better way to fix it than what
> > has been proposed?
>
> I'm concerned that the idea of relying on IsNormalProcessingMode()
> might not be robust since if we change the meaning of
> IsNormalProcessingMode() some day it would silently break again. So I
> prefer using something like InitializingApplyWorker,
>

I think if we change the meaning of IsNormalProcessingMode() then it
could also break the other places where a similar check is being used.
However, I am fine with InitializingApplyWorker as that could be used
at other places as well. I just wanted to avoid adding another variable
by using IsNormalProcessingMode().

> or another idea
> would be to do cleanup work (e.g., fileset deletion and lock release)
> in a separate callback that is registered after connecting to the
> database.
>

Yeah, but I'm not sure if it's worth having multiple callbacks for cleanup work.

--
With Regards,
Amit Kapila.



On Fri, Apr 28, 2023 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 28, 2023 at 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > > >
> > > > > IIUC, that assert will fail in case of any error raised between
> > > > > ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> > > > > ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> > > > > onnectionByOid()->InitPostgres().
> > > >
> > > > Thanks for reporting the issue.
> > > >
> > > > I think the problem is that it tried to release locks in
> > > > logicalrep_worker_onexit() before the initialization of the process is complete
> > > > because this callback function was registered before the init phase. So I think we
> > > > can add a conditional statement before releasing locks. Please find an attached
> > > > patch.
> > > >
> > >
> > > Alexander, does the proposed patch fix the problem you are facing?
> > > Sawada-San, and others, do you see any better way to fix it than what
> > > has been proposed?
> >
> > I'm concerned that the idea of relying on IsNormalProcessingMode()
> > might not be robust since if we change the meaning of
> > IsNormalProcessingMode() some day it would silently break again. So I
> > prefer using something like InitializingApplyWorker,
> >
>
> I think if we change the meaning of IsNormalProcessingMode() then it
> could also break the other places the similar check is being used.

Right, but I think the relationship between the processing modes and
releasing session locks is unclear. If non-normal processing mode means
we're still in the process initialization phase, why don't we skip other
cleanup work such as walrcv_disconnect() and FileSetDeleteAll()?

> However, I am fine with InitializingApplyWorker as that could be used
> at other places as well. I just want to avoid adding another variable
> by using IsNormalProcessingMode.

I think it's less confusing.

>
> > or another idea
> > would be to do cleanup work (e.g., fileset deletion and lock release)
> > in a separate callback that is registered after connecting to the
> > database.
> >
>
> Yeah, but not sure if it's worth having multiple callbacks for cleanup work.

Fair point.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> While investigating this issue, I've reviewed the code around
> callbacks and worker termination etc and I found a problem.
>
> A parallel apply worker calls the before_shmem_exit callbacks in the
> following order:
>
> 1. ShutdownPostgres()
> 2. logicalrep_worker_onexit()
> 3. pa_shutdown()
>
> Since the worker is detached during logicalrep_worker_onexit(),
> MyLogicalReplication->leader_pid is an invalid when we call
> pa_shutdown():
>
> static void
> pa_shutdown(int code, Datum arg)
> {
>     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
>     SendProcSignal(MyLogicalRepWorker->leader_pid,
>                    PROCSIG_PARALLEL_APPLY_MESSAGE,
>                    InvalidBackendId);
>
> Also, if the parallel apply worker fails shm_toc_lookup() during the
> initialization, it raises an error (because of noError = false) but
> ends up a SEGV as MyLogicalRepWorker is still NULL.
>
> I think that we should not use MyLogicalRepWorker->leader_pid in
> pa_shutdown() but instead store the leader's pid to a static variable
> before registering pa_shutdown() callback.
>

Why not simply move the registration of pa_shutdown() to someplace
after logicalrep_worker_attach()? BTW, it seems we don't have access
to MyLogicalRepWorker->leader_pid till we attach to the worker slot
via logicalrep_worker_attach(), so we anyway need to do what you are
suggesting after attaching to the worker slot.
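
I.e., roughly this ordering in the parallel apply worker's startup (sketch):

    logicalrep_worker_attach(worker_slot);

    /*
     * Register the shutdown callback only after attaching to the worker
     * slot, so that MyLogicalRepWorker->leader_pid is already valid by the
     * time pa_shutdown() can run.
     */
    before_shmem_exit(pa_shutdown, (Datum) 0);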

--
With Regards,
Amit Kapila.



On Friday, April 28, 2023 2:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Fri, Apr 28, 2023 at 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 26, 2023 at 4:11 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin
> <exclusion@gmail.com> wrote:
> > > >
> > > > IIUC, that assert will fail in case of any error raised between
> > > >
> ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> > > >
> ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> > > > onnectionByOid()->InitPostgres().
> > >
> > > Thanks for reporting the issue.
> > >
> > > I think the problem is that it tried to release locks in
> > > logicalrep_worker_onexit() before the initialization of the process is
> complete
> > > because this callback function was registered before the init phase. So I
> think we
> > > can add a conditional statement before releasing locks. Please find an
> attached
> > > patch.
> > >
> >
> > Alexander, does the proposed patch fix the problem you are facing?
> > Sawada-San, and others, do you see any better way to fix it than what
> > has been proposed?
> 
> I'm concerned that the idea of relying on IsNormalProcessingMode()
> might not be robust since if we change the meaning of
> IsNormalProcessingMode() some day it would silently break again. So I
> prefer using something like InitializingApplyWorker, or another idea
> would be to do cleanup work (e.g., fileset deletion and lock release)
> in a separate callback that is registered after connecting to the
> database.

Thanks for the review. I agree that it’s better to use a new variable here.
Attached is the patch for the same.


> 
> FWIW, we might need to be careful about the timing when we call
> logicalrep_worker_detach() in the worker's termination process. Since
> we rely on IsLogicalParallelApplyWorker() for the parallel apply
> worker to send ERROR messages to the leader apply worker, if an ERROR
> happens after logicalrep_worker_detach(), we will end up with the
> assertion failure.
> 
>             if (IsLogicalParallelApplyWorker())
>                 SendProcSignal(pq_mq_parallel_leader_pid,
>                                PROCSIG_PARALLEL_APPLY_MESSAGE,
>                                pq_mq_parallel_leader_backend_id);
>             else
>             {
>                 Assert(IsParallelWorker());
>
> It normally would be a should-no-happen case, though.

Yes, I think the parallel apply worker currently sends the ERROR message
before exiting, so the callback functions are always fired after the above
code, which looks fine to me.

Best Regards,
Hou zj

On Tue, May 2, 2023 at 9:06 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, April 28, 2023 2:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > Alexander, does the proposed patch fix the problem you are facing?
> > > Sawada-San, and others, do you see any better way to fix it than what
> > > has been proposed?
> >
> > I'm concerned that the idea of relying on IsNormalProcessingMode()
> > might not be robust since if we change the meaning of
> > IsNormalProcessingMode() some day it would silently break again. So I
> > prefer using something like InitializingApplyWorker, or another idea
> > would be to do cleanup work (e.g., fileset deletion and lock release)
> > in a separate callback that is registered after connecting to the
> > database.
>
> Thanks for the review. I agree that it’s better to use a new variable here.
> Attach the patch for the same.
>

+ *
+ * However, if the worker is being initialized, there is no need to release
+ * locks.
  */
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ if (!InitializingApplyWorker)
+ LockReleaseAll(DEFAULT_LOCKMETHOD, true);

Can we slightly reword this comment as: "The locks will be acquired
once the worker is initialized."?

--
With Regards,
Amit Kapila.



On Tue, May 2, 2023 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 2, 2023 at 9:06 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, April 28, 2023 2:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > >
> > > > Alexander, does the proposed patch fix the problem you are facing?
> > > > Sawada-San, and others, do you see any better way to fix it than what
> > > > has been proposed?
> > >
> > > I'm concerned that the idea of relying on IsNormalProcessingMode()
> > > might not be robust since if we change the meaning of
> > > IsNormalProcessingMode() some day it would silently break again. So I
> > > prefer using something like InitializingApplyWorker, or another idea
> > > would be to do cleanup work (e.g., fileset deletion and lock release)
> > > in a separate callback that is registered after connecting to the
> > > database.
> >
> > Thanks for the review. I agree that it’s better to use a new variable here.
> > Attach the patch for the same.
> >
>
> + *
> + * However, if the worker is being initialized, there is no need to release
> + * locks.
>   */
> - LockReleaseAll(DEFAULT_LOCKMETHOD, true);
> + if (!InitializingApplyWorker)
> + LockReleaseAll(DEFAULT_LOCKMETHOD, true);
>
> Can we slightly reword this comment as: "The locks will be acquired
> once the worker is initialized."?
>

After making this modification, I pushed your patch. Thanks!

--
With Regards,
Amit Kapila.



On Wednesday, May 3, 2023 3:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, May 2, 2023 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, May 2, 2023 at 9:06 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Friday, April 28, 2023 2:18 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > >
> > > > > Alexander, does the proposed patch fix the problem you are facing?
> > > > > Sawada-San, and others, do you see any better way to fix it than
> > > > > what has been proposed?
> > > >
> > > > I'm concerned that the idea of relying on IsNormalProcessingMode()
> > > > might not be robust since if we change the meaning of
> > > > IsNormalProcessingMode() some day it would silently break again.
> > > > So I prefer using something like InitializingApplyWorker, or
> > > > another idea would be to do cleanup work (e.g., fileset deletion
> > > > and lock release) in a separate callback that is registered after
> > > > connecting to the database.
> > >
> > > Thanks for the review. I agree that it’s better to use a new variable here.
> > > Attach the patch for the same.
> > >
> >
> > + *
> > + * However, if the worker is being initialized, there is no need to
> > + release
> > + * locks.
> >   */
> > - LockReleaseAll(DEFAULT_LOCKMETHOD, true);
> > + if (!InitializingApplyWorker)
> > + LockReleaseAll(DEFAULT_LOCKMETHOD, true);
> >
> > Can we slightly reword this comment as: "The locks will be acquired
> > once the worker is initialized."?
> >
> 
> After making this modification, I pushed your patch. Thanks!

Thanks for pushing.

Attached is another patch to fix the problem that pa_shutdown will access an
invalid MyLogicalRepWorker. I personally want to avoid introducing a new static
variable, so I only reorder the callback registration in this version.

When testing this, I noticed a rare case in which the leader can receive the
worker termination message after it has already stopped the parallel worker.
This is unnecessary and carries the risk that the leader would try to access the
detached memory queue. This is more likely to happen, and sometimes causes
failures in the regression tests, after the registration reorder patch because
the dsm is detached earlier once that patch is applied.

So, I have put the patch that detaches the error queue before stopping the
worker as 0001 and the registration reorder patch as 0002.

Best Regards,
Hou zj





On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > While investigating this issue, I've reviewed the code around
> > callbacks and worker termination etc and I found a problem.
> >
> > A parallel apply worker calls the before_shmem_exit callbacks in the
> > following order:
> >
> > 1. ShutdownPostgres()
> > 2. logicalrep_worker_onexit()
> > 3. pa_shutdown()
> >
> > Since the worker is detached during logicalrep_worker_onexit(),
> > MyLogicalReplication->leader_pid is an invalid when we call
> > pa_shutdown():
> >
> > static void
> > pa_shutdown(int code, Datum arg)
> > {
> >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> >                    InvalidBackendId);
> >
> > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > initialization, it raises an error (because of noError = false) but
> > ends up a SEGV as MyLogicalRepWorker is still NULL.
> >
> > I think that we should not use MyLogicalRepWorker->leader_pid in
> > pa_shutdown() but instead store the leader's pid to a static variable
> > before registering pa_shutdown() callback.
> >
>
> Why not simply move the registration of pa_shutdown() to someplace
> after logicalrep_worker_attach()?

If we do that, the worker won't call dsm_detach() if it raises an
ERROR in logicalrep_worker_attach(); is that okay? It seems that it's
not a practical problem since we call dsm_backend_shutdown() in
shmem_exit(), but if so, why do we need to call it in pa_shutdown()?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Monday, May 8, 2023 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com>

Hi,

> 
> On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > While investigating this issue, I've reviewed the code around
> > > callbacks and worker termination etc and I found a problem.
> > >
> > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > following order:
> > >
> > > 1. ShutdownPostgres()
> > > 2. logicalrep_worker_onexit()
> > > 3. pa_shutdown()
> > >
> > > Since the worker is detached during logicalrep_worker_onexit(),
> > > MyLogicalReplication->leader_pid is an invalid when we call
> > > pa_shutdown():
> > >
> > > static void
> > > pa_shutdown(int code, Datum arg)
> > > {
> > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > >                    InvalidBackendId);
> > >
> > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > initialization, it raises an error (because of noError = false) but
> > > ends up a SEGV as MyLogicalRepWorker is still NULL.
> > >
> > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > pa_shutdown() but instead store the leader's pid to a static variable
> > > before registering pa_shutdown() callback.
> > >
> >
> > Why not simply move the registration of pa_shutdown() to someplace
> > after logicalrep_worker_attach()?
> 
> If we do that, the worker won't call dsm_detach() if it raises an
> ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> no practically problem since we call dsm_backend_shutdown() in
> shmem_exit(), but if so why do we need to call it in pa_shutdown()?

I think the dsm_detach in pa_shutdown was intended to fire the on_dsm_detach
callbacks, to give the callbacks a chance to report stats before the stats
system is shut down, following what we do in ParallelWorkerShutdown() (e.g.
sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
need to fire that earlier).

But for parallel apply, we currently only have one on_dsm_detach
callback (shm_mq_detach_callback), which doesn't report extra stats. So the
dsm_detach in pa_shutdown is only there to make it a bit future-proof, in case
we add some other on_dsm_detach callbacks in the future which need to report
stats.

Best regards,
Hou zj


On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, May 8, 2023 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com>
>
> Hi,
>
> >
> > On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > While investigating this issue, I've reviewed the code around
> > > > callbacks and worker termination etc and I found a problem.
> > > >
> > > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > > following order:
> > > >
> > > > 1. ShutdownPostgres()
> > > > 2. logicalrep_worker_onexit()
> > > > 3. pa_shutdown()
> > > >
> > > > Since the worker is detached during logicalrep_worker_onexit(),
> > > > MyLogicalReplication->leader_pid is an invalid when we call
> > > > pa_shutdown():
> > > >
> > > > static void
> > > > pa_shutdown(int code, Datum arg)
> > > > {
> > > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > > >                    InvalidBackendId);
> > > >
> > > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > > initialization, it raises an error (because of noError = false) but
> > > > ends up a SEGV as MyLogicalRepWorker is still NULL.
> > > >
> > > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > > pa_shutdown() but instead store the leader's pid to a static variable
> > > > before registering pa_shutdown() callback.
> > > >
> > >
> > > Why not simply move the registration of pa_shutdown() to someplace
> > > after logicalrep_worker_attach()?
> >
> > If we do that, the worker won't call dsm_detach() if it raises an
> > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> > no practically problem since we call dsm_backend_shutdown() in
> > shmem_exit(), but if so why do we need to call it in pa_shutdown()?
>
> I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach
> callbacks to give callback a chance to report stat before the stat system is
> shutdown, following what we do in ParallelWorkerShutdown() (e.g.
> sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
> need to fire that earlier).
>
> But for parallel apply, we currently only have one on_dsm_detach
> callback(shm_mq_detach_callback) which doesn't report extra stats. So the
> dsm_detach in pa_shutdown is only used to make it a bit future-proof in case
> we add some other on_dsm_detach callbacks in the future which need to report
> stats.

Makes sense. Given that it's possible that we add other callbacks that
report stats in the future, I think it's better not to move the
position where the pa_shutdown() callback is registered.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, May 8, 2023 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, May 8, 2023 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com>
> >
> > Hi,
> >
> > >
> > > On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > While investigating this issue, I've reviewed the code around
> > > > > callbacks and worker termination etc and I found a problem.
> > > > >
> > > > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > > > following order:
> > > > >
> > > > > 1. ShutdownPostgres()
> > > > > 2. logicalrep_worker_onexit()
> > > > > 3. pa_shutdown()
> > > > >
> > > > > Since the worker is detached during logicalrep_worker_onexit(),
> > > > > MyLogicalReplication->leader_pid is an invalid when we call
> > > > > pa_shutdown():
> > > > >
> > > > > static void
> > > > > pa_shutdown(int code, Datum arg)
> > > > > {
> > > > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > > > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > > > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > > > >                    InvalidBackendId);
> > > > >
> > > > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > > > initialization, it raises an error (because of noError = false) but
> > > > > ends up a SEGV as MyLogicalRepWorker is still NULL.
> > > > >
> > > > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > > > pa_shutdown() but instead store the leader's pid to a static variable
> > > > > before registering pa_shutdown() callback.
> > > > >
> > > >
> > > > Why not simply move the registration of pa_shutdown() to someplace
> > > > after logicalrep_worker_attach()?
> > >
> > > If we do that, the worker won't call dsm_detach() if it raises an
> > > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> > > no practically problem since we call dsm_backend_shutdown() in
> > > shmem_exit(), but if so why do we need to call it in pa_shutdown()?
> >
> > I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach
> > callbacks to give callback a chance to report stat before the stat system is
> > shutdown, following what we do in ParallelWorkerShutdown() (e.g.
> > sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
> > need to fire that earlier).
> >
> > But for parallel apply, we currently only have one on_dsm_detach
> > callback(shm_mq_detach_callback) which doesn't report extra stats. So the
> > dsm_detach in pa_shutdown is only used to make it a bit future-proof in case
> > we add some other on_dsm_detach callbacks in the future which need to report
> > stats.
>
> Make sense . Given that it's possible that we add other callbacks that
> report stats in the future, I think it's better not to move the
> position to register pa_shutdown() callback.
>

Hmm, what kind of stats do we expect to be collected before we
register pa_shutdown? I think, if required, we can register such a
callback after pa_shutdown. I feel that without reordering the callbacks,
the fix would be a bit complicated, as explained in my previous email,
so I don't think it is worth complicating this code unless really
required.

--
With Regards,
Amit Kapila.



On Fri, May 5, 2023 at 9:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, May 3, 2023 3:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> Attach another patch to fix the problem that pa_shutdown will access invalid
> MyLogicalRepWorker. I personally want to avoid introducing new static variable,
> so I only reorder the callback registration in this version.
>
> When testing this, I notice a rare case that the leader is possible to receive
> the worker termination message after the leader stops the parallel worker. This
> is unnecessary and have a risk that the leader would try to access the detached
> memory queue. This is more likely to happen and sometimes cause the failure in
> regression tests after the registration reorder patch because the dsm is
> detached earlier after applying the patch.
>

I think it is only possible for the leader apply worker to try to
receive the error message from an error queue after your 0002 patch,
because the other place already detaches from the queue before stopping
the parallel apply workers. So, I combined both the patches and
changed a few comments and the commit message. Let me know what you
think of the attached.

--
With Regards,
Amit Kapila.

On Mon, May 8, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 8, 2023 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Monday, May 8, 2023 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > >
> > > Hi,
> > >
> > > >
> > > > On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada
> > > > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > While investigating this issue, I've reviewed the code around
> > > > > > callbacks and worker termination etc and I found a problem.
> > > > > >
> > > > > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > > > > following order:
> > > > > >
> > > > > > 1. ShutdownPostgres()
> > > > > > 2. logicalrep_worker_onexit()
> > > > > > 3. pa_shutdown()
> > > > > >
> > > > > > Since the worker is detached during logicalrep_worker_onexit(),
> > > > > > MyLogicalReplication->leader_pid is an invalid when we call
> > > > > > pa_shutdown():
> > > > > >
> > > > > > static void
> > > > > > pa_shutdown(int code, Datum arg)
> > > > > > {
> > > > > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > > > > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > > > > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > > > > >                    InvalidBackendId);
> > > > > >
> > > > > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > > > > initialization, it raises an error (because of noError = false) but
> > > > > > ends up a SEGV as MyLogicalRepWorker is still NULL.
> > > > > >
> > > > > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > > > > pa_shutdown() but instead store the leader's pid to a static variable
> > > > > > before registering pa_shutdown() callback.
> > > > > >
> > > > >
> > > > > Why not simply move the registration of pa_shutdown() to someplace
> > > > > after logicalrep_worker_attach()?
> > > >
> > > > If we do that, the worker won't call dsm_detach() if it raises an
> > > > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> > > > no practically problem since we call dsm_backend_shutdown() in
> > > > shmem_exit(), but if so why do we need to call it in pa_shutdown()?
> > >
> > > I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach
> > > callbacks to give callback a chance to report stat before the stat system is
> > > shutdown, following what we do in ParallelWorkerShutdown() (e.g.
> > > sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
> > > need to fire that earlier).
> > >
> > > But for parallel apply, we currently only have one on_dsm_detach
> > > callback(shm_mq_detach_callback) which doesn't report extra stats. So the
> > > dsm_detach in pa_shutdown is only used to make it a bit future-proof in case
> > > we add some other on_dsm_detach callbacks in the future which need to report
> > > stats.
> >
> > Make sense . Given that it's possible that we add other callbacks that
> > report stats in the future, I think it's better not to move the
> > position to register pa_shutdown() callback.
> >
>
> Hmm, what kind of stats do we expect to be collected before we
> register pa_shutdown? I think if required we can register such a
> callback after pa_shutdown. I feel without reordering the callbacks,
> the fix would be a bit complicated as explained in my previous email,
> so I don't think it is worth complicating this code unless really
> required.

Fair point. I agree that the issue can be resolved by carefully
ordering the callback registration.

Another thing I'm concerned about is that since both the leader worker
and the parallel worker detach the DSM before logicalrep_worker_onexit(),
cleanup work that touches the DSM cannot be done in
logicalrep_worker_onexit(). If we need to do something there in the future,
we would need another callback that is called before detaching the DSM.
But I'm fine with it as it's not a problem for now.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, May 8, 2023 at 8:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 5, 2023 at 9:14 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, May 3, 2023 3:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > Attach another patch to fix the problem that pa_shutdown will access invalid
> > MyLogicalRepWorker. I personally want to avoid introducing new static variable,
> > so I only reorder the callback registration in this version.
> >
> > When testing this, I notice a rare case that the leader is possible to receive
> > the worker termination message after the leader stops the parallel worker. This
> > is unnecessary and have a risk that the leader would try to access the detached
> > memory queue. This is more likely to happen and sometimes cause the failure in
> > regression tests after the registration reorder patch because the dsm is
> > detached earlier after applying the patch.
> >
>
> I think it is only possible for the leader apply can worker to try to
> receive the error message from an error queue after your 0002 patch.
> Because another place already detached from the queue before stopping
> the parallel apply workers. So, I combined both the patches and
> changed a few comments and a commit message. Let me know what you
> think of the attached.

I have one comment on the detaching error queue part:

+       /*
+        * Detach from the error_mq_handle for the parallel apply worker before
+        * stopping it. This prevents the leader apply worker from trying to
+        * receive the message from the error queue that might already
be detached
+        * by the parallel apply worker.
+        */
+       shm_mq_detach(winfo->error_mq_handle);
+       winfo->error_mq_handle = NULL;

In pa_detach_all_error_mq(), we try to detach the error queues of all
workers in the pool. I think we should check there whether the queue is
already detached (i.e., is NULL). Otherwise, we will end up with a SEGV
if an error happens after detaching the error queue and before removing
the worker from the pool.
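
Something like this (a sketch; I may be misremembering the exact loop, but
the point is the NULL check):

    /* pa_detach_all_error_mq() */
    foreach(lc, ParallelApplyWorkerPool)
    {
        ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

        if (winfo->error_mq_handle != NULL)
        {
            shm_mq_detach(winfo->error_mq_handle);
            winfo->error_mq_handle = NULL;
        }
    }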

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, May 9, 2023 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 8, 2023 at 8:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think it is only possible for the leader apply can worker to try to
> > receive the error message from an error queue after your 0002 patch.
> > Because another place already detached from the queue before stopping
> > the parallel apply workers. So, I combined both the patches and
> > changed a few comments and a commit message. Let me know what you
> > think of the attached.
>
> I have one comment on the detaching error queue part:
>
> +       /*
> +        * Detach from the error_mq_handle for the parallel apply worker before
> +        * stopping it. This prevents the leader apply worker from trying to
> +        * receive the message from the error queue that might already
> be detached
> +        * by the parallel apply worker.
> +        */
> +       shm_mq_detach(winfo->error_mq_handle);
> +       winfo->error_mq_handle = NULL;
>
> In pa_detach_all_error_mq(), we try to detach error queues of all
> workers in the pool. I think we should check if the queue is already
> detached (i.e. is NULL) there. Otherwise, we will end up a SEGV if an
> error happens after detaching the error queue and before removing the
> worker from the pool.
>

Agreed, I have made this change, added the same check at one other
place for the sake of consistency, and pushed the patch.

--
With Regards,
Amit Kapila.