Re: Perform streaming logical transactions by background workers and parallel apply - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Perform streaming logical transactions by background workers and parallel apply
Date
Msg-id CAHut+Ps0HXawMD=zQ5YUncc9kjGy+md_39Y4Fdf=sKjt-LE92g@mail.gmail.com
Whole thread Raw
In response to RE: Perform streaming logical transactions by background workers and parallel apply  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses Re: Perform streaming logical transactions by background workers and parallel apply  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi, here are my review comments for the patch v39-0001

======

src/backend/libpq/pqmq.c

1. mq_putmessage

+ if (IsParallelWorker())
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ else
+ {
+ Assert(IsLogicalParallelApplyWorker());
+ SendProcSignal(pq_mq_parallel_leader_pid,
+    PROCSIG_PARALLEL_APPLY_MESSAGE,
+    pq_mq_parallel_leader_backend_id);
+ }

The generically named macro (IsParallelWorker) makes it seem like a
parallel apply worker is NOT a kind of parallel worker (e.g. it is in
the 'else'), which seems odd. But I am not sure what you can do to
improve this... e.g. reversing the if/else might look logically saner,
but might also be less efficient for the IsParallelWorker case (??)

======

.../replication/logical/applyparallelworker.c

2. LogicalParallelApplyLoop

+ /* Ensure we are reading the data into our memory context. */
+ (void) MemoryContextSwitchTo(ApplyMessageContext);

Why did you use the (void) cast for this MemoryContextSwitchTo but not
for the next one later in the same function?

~~~

3.

+ if (len == 0)
+ break;

As mentioned in my previous review ([1] #3), we are still in the
ApplyMessageContext here. Shouldn't the code be switching to the
previous context before escaping from the loop?

~~~

4.

+ switch (shmq_res)
+ {
+ case SHM_MQ_SUCCESS:
+ {
+ StringInfoData s;
+ int c;
+
+ if (len == 0)
+ break;

I think this introduces a subtle bug.

IIUC the intent of the "break" when len == 0 is to escape from the
loop. But now, this will only break from the switch case. So, it looks
like you need some kind of loop "done" flag, or maybe have to revert
back to using if/else to fix this.

~~~

5.

+ /*
+ * The first byte of message for additional communication between
+ * leader apply worker and parallel apply workers can only be 'w'.
+ */
+ c = pq_getmsgbyte(&s);

Why does it refer to "additional communication"? Isn’t it enough just
to say something like below:

SUGGESTION
The first byte of messages sent from leader apply worker to parallel
apply workers can only be 'w'.

~~~

src/backend/replication/logical/worker.c

6. apply_handle_stream_start

+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether that is worth the effort because
+ * these messages are sent after logical_decoding_work_mem changes.
  */
 static void
 apply_handle_stream_start(StringInfo s)


I don't know what the "changes" part means. IIUC, the meaning of the
last sentence is like below:

SUGGESTION
However, it is not clear whether any optimization is worthwhile
because these messages are sent only when the
logical_decoding_work_mem threshold is exceeded.

~~~

7. get_transaction_apply_action

> 12. get_transaction_apply_action
>
> I still felt like there should be some tablesync checks/comments in
> this function, just for sanity, even if it works as-is now.
>
> For example, are you saying ([3] #22b) that there might be rare cases
> where a Tablesync would call to parallel_apply_find_worker? That seems
> strange, given that "for streaming transactions that are being applied
> in the parallel ... we disallow applying changes on a table that is
> not in the READY state".
>
> ------

Houz wrote [2] -

I think because we won't try to start parallel apply worker in table sync
worker(see the check in parallel_apply_can_start()), so we won't find any
worker in parallel_apply_find_worker() which means get_transaction_apply_action
will return TRANS_LEADER_SERIALIZE. And get_transaction_apply_action is a
function which can be invoked for all kinds of workers(same is true for all
apply_handle_xxx functions), so not sure if table sync check/comment is
necessary.

~

Sure, and I believe you when you say it all works OK - but IMO there
is something still not quite right with this current code. For
example,

e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
that we are in the leader apply worker" (except we are not)

e.g.2 we know for a fact that Tablesync workers cannot start their own
parallel apply workers, so then why do we even let the Tablesync
worker make a call to parallel_apply_find_worker() looking for
something we know will not be found?

------
[1] My review of v38-0001 -
https://www.postgresql.org/message-id/CAHut%2BPsY0aevdVqeCUJOrRQMrwpg5Wz3-Mo%2BbU%3DmCxW2%2B9EBTg%40mail.gmail.com
[2] Houz reply for my review v38 -
https://www.postgresql.org/message-id/OS0PR01MB5716D738A8F27968806957B194289%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Standby recovers records from wrong timeline
Next
From: Dilip Kumar
Date:
Subject: Re: problems with making relfilenodes 56-bits