RE: Parallel Apply - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Parallel Apply
Msg-id TY4PR01MB169076B157207AC056B1EA54F94D4A@TY4PR01MB16907.jpnprd01.prod.outlook.com
In response to Re: Parallel Apply  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Thursday, November 20, 2025 5:31 AM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hello Kuroda-san,
> 
> On 11/18/25 12:00, Hayato Kuroda (Fujitsu) wrote:
> > Dear Amit,
> >
> >> It seems you haven't sent the patch that preserves commit order or the
> >> commit message of the attached patch is wrong. I think the first patch
> >> in series should be the one that preserves commit order and then we
> >> can build a patch that tracks dependencies and allows parallelization
> >> without preserving commit order.
> >
> > I think I attached the correct file. Since we are trying to preserve
> > the commit order by default, everything was merged into one patch.
> 
...

> 
> However, the patch seems fairly large (~80kB, although a fair bit of
> that is comments). Would it be possible to split it into smaller chunks?
> Is there some "minimal patch", which could be moved to 0001, and then
> followed by improvements in 0002, 0003, ...? I sometimes do some
> "infrastructure" first, and the actual patch in the last part (simply
> using the earlier parts).
> 
> I'm not saying it has to be split (or how exactly), but I personally
> find smaller patches easier to review ...

Agreed, and thanks for the suggestion; we will try to split the patch into
smaller pieces.

> 
> > One point to clarify is that dependency tracking is essential even if we fully
> > preserve the commit ordering not to violate constrains like PK. Assuming
> there is
> > a table which has PK, txn1 inserts a tuple and txn2 updates it. UPDATE
> statement
> > in txn2 must be done after committing txn1.
> >
> 
> Right. I don't see how we could do parallel apply correct in general
> case without tracking these dependencies.
> 
> >> I feel it may be better to just
> >> discuss preserve commit order patch that also contains some comments
> >> as to how to extend it further, once that is done, we can do further
> >> discussion of the other patch.
> >
> > I do agree, let me implement one by one.
> >
> 
> Some comments / questions after looking at the patch today:

Thanks for the comments!

> 1) The way the patch determines dependencies seems to be the "writeset"
> approach from other replication systems (e.g. MySQL does that). Maybe we
> should stick to the same naming?
 
OK. I have not researched MySQL's design in detail, but I will look into it.
 
> 2) If I understand correctly, the patch maintains a "replica_identity" hash
> table, with replica identity keys for all changes for all concurrent
> transactions. How expensive can this be, in terms of CPU and memory? What if I
> have multiple large batch transactions, each updating millions of rows?
 
For TPC-B and simple-update workloads, the cost of dependency tracking appears
trivial (for example, the profile from the earlier simple-update test shows
--1.39%--check_dependency_on_replica_identity), but we will analyze
large-transaction cases further, as suggested.
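
To make the discussion concrete, here is a standalone sketch of the general
idea (the names and the data structure are illustrative only, not the actual
code in the patch): the leader hashes the replica identity key of each change
and looks it up in a table of keys written by still-in-progress transactions.
The memory side of your question is visible here as well: one entry per
modified key, so millions of updated rows mean millions of entries.

#include <stdint.h>
#include <stdio.h>

#define WRITESET_SLOTS 4096

typedef struct WritesetEntry
{
    uint64_t    keyhash;      /* hash of (relid, replica identity key) */
    uint32_t    writer_xid;   /* in-progress transaction that last wrote it */
    int         used;
} WritesetEntry;

static WritesetEntry writeset[WRITESET_SLOTS];

/* Remember that xid modified the row identified by keyhash. */
static void
writeset_record(uint64_t keyhash, uint32_t xid)
{
    uint64_t    slot = keyhash % WRITESET_SLOTS;

    while (writeset[slot].used && writeset[slot].keyhash != keyhash)
        slot = (slot + 1) % WRITESET_SLOTS;     /* linear probing */

    writeset[slot].keyhash = keyhash;
    writeset[slot].writer_xid = xid;
    writeset[slot].used = 1;
}

/* Return the xid this change must wait for, or 0 if none. */
static uint32_t
writeset_check(uint64_t keyhash, uint32_t xid)
{
    uint64_t    slot = keyhash % WRITESET_SLOTS;

    while (writeset[slot].used)
    {
        if (writeset[slot].keyhash == keyhash)
            return (writeset[slot].writer_xid != xid) ? writeset[slot].writer_xid : 0;
        slot = (slot + 1) % WRITESET_SLOTS;
    }
    return 0;
}

int
main(void)
{
    /* txn 100 inserts the row with PK 42, txn 101 then updates the same row */
    writeset_record(42, 100);
    printf("txn 101 must wait for txn %u\n", writeset_check(42, 101));
    return 0;
}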
 
>
> 3) Would it make sense to use some alternative data structure? A bloom filter,
> for example. Just a random idea, not sure if that's a good fit.
 
It's worth analyzing. We will run more tests, and if we find bottlenecks
caused by the current dependency tracking, we will investigate alternative
approaches such as a bloom filter.
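
If it does become a problem, one direction could be a per-transaction bloom
filter over the key hashes: fixed memory no matter how many rows a transaction
touches, and a false positive would only cause a spurious wait, never a missed
dependency. A rough sketch of such a filter (illustrative only, not something
the patch currently implements):

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS  (1 << 16)       /* fixed 8 kB filter per transaction */

typedef struct BloomFilter
{
    uint8_t     bits[BLOOM_BITS / 8];
} BloomFilter;

static void
bloom_add(BloomFilter *bf, uint64_t keyhash)
{
    /* derive two probe positions from one 64-bit key hash */
    uint32_t    h1 = (uint32_t) keyhash % BLOOM_BITS;
    uint32_t    h2 = (uint32_t) (keyhash >> 32) % BLOOM_BITS;

    bf->bits[h1 / 8] |= (uint8_t) (1 << (h1 % 8));
    bf->bits[h2 / 8] |= (uint8_t) (1 << (h2 % 8));
}

static bool
bloom_maybe_contains(const BloomFilter *bf, uint64_t keyhash)
{
    uint32_t    h1 = (uint32_t) keyhash % BLOOM_BITS;
    uint32_t    h2 = (uint32_t) (keyhash >> 32) % BLOOM_BITS;

    return (bf->bits[h1 / 8] & (1 << (h1 % 8))) &&
           (bf->bits[h2 / 8] & (1 << (h2 % 8)));
}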
 
>
> 4) I've seen the benchmarks posted a couple days ago, and I'm running some
> tests myself. But it's hard to say if the result is good or bad without
> knowing what fraction of transactions finds a dependency and has to wait for
> an earlier one. Would it be possible to track this somewhere? Is there a
> suitable pg_stats_ view?
 
Right, we will consider this idea and try to implement it.
 
>
> 5) It's not clear to me how did you measure the TPS in your benchmark. Did you
> measure how long it takes for the standby to catch up, or what did you do?
 
The test we shared enables synchronous logical replication, uses pgbench
(simple-update) to write on the publisher, and reports the TPS output by pgbench.
 
>
> 6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e. about
> half of the "ideal" speedup? Is it bottlenecked on WAL, leader having to
> determine dependencies, or something else?
>
> 7) I'm a bit confused about the different types of dependencies, and at which
> point they make the workers wait. There are the dependencies due to modifying
> the same row, in which case the worker waits before starting to apply the
> changes that hits the dependency. And then there's a dependency to enforce
> commit order, in which case it waits before commit. Right? Or did I get that
> wrong?
 
Right, your understanding is correct; there are only two kinds of dependencies
for now (same-row modification and commit order).
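
In pseudo-C, the flow in a parallel apply worker looks roughly like the
following (all function names are hypothetical stand-ins for the sketch, not
the patch's actual API):

#include <stdint.h>

typedef struct Change
{
    uint32_t    depends_on_xid;     /* 0 if no same-row dependency */
} Change;

/* hypothetical helpers, stubbed out for the sketch */
static void wait_for_commit_of(uint32_t xid) { (void) xid; }
static void apply_one_change(const Change *change) { (void) change; }
static void commit_applied_transaction(void) { }

static void
apply_transaction(const Change *changes, int nchanges, uint32_t prev_commit_xid)
{
    for (int i = 0; i < nchanges; i++)
    {
        /* wait point 1: the transaction that last wrote this row must
         * commit before we apply the change that depends on it */
        if (changes[i].depends_on_xid != 0)
            wait_for_commit_of(changes[i].depends_on_xid);

        apply_one_change(&changes[i]);
    }

    /* wait point 2: preserve commit order -- commit only after the
     * transaction that committed immediately before us on the publisher */
    if (prev_commit_xid != 0)
        wait_for_commit_of(prev_commit_xid);

    commit_applied_transaction();
}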
 
>
> 8) The commit message says:
>
> > It would be challenge to check the dependency if the table has user defined
> > trigger or constraints. the most viable solution might be to disallow
> > parallel apply for relations whose triggers and constraints are not marked
> > as parallel-safe or immutable.
>
> Wouldn't this have similar issues with verifying these features on partitioned
> tables as the patch that attempted to allow parallelism for INSERT ... SELECT
> [1]? AFAICS it was too expensive to do with large partitioning hierarchies.
 
By default, since publish_via_partition_root is set to false in the publication,
changes are normally replicated to the leaf partitions directly. So, for
non-partitioned tables, we can directly assess their parallel safety and cache
the results.
 
Partitioned tables require additional handling. But unlike INSERT ... SELECT,
logical replication provides the remote data changes upfront, allowing us to
identify the target leaf partition for each change and assess safety for that
table alone. So, we can avoid examining the whole partition hierarchy for each
change.
 
To check the safety of a change on a partitioned table, the leader worker could
initially perform tuple routing for the remote change and evaluate the
user-defined triggers or functions on the target partition before deciding
whether to parallelize the transaction. Although this approach may introduce
some overhead for the leader, we plan to test its impact. If the overhead is
unacceptable, we might also consider disallowing parallelism for changes on
partitioned tables.
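
In outline, the leader's decision could look like the sketch below; the helper
names are hypothetical and only illustrate the flow described above (in
reality this would reuse the existing tuple-routing machinery and a
per-relation parallel-safety cache):

#include <stdbool.h>

typedef struct RemoteChange RemoteChange;   /* decoded change from the publisher */
typedef struct Rel Rel;                     /* stand-in for a relation */

/* hypothetical helpers, stubbed out for the sketch */
static Rel *
route_to_leaf_partition(Rel *parent, const RemoteChange *change)
{
    (void) change;
    return parent;          /* stub */
}

static bool
leaf_is_parallel_safe(Rel *leaf)
{
    (void) leaf;
    return true;            /* stub: triggers/constraints marked safe? */
}

static bool
change_is_parallel_safe(Rel *target, const RemoteChange *change, bool is_partitioned)
{
    Rel    *leaf = target;

    /* the remote change already carries the new tuple, so the leader can
     * route it to the leaf partition before deciding anything */
    if (is_partitioned)
        leaf = route_to_leaf_partition(target, change);

    /* only the resolved leaf is checked, not the whole hierarchy */
    return leaf_is_parallel_safe(leaf);
}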
 
>
> 9) I think it'd be good to make sure the "design" comments explain how the new
> parts work in more detail. For example, the existing comment at the beginning
> of applyparallelworker.c goes into a lot of detail, but the patch adds only
> two fairly short paragraphs. Even the commit message has more detail, which
> seems a bit strange.
 
Agreed, we will add more comments.
 
>
> 10) For example it would be good to explain what "internal dependency" and
> "internal relation" are for. I think I understand the internal dependency, I'm
> still not quite sure why we need internal relation (or rather why we didn't
> need it before).
 
The internal relation is used to share relation information (such as the
publisher's table name, schema name, relkind, column names, etc.) with parallel
apply workers. This information is needed to verify that the publisher's
relation definition aligns with the subscriber's when applying changes.
 
Previously, sharing this information wasn't necessary because parallel apply
workers were only tasked with applying streamed transactions. In those cases, the
relation information for modified relations was always sent within the streamed
transaction (see maybe_send_schema() for details), eliminating the need for
additional sharing. However, in non-streaming transactions, relation information
might not be included in every transaction. Therefore, we ask the leader to
distribute the received relation information to parallel apply workers before
assigning them a transaction.
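
For illustration, the forwarded metadata would be roughly the same set of
fields the publisher already sends in a RELATION message; the struct below is
only a sketch of that idea, not the patch's actual struct or on-queue format:

#include <stdint.h>

typedef struct SharedRelationInfo
{
    uint32_t    remoteid;       /* relation OID on the publisher */
    char       *nspname;        /* publisher schema name */
    char       *relname;        /* publisher table name */
    char        relkind;        /* relation kind on the publisher */
    char        replident;      /* replica identity setting */
    int         natts;          /* number of columns */
    char      **attnames;       /* column names */
    uint32_t   *atttyps;        /* column type OIDs */
} SharedRelationInfo;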
 
>
> 11) I think it might be good to have TAP tests that stress this out in various
> ways. Say, a test that randomly restarts the standby during parallel apply,
> and checks it does not miss any records, etc. In the online checksums patch
> this was quite useful. It wouldn't be part of regular check-world, of course.
> Or maybe it'd be for development only?
 
We will think more on this.

Best Regards,
Hou zj
