RE: Time delayed LR (WAS Re: logical replication restrictions) - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Time delayed LR (WAS Re: logical replication restrictions)
Msg-id TYAPR01MB586688F1D7FFAA0D2D3C3720F5BA9@TYAPR01MB5866.jpnprd01.prod.outlook.com
In response to Re: Time delayed LR (WAS Re: logical replication restrictions)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Dear hackers,

As Sawada-san pointed out in the discussion [1], the current approach to
time-delayed logical replication prevents WAL from being recycled, so I'm planning
to close the CF entry for now. This thread or a forked one will be registered again
once we decide on an alternative approach. Thank you very much for taking the time
to join our earlier discussions.

I think that to solve the issue, logical changes must first be flushed on the
subscriber, and workers must then apply them after the specified delay has elapsed.
The straightforward approach is to follow physical replication and introduce a
walreceiver process on the subscriber. More research is needed, but there are at
least some benefits:

* The publisher can be shut down even if the apply worker is stuck. Such a stall is
  more likely than in physical replication, so this may improve robustness.
  For details, please see another thread [2].
* With synchronous_commit = 'remote_write', the publisher can COMMIT faster,
  because the walreceiver flushes changes immediately and replies soon.
  Even if the time delay is enabled, the wait time will not increase.
* It may serve as infrastructure for parallel apply of non-streaming transactions.
  The basic designs are similar: one process receives changes and others apply them.
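
To make the idea concrete, here is a minimal sketch of the receive-then-delayed-apply
split described above. It is only an illustration, not PostgreSQL code: the names
Change, Subscriber, spool, and min_apply_delay are hypothetical, the in-memory deque
stands in for changes flushed to disk, and the clock is passed in explicitly so the
behaviour is easy to follow.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Change:
    lsn: int            # position of the change in the publisher's WAL
    commit_time: float  # commit timestamp on the publisher, in seconds
    payload: str


@dataclass
class Subscriber:
    min_apply_delay: float                       # hypothetical delay setting, in seconds
    spool: deque = field(default_factory=deque)  # stands in for changes flushed to disk
    flushed_lsn: int = 0                         # position acknowledged to the publisher
    applied: list = field(default_factory=list)  # payloads applied so far

    def receive(self, change: Change) -> int:
        """Receiver side: persist the change and acknowledge it immediately."""
        self.spool.append(change)
        self.flushed_lsn = change.lsn
        return self.flushed_lsn

    def apply_ready(self, now: float) -> int:
        """Apply side: apply only the changes whose delay has elapsed."""
        count = 0
        while self.spool and self.spool[0].commit_time + self.min_apply_delay <= now:
            self.applied.append(self.spool.popleft().payload)
            count += 1
        return count
```

In this sketch the acknowledged position advances as soon as a change is received,
independently of the delay, which is the property that would let the publisher
recycle WAL and shut down while the apply side is still waiting.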

I searched old discussions [3] and wiki pages, and found that the initial prototype
had a logical walreceiver, but in a later version [4] the apply worker received
changes directly. I could not find the reason for that decision, but I suspect it
was one of the following. Could you please tell me the actual background?

* Performance bottlenecks. If the walreceiver flushes changes and the worker applies
  them, fsync() is called on every reception.
* Complexity. In this design the walreceiver and the apply worker must share their
  flush/apply progress. Crash recovery also needs more consideration. The related
  discussion can be found in [5].
* Extensibility. In-core logical replication should serve as an example for external
  projects. The apply worker is just a background worker that can be launched from an
  extension, so it is easy to understand. If it depended deeply on the walreceiver,
  other projects could not follow it.
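
As a rough illustration of the complexity point, the shared state might look like
the sketch below. Everything here is an assumption for discussion: the Progress
class, its fields, and the recovery rule (replay the locally spooled range, then
resume streaming from the flushed position) are hypothetical and are not taken
from any patch.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    flushed_lsn: int = 0  # last LSN the receiver has made durable locally
    applied_lsn: int = 0  # last LSN the apply worker has committed

    def advance_flush(self, lsn: int) -> None:
        # The flushed position only moves forward.
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def advance_apply(self, lsn: int) -> None:
        # A change cannot be applied before it has been flushed.
        if lsn > self.flushed_lsn:
            raise ValueError("cannot apply beyond the flushed position")
        self.applied_lsn = max(self.applied_lsn, lsn)

    def recovery_plan(self) -> dict:
        # After a crash: replay what is spooled but not yet applied,
        # then ask the publisher to resume from the flushed position.
        return {
            "replay_from_spool": (self.applied_lsn, self.flushed_lsn),
            "request_from_publisher": self.flushed_lsn,
        }
```

Even in this toy form, two positions must be persisted consistently by two
processes, which hints at why crash recovery needs extra care.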

[1]: https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/flat/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/201206131327.24092.andres%402ndquadrant.com
[4]: https://www.postgresql.org/message-id/37e19ad5-f667-2fe2-b95b-bba69c5b6c68@2ndquadrant.com
[5]: https://www.postgresql.org/message-id/1339586927-13156-12-git-send-email-andres%402ndquadrant.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
