Thread: Applying logical replication changes by more than one process
Hi, I am trying to use logical replication mechanism in implementation of PostgreSQL multimaster and faced with one conceptualproblem. Originally logical replication was intended to support asynchronous replication. In this case applying changes by singleprocess should not be a bottleneck. But if we are using distributed transaction manager to provide global consistency, then applying transaction by one worker leads to very bad performance and what is worser: cause unintended serialization of transactions, which is not takenin account by distributed deadlock detection algorithm and so can cause undetected deadlocks. So I have implemented pool of background workers which can apply transactions concurrently. It works and shows acceptable performance. But now I am thinking about HA and tracking origin LSNs which are needed to correctlyspecify slot position in case of recovery. And there is a problem: as far as I understand to correctly record originLSN in WAL and advance slot position it is necessary to setup session using replorigin_session_setup. It is not so convenient in case of using pool of background workers, because we have to setupsession for each commit. But the main problem is that for each slot session can be associated only with one process: else if (curstate->acquired_by != 0) { ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), errmsg("replication identifier %d is already active for PID %d", curstate->roident, curstate->acquired_by))); } Which once again means that there can be only one process applying changes. To provide correct state of replication node it is necessary to enforce that each logical replication record is replayedexactly once: we should not loose some change or try to apply it twice. So operation of recording original LSN positionin WAL and adjusting slot should be atomic. And during recovery we should restore slot current position based on the origin values extracted from WAL. I wonder if it can be done using current logical replication mechanism when changes of each slot areapplied by more than one process? Or the only alternative is to write/read origin LSNs in WAL myself, for example using custom WAL records? Thanks in advance! -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi, On 19/03/16 11:46, Konstantin Knizhnik wrote: > Hi, > > I am trying to use logical replication mechanism in implementation of > PostgreSQL multimaster and faced with one conceptual problem. > Originally logical replication was intended to support asynchronous > replication. In this case applying changes by single process should not > be a bottleneck. > But if we are using distributed transaction manager to provide global > consistency, then applying transaction by one worker leads to very bad > performance and what is worser: cause unintended serialization of > transactions, which is not taken in account by distributed deadlock > detection algorithm and so can cause > undetected deadlocks. > > So I have implemented pool of background workers which can apply > transactions concurrently. > It works and shows acceptable performance. But now I am thinking about > HA and tracking origin LSNs which are needed to correctly specify slot > position in case of recovery. And there is a problem: as far as I > understand to correctly record origin LSN in WAL and advance slot > position it is necessary to setup session > using replorigin_session_setup. It is not so convenient in case of using > pool of background workers, because we have to setup session for each > commit. > But the main problem is that for each slot session can be associated > only with one process: > > else if (curstate->acquired_by != 0) > { > ereport(ERROR, > (errcode(ERRCODE_OBJECT_IN_USE), > errmsg("replication identifier %d is already active for > PID %d", > curstate->roident, curstate->acquired_by))); > } > > Which once again means that there can be only one process applying changes. > That's not true, all it means is that you can do replorigin_session_setup for same origin only in one process but you don't need to have it setup for session to update it, the replorigin_advance() works just fine. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 19/03/16 11:46, Konstantin Knizhnik wrote: > Hi, > > I am trying to use logical replication mechanism in implementation of > PostgreSQL multimaster and faced with one conceptual problem. > Originally logical replication was intended to support asynchronous > replication. In this case applying changes by single process should not > be a bottleneck. > But if we are using distributed transaction manager to provide global > consistency, then applying transaction by one worker leads to very bad > performance and what is worser: cause unintended serialization of > transactions, which is not taken in account by distributed deadlock > detection algorithm and so can cause > undetected deadlocks. > > So I have implemented pool of background workers which can apply > transactions concurrently. > It works and shows acceptable performance. But now I am thinking about > HA and tracking origin LSNs which are needed to correctly specify slot > position in case of recovery. And there is a problem: as far as I > understand to correctly record origin LSN in WAL and advance slot > position it is necessary to setup session > using replorigin_session_setup. It is not so convenient in case of using > pool of background workers, because we have to setup session for each > commit. > But the main problem is that for each slot session can be associated > only with one process: > > else if (curstate->acquired_by != 0) > { > ereport(ERROR, > (errcode(ERRCODE_OBJECT_IN_USE), > errmsg("replication identifier %d is already active for > PID %d", > curstate->roident, curstate->acquired_by))); > } > > Which once again means that there can be only one process applying changes. > That's not true, all it means is that you can do replorigin_session_setup for same origin only in one process but you don't need to have it setup for session to update it, the replorigin_advance() works just fine. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 21.03.2016 15:10, Petr Jelinek wrote: > Hi, > > On 19/03/16 11:46, Konstantin Knizhnik wrote: >> Hi, >> >> I am trying to use logical replication mechanism in implementation of >> PostgreSQL multimaster and faced with one conceptual problem. >> Originally logical replication was intended to support asynchronous >> replication. In this case applying changes by single process should not >> be a bottleneck. >> But if we are using distributed transaction manager to provide global >> consistency, then applying transaction by one worker leads to very bad >> performance and what is worser: cause unintended serialization of >> transactions, which is not taken in account by distributed deadlock >> detection algorithm and so can cause >> undetected deadlocks. >> >> So I have implemented pool of background workers which can apply >> transactions concurrently. >> It works and shows acceptable performance. But now I am thinking about >> HA and tracking origin LSNs which are needed to correctly specify slot >> position in case of recovery. And there is a problem: as far as I >> understand to correctly record origin LSN in WAL and advance slot >> position it is necessary to setup session >> using replorigin_session_setup. It is not so convenient in case of using >> pool of background workers, because we have to setup session for each >> commit. >> But the main problem is that for each slot session can be associated >> only with one process: >> >> else if (curstate->acquired_by != 0) >> { >> ereport(ERROR, >> (errcode(ERRCODE_OBJECT_IN_USE), >> errmsg("replication identifier %d is already active for >> PID %d", >> curstate->roident, curstate->acquired_by))); >> } >> >> Which once again means that there can be only one process applying >> changes. >> > > That's not true, all it means is that you can do > replorigin_session_setup for same origin only in one process but you > don't need to have it setup for session to update it, the > replorigin_advance() works just fine. But RecordTransactionCommit is using replorigin_session_advance, not replorigin_advance. And replorigin_session_advance requires that session was setup: void replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit) { Assert(session_replication_state != NULL); } "session_replication_state" is private variable which is set by replorigin_session_setup. But attempt to call replorigin_session_setup from multiple process cause above error. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 21/03/16 13:44, Konstantin Knizhnik wrote: > > > On 21.03.2016 15:10, Petr Jelinek wrote: >> Hi, >> >> On 19/03/16 11:46, Konstantin Knizhnik wrote: >>> Hi, >>> >>> I am trying to use logical replication mechanism in implementation of >>> PostgreSQL multimaster and faced with one conceptual problem. >>> Originally logical replication was intended to support asynchronous >>> replication. In this case applying changes by single process should not >>> be a bottleneck. >>> But if we are using distributed transaction manager to provide global >>> consistency, then applying transaction by one worker leads to very bad >>> performance and what is worser: cause unintended serialization of >>> transactions, which is not taken in account by distributed deadlock >>> detection algorithm and so can cause >>> undetected deadlocks. >>> >>> So I have implemented pool of background workers which can apply >>> transactions concurrently. >>> It works and shows acceptable performance. But now I am thinking about >>> HA and tracking origin LSNs which are needed to correctly specify slot >>> position in case of recovery. And there is a problem: as far as I >>> understand to correctly record origin LSN in WAL and advance slot >>> position it is necessary to setup session >>> using replorigin_session_setup. It is not so convenient in case of using >>> pool of background workers, because we have to setup session for each >>> commit. >>> But the main problem is that for each slot session can be associated >>> only with one process: >>> >>> else if (curstate->acquired_by != 0) >>> { >>> ereport(ERROR, >>> (errcode(ERRCODE_OBJECT_IN_USE), >>> errmsg("replication identifier %d is already active for >>> PID %d", >>> curstate->roident, curstate->acquired_by))); >>> } >>> >>> Which once again means that there can be only one process applying >>> changes. >>> >> >> That's not true, all it means is that you can do >> replorigin_session_setup for same origin only in one process but you >> don't need to have it setup for session to update it, the >> replorigin_advance() works just fine. > > But RecordTransactionCommit is using replorigin_session_advance, not > replorigin_advance. Only when the origin is actually setup for the current session. You need to call the replorigin_advance yourself from your apply code. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On March 21, 2016 2:08:54 PM GMT+01:00, Petr Jelinek <petr@2ndquadrant.com> wrote: >On 21/03/16 13:44, Konstantin Knizhnik wrote: >> >> >> On 21.03.2016 15:10, Petr Jelinek wrote: >>> Hi, >>> >>> On 19/03/16 11:46, Konstantin Knizhnik wrote: >>>> Hi, >>>> >>>> I am trying to use logical replication mechanism in implementation >of >>>> PostgreSQL multimaster and faced with one conceptual problem. >>>> Originally logical replication was intended to support asynchronous >>>> replication. In this case applying changes by single process should >not >>>> be a bottleneck. >>>> But if we are using distributed transaction manager to provide >global >>>> consistency, then applying transaction by one worker leads to very >bad >>>> performance and what is worser: cause unintended serialization of >>>> transactions, which is not taken in account by distributed deadlock >>>> detection algorithm and so can cause >>>> undetected deadlocks. >>>> >>>> So I have implemented pool of background workers which can apply >>>> transactions concurrently. >>>> It works and shows acceptable performance. But now I am thinking >about >>>> HA and tracking origin LSNs which are needed to correctly specify >slot >>>> position in case of recovery. And there is a problem: as far as I >>>> understand to correctly record origin LSN in WAL and advance slot >>>> position it is necessary to setup session >>>> using replorigin_session_setup. It is not so convenient in case of >using >>>> pool of background workers, because we have to setup session for >each >>>> commit. >>>> But the main problem is that for each slot session can be >associated >>>> only with one process: >>>> >>>> else if (curstate->acquired_by != 0) >>>> { >>>> ereport(ERROR, >>>> (errcode(ERRCODE_OBJECT_IN_USE), >>>> errmsg("replication identifier %d is already active >for >>>> PID %d", >>>> curstate->roident, curstate->acquired_by))); >>>> } >>>> >>>> Which once again means that there can be only one process applying >>>> changes. >>>> >>> >>> That's not true, all it means is that you can do >>> replorigin_session_setup for same origin only in one process but you >>> don't need to have it setup for session to update it, the >>> replorigin_advance() works just fine. >> >> But RecordTransactionCommit is using replorigin_session_advance, not >> replorigin_advance. > >Only when the origin is actually setup for the current session. You >need >to call the replorigin_advance yourself from your apply code. That's problematic from a durability POV. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 21/03/16 14:15, Andres Freund wrote: > > > On March 21, 2016 2:08:54 PM GMT+01:00, Petr Jelinek <petr@2ndquadrant.com> wrote: >> On 21/03/16 13:44, Konstantin Knizhnik wrote: >>> >>> >>> On 21.03.2016 15:10, Petr Jelinek wrote: >>>> Hi, >>>> >>>> On 19/03/16 11:46, Konstantin Knizhnik wrote: >>>>> Hi, >>>>> >>>>> I am trying to use logical replication mechanism in implementation >> of >>>>> PostgreSQL multimaster and faced with one conceptual problem. >>>>> Originally logical replication was intended to support asynchronous >>>>> replication. In this case applying changes by single process should >> not >>>>> be a bottleneck. >>>>> But if we are using distributed transaction manager to provide >> global >>>>> consistency, then applying transaction by one worker leads to very >> bad >>>>> performance and what is worser: cause unintended serialization of >>>>> transactions, which is not taken in account by distributed deadlock >>>>> detection algorithm and so can cause >>>>> undetected deadlocks. >>>>> >>>>> So I have implemented pool of background workers which can apply >>>>> transactions concurrently. >>>>> It works and shows acceptable performance. But now I am thinking >> about >>>>> HA and tracking origin LSNs which are needed to correctly specify >> slot >>>>> position in case of recovery. And there is a problem: as far as I >>>>> understand to correctly record origin LSN in WAL and advance slot >>>>> position it is necessary to setup session >>>>> using replorigin_session_setup. It is not so convenient in case of >> using >>>>> pool of background workers, because we have to setup session for >> each >>>>> commit. >>>>> But the main problem is that for each slot session can be >> associated >>>>> only with one process: >>>>> >>>>> else if (curstate->acquired_by != 0) >>>>> { >>>>> ereport(ERROR, >>>>> (errcode(ERRCODE_OBJECT_IN_USE), >>>>> errmsg("replication identifier %d is already active >> for >>>>> PID %d", >>>>> curstate->roident, curstate->acquired_by))); >>>>> } >>>>> >>>>> Which once again means that there can be only one process applying >>>>> changes. >>>>> >>>> >>>> That's not true, all it means is that you can do >>>> replorigin_session_setup for same origin only in one process but you >>>> don't need to have it setup for session to update it, the >>>> replorigin_advance() works just fine. >>> >>> But RecordTransactionCommit is using replorigin_session_advance, not >>> replorigin_advance. >> >> Only when the origin is actually setup for the current session. You >> need >> to call the replorigin_advance yourself from your apply code. > > That's problematic from a durability POV. > Huh? How come? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote: > On 21/03/16 14:15, Andres Freund wrote: > >>Only when the origin is actually setup for the current session. You > >>need > >>to call the replorigin_advance yourself from your apply code. > > > >That's problematic from a durability POV. > > > > Huh? How come? If you use the session mechanism the replication progress is synced with the apply process, even if there are crashes. Crash recovery updates the progress. There's no such interlock with apply otherwise, and I don't see how you can build one with reasonable effort.
On 21/03/16 14:25, Andres Freund wrote: > On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote: >> On 21/03/16 14:15, Andres Freund wrote: >>>> Only when the origin is actually setup for the current session. You >>>> need >>>> to call the replorigin_advance yourself from your apply code. >>> >>> That's problematic from a durability POV. >>> >> >> Huh? How come? > > If you use the session mechanism the replication progress is synced with > the apply process, even if there are crashes. Crash recovery updates the > progress. There's no such interlock with apply otherwise, and I don't > see how you can build one with reasonable effort. > Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote:
On 21/03/16 14:25, Andres Freund wrote:On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote:On 21/03/16 14:15, Andres Freund wrote:Only when the origin is actually setup for the current session. Youneedto call the replorigin_advance yourself from your apply code.That's problematic from a durability POV.Huh? How come?If you use the session mechanism the replication progress is synced withthe apply process, even if there are crashes. Crash recovery updates theprogress. There's no such interlock with apply otherwise, and I don'tsee how you can build one with reasonable effort.
Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true.
It actually means that we can not enforce database consistency. If we do replorigin_advance before commit and then crash happen, then we will loose some changes.
If we call replorigin_advance after commit but crash happen before, then some changes can be applied multiple times. For example we can insert some record twice (if there are no unique constraints).
Look likes the only working scenario is to setup replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple process, doesn't it?
I have tried it, fortunately it doesn't cause any noticeable performance degradation. But unfortunately can't consider such approach as elegant.
Why it is actually necessary to bind replication slot to process? Why it is not possible to have multiple concurrent sessions for the same slot?
Also I concern about using sequential search for slot location in replorigin_session_setup and many other functions - there is loop through all max_replication_slots.
It seems to be not a problem when number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has two-digit number of nodes.
But if we want to perform sharding and use logical replication for providing redundancy, then number of nodes and slots can be essentially larger.
I didn't think much about such configuration - may be it possible to propose more efficient mechanism for replication in this case.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22 March 2016 at 14:32, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true.It actually means that we can not enforce database consistency. If we do replorigin_advance before commit and then crash happen, then we will loose some changes.If we call replorigin_advance after commit but crash happen before, then some changes can be applied multiple times. For example we can insert some record twice (if there are no unique constraints).Look likes the only working scenario is to setup replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple process, doesn't it?
Yes.
How would you expect it to work if you attempted to replorigin_advance without a session? From multiple concurrent backends?
Parallel apply is complicated business. You have to make sure you apply xacts in an order that's free from deadlocks and from insert/delete anomalies - though you can at least detect those, ERROR that xact and all subsequent ones, and retry. For progress tracking to be consistent and correct you'd have to make sure you committed strictly in the same order as upstream. Just before each commit you can set the origin LSN and advance the replication origin, which will commit atomically along with the commit it confirms. I don't really see the problem.
I have tried it, fortunately it doesn't cause any noticeable performance degradation. But unfortunately can't consider such approach as elegant.Why it is actually necessary to bind replication slot to process? Why it is not possible to have multiple concurrent sessions for the same slot?
Especially since most slot changes LWLock- and/or spinlock-protected already.
The client would have to manage replay confirmations appropriately so that it doesn't confirm past the point where some other connection still needs it.
We'd have to expose a "slot" column in pg_stat_replication and remove the "pid" column from pg_replication_slots to handle the 1:n relationship between slot clients and slots, and it'd be a pain to show which normal user backends were using a slot. Not really sure how to handle that.
To actually make this useful would require a lot more though. A way to request that replay start from a new LSN without a full disconnect/reconnect each time. Client-side parallel consume/apply. Inter-transaction ordering information so the client can work out a viable xact apply order (possibly using SSI information per the discussion with Kevin?). Etc.
I haven't really looked into this and I suspect there are some hairy areas involved in replaying a slot from more than one client. The reason I'm interested in it personally is for initial replica state setup as Oleksandr prototyped and described earlier. We could attach to the slot's initial snapshot then issue a new replication command that, given a table name or oid, scans the table from the snapshot and passes each tuple to a new callback (like, but not the same as, the insert callback) on the output plugin.
That way clients could parallel-copy the initial state of the DB across the same replication protocol they then consume new changes from, with no need to make normal libpq connections and COPY initial state.
I'm interested in being able to do parallel receive of new changes from the slot too, but suspect that'd be a bunch harder.
Also I concern about using sequential search for slot location in replorigin_session_setup and many other functions - there is loop through all max_replication_slots.It seems to be not a problem when number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has two-digit number of nodes.But if we want to perform sharding and use logical replication for providing redundancy, then number of nodes and slots can be essentially larger.
Sounds like premature optimisation. Deal with it if it comes up in profiles in scale testing with 100 clients. I'll be surprised if it does.
On 22/03/16 07:32, konstantin knizhnik wrote: > > On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote: > >> On 21/03/16 14:25, Andres Freund wrote: >>> On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote: >>>> On 21/03/16 14:15, Andres Freund wrote: >>>>>> Only when the origin is actually setup for the current session. You >>>>>> need >>>>>> to call the replorigin_advance yourself from your apply code. >>>>> >>>>> That's problematic from a durability POV. >>>>> >>>> >>>> Huh? How come? >>> >>> If you use the session mechanism the replication progress is synced with >>> the apply process, even if there are crashes. Crash recovery updates the >>> progress. There's no such interlock with apply otherwise, and I don't >>> see how you can build one with reasonable effort. >>> >> >> Ah you mean because with wal_log=true the origin advance is in >> different WAL record than commit? OK yeah you might be one transaction >> behind then, true. > > It actually means that we can not enforce database consistency. If we do > replorigin_advance before commit and then crash happen, then we will > loose some changes. > If we call replorigin_advance after commit but crash happen before, then > some changes can be applied multiple times. For example we can insert > some record twice (if there are no unique constraints). > Look likes the only working scenario is to setup replication session for > each commit and use locking to prevent concurrent session setup for the > same slot by multiple process, doesn't it? You can do that, or you can move the tracking to the receiving process and spill the data to the disk (hurts IO obviously), or save the progress to table (also hurts IO), or write patch which solves this (no idea how though). > > Also I concern about using sequential search for slot location > in replorigin_session_setup and many other functions - there is loop > through all max_replication_slots. > It seems to be not a problem when number of slots is less than 10. For > multimaster this assumption is true - even Oracle RAC rarely has > two-digit number of nodes. > But if we want to perform sharding and use logical replication for > providing redundancy, then number of nodes and slots can be essentially > larger. > I didn't think much about such configuration - may be it possible to > propose more efficient mechanism for replication in this case. > And each slot means connection with logical decoding attached to it so you don't really want to have thousands of those anyway. I think you'll hit other problems faster than loop over slots becomes problem if you plan to keep all of them active. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mar 22, 2016, at 10:10 AM, Craig Ringer wrote:
On 22 March 2016 at 14:32, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true.It actually means that we can not enforce database consistency. If we do replorigin_advance before commit and then crash happen, then we will loose some changes.If we call replorigin_advance after commit but crash happen before, then some changes can be applied multiple times. For example we can insert some record twice (if there are no unique constraints).Look likes the only working scenario is to setup replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple process, doesn't it?Yes.How would you expect it to work if you attempted to replorigin_advance without a session? From multiple concurrent backends?
Please excuse me, I am not thinking now about the general case of using logical replication, I just focused on multimaster.
What I need is some efficient, durable and atomic mechanism for applying changes.
I see only two ways to provide atomicity:
1. Tracking of origins should be done inside xact as part of normal commit.
2. Use custom WAL records.
1) is easier now and it really works if I correctly synchronize access to slots. And surprisingly it even doesn't add substantial overhead.
Parallel apply is complicated business. You have to make sure you apply xacts in an order that's free from deadlocks and from insert/delete anomalies - though you can at least detect those, ERROR that xact and all subsequent ones, and retry.
Well, this is exactly what our multimaster does. We do not try to enforce order of applying xacts. But we detect global deadlocks and use 2PC to provide data consistency.
So it is not task of logical replication, it is done by DTM overriding visibility checks and transaction commit protocol using XTM.
For progress tracking to be consistent and correct you'd have to make sure you committed strictly in the same order as upstream. Just before each commit you can set the origin LSN and advance the replication origin, which will commit atomically along with the commit it confirms. I don't really see the problem.
I have tried it, fortunately it doesn't cause any noticeable performance degradation. But unfortunately can't consider such approach as elegant.Why it is actually necessary to bind replication slot to process? Why it is not possible to have multiple concurrent sessions for the same slot?Especially since most slot changes LWLock- and/or spinlock-protected already.The client would have to manage replay confirmations appropriately so that it doesn't confirm past the point where some other connection still needs it.We'd have to expose a "slot" column in pg_stat_replication and remove the "pid" column from pg_replication_slots to handle the 1:n relationship between slot clients and slots, and it'd be a pain to show which normal user backends were using a slot. Not really sure how to handle that.To actually make this useful would require a lot more though. A way to request that replay start from a new LSN without a full disconnect/reconnect each time. Client-side parallel consume/apply. Inter-transaction ordering information so the client can work out a viable xact apply order (possibly using SSI information per the discussion with Kevin?). Etc.I haven't really looked into this and I suspect there are some hairy areas involved in replaying a slot from more than one client. The reason I'm interested in it personally is for initial replica state setup as Oleksandr prototyped and described earlier. We could attach to the slot's initial snapshot then issue a new replication command that, given a table name or oid, scans the table from the snapshot and passes each tuple to a new callback (like, but not the same as, the insert callback) on the output plugin.That way clients could parallel-copy the initial state of the DB across the same replication protocol they then consume new changes from, with no need to make normal libpq connections and COPY initial state.I'm interested in being able to do parallel receive of new changes from the slot too, but suspect that'd be a bunch harder.Also I concern about using sequential search for slot location in replorigin_session_setup and many other functions - there is loop through all max_replication_slots.It seems to be not a problem when number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has two-digit number of nodes.But if we want to perform sharding and use logical replication for providing redundancy, then number of nodes and slots can be essentially larger.Sounds like premature optimisation. Deal with it if it comes up in profiles in scale testing with 100 clients. I'll be surprised if it does.--
On Mar 22, 2016, at 11:14 AM, Petr Jelinek wrote:
And each slot means connection with logical decoding attached to it so you don't really want to have thousands of those anyway. I think you'll hit other problems faster than loop over slots becomes problem if you plan to keep all of them active.
Assume that cluster have thousands of nodes and we use sharding to scatter data through cluster nodes.
But to provide HA we want to perform sharding with some level of redundancy, for example save the same record at 3 different nodes.
Once possible approach (pg_shard) is to execute the same query at three different shards.
But there is no warranty that result of execution will be the same at all nodes.
Alternative approach is to execute transaction at one node and then replicate it using logical replication to replicas.
So we do not perform logical replication to all 1000 nodes. Just to 2 of them. But each time it will be different pair of nodes. So we still need to have 1000 active replication slots.
May be logical replication can not be used at all in such scenario - I have not thought much about it yet. Our first step will be multimaster without sharding.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers