Thread: Synchronous replication, network protocol
The protocol between primary and standby hasn't been discussed or documented in detail. I don't think it's enough to just stream WAL as it's generated, so here's my proposal. Messages marked with "(later)" are for features that have been discussed, but that no one is implementing for 8.4.

The messages are sent like in the frontend/backend protocol. The handshake can work like in the current patch, although I don't think we need or should allow running regular queries before entering "replication mode". The backend should become a walsender process directly after authentication.

Standby -> primary

RequestWAL <begin> <end>
Primary should respond with a WALRange message containing the given range of WAL data.

StartReplication <begin>
Primary should send all already-generated WAL beginning from <begin>, and then keep sending as it's generated.

ReplicatedUpTo <end>
Acknowledge that all WAL up to <end> has been successfully received and written to disk and/or fsync'd (depending on the replication mode in use). The primary can use this information to acknowledge a transaction as committed to the client in case of synchronous replication.

(later) OldestXmin <xid>
When a hot standby server is running read-only queries, indicates the current OldestXmin on the standby. The primary can refrain from vacuuming tuples still required by the slave using this value, if so configured. That will ensure that the standby doesn't need to stall WAL application because of read-only queries.

(later) RequestBaseBackup
Request a new base backup to be sent. This can be used to initialize a new slave.

Primary -> standby

WALRange <begin> <end> <data>
Response to RequestWAL or StartReplication message. After receiving a StartReplication message, the primary can send these messages when it feels like it. In synchronous mode, that would be at least at each commit. The standby should respond with a ReplicatedUpTo message to each WALRange message.

(later) BaseBackup <data>
A base backup, in response to RequestBaseBackup message. For example, in .tar.gz format.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
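As a purely illustrative reading of the proposal above, a WALRange message could be framed the same way the existing frontend/backend protocol frames its messages: a one-byte type code, an int32 length, then the payload. The type byte, the flat 64-bit WAL position, and the field order below are assumptions made for this sketch; the proposal does not fix any of them.

/* Hypothetical framing for the proposed WALRange message.  Assumes FE/BE-style
 * "type byte + int32 length (counting itself) + payload" framing and a flat
 * 64-bit WAL position; none of this is nailed down by the proposal above. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogPtr;          /* simplified; 8.4's XLogRecPtr is two 32-bit fields */

#define MSG_WAL_RANGE 'w'          /* made-up type code */

/* Serialize: 'w' | int32 len | begin | end | data.  Returns total bytes written.
 * Byte-order conversion is omitted to keep the sketch short. */
static size_t
build_wal_range(char *buf, XLogPtr begin, XLogPtr end,
                const char *data, uint32_t datalen)
{
    uint32_t len = 4 + 8 + 8 + datalen;   /* length field counts itself, as in FE/BE */
    char    *p = buf;

    *p++ = MSG_WAL_RANGE;
    memcpy(p, &len, 4);     p += 4;
    memcpy(p, &begin, 8);   p += 8;
    memcpy(p, &end, 8);     p += 8;
    memcpy(p, data, datalen);
    return 1 + len;
}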
On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:

> (later) OldestXmin <xid>
> When a hot standby server is running read-only queries, indicates the
> current OldestXmin on the standby. The primary can refrain from
> vacuuming tuples still required by the slave using this value, if so
> configured.

This is all reading like you are relaying someone else's thoughts, or those of a committee.

The above is the exact opposite of your position on 11 Sep, where you said having a matching xmin between primary and standby "makes an awful solution for high availability", which Richard, Greg and Robert, at least, agreed with explicitly.

I *am* happy to rediscuss this aspect, because I think you may now see the problems with what people had earlier ruled out. But it would be good to understand the reason for the 180-degree manoeuvre before we start coding up protocol changes.

> That will ensure that the standby doesn't need to stall WAL
> application because of read-only queries.

It doesn't need to. That is already optional.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:

> I don't think we need or should allow running regular queries before
> entering "replication mode". The backend should become a walsender
> process directly after authentication.

+1

> Standby -> primary
>
> RequestWAL <begin> <end>
> Primary should respond with a WALRange message containing the given
> range of WAL data.
>
> StartReplication <begin>
> Primary should send all already-generated WAL beginning from <begin>,
> and then keep sending as it's generated.

Can you give a quick example of how these would be used?

Fujii-san and others considered that having replication start early was an important requirement. If we do these operations serially on the same connection
* copy all bulk data
* start streaming
then there is a considerable delay before replication can begin. In the case of some large sites, perhaps as long as 18-24 hours.

> ReplicatedUpTo <end>
> Acknowledge that all WAL up to <end> has been successfully received and
> written to disk and/or fsync'd (depending on the replication mode in
> use). The primary can use this information to acknowledge a transaction
> as committed to the client in case of synchronous replication.

+1

> Primary -> standby
>
> WALRange <begin> <end> <data>
> Response to RequestWAL or StartReplication message. After receiving a
> StartReplication message, the primary can send these messages when it
> feels like it. In synchronous mode, that would be at least at each
> commit. The standby should respond with a ReplicatedUpTo message to
> each WALRange message.

+1

> (later) RequestBaseBackup
> Request a new base backup to be sent. This can be used to initialize a
> new slave.
>
> (later) BaseBackup <data>
> A base backup, in response to RequestBaseBackup message. For example,
> in .tar.gz format.

Experience from Slony shows that single-threading the initial data send is not a great idea for large databases, since it limits the bandwidth even if you have more available. (Slony has no choice because of its current single-transaction => single-thread requirement.) Being able to take the base backup in parallel is an important feature with large databases.

I think we need to offer an option here rather than force use of a single thread, though I would agree that a single thread may be the more convenient option for many people.

Rumour has it that Slony might move towards a synchronisation that uses a base backup and PITR as its starting point.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
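One possible way the two messages could be combined, sketched very roughly: the standby opens two connections, starts streaming from the primary's current position on one straight away, and fetches the older WAL range in bulk on the other, so that streaming is not delayed by the bulk copy. Everything below — the function names, the thread split, the positions — is invented for illustration and is not part of the patch.

/* Rough sketch only: RequestWAL()/StartReplication() are stand-ins for the
 * proposed messages, stubbed out so the example is self-contained. */
#include <pthread.h>
#include <stdio.h>
#include <inttypes.h>

typedef uint64_t XLogPtr;

static const XLogPtr backup_start = 1000;   /* WAL position the base backup needs */
static const XLogPtr stream_start = 5000;   /* primary's current insert position */

/* Stub for the proposed "RequestWAL <begin> <end>" message */
static void RequestWAL(XLogPtr begin, XLogPtr end)
{
    printf("connection 1: bulk-copy WAL %" PRIu64 "..%" PRIu64 "\n", begin, end);
}

/* Stub for the proposed "StartReplication <begin>" message */
static void StartReplication(XLogPtr begin)
{
    printf("connection 2: stream WAL from %" PRIu64 " onwards\n", begin);
}

static void *catchup_thread(void *arg)
{
    (void) arg;
    RequestWAL(backup_start, stream_start);  /* hours of history, copied in bulk */
    return NULL;
}

int main(void)
{
    pthread_t t;

    /* Start the bulk catch-up and the live stream concurrently, so no WAL is
     * missed and replication "starts early" even on a large database. */
    pthread_create(&t, NULL, catchup_thread, NULL);
    StartReplication(stream_start);
    pthread_join(t, NULL);
    return 0;
}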
Hi,

Thanks for clarifying!

On Wed, Dec 24, 2008 at 2:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:
>
>> I don't think we need or should allow running regular queries before
>> entering "replication mode". The backend should become a walsender
>> process directly after authentication.
>
> +1

OK, I will re-examine it. But, at the least, we need to send a ReadyForQuery message after authentication and before sending WAL, because walreceiver uses libpq (PQsetdbLogin), which doesn't return until it receives ReadyForQuery.

>> Standby -> primary
>>
>> RequestWAL <begin> <end>
>> Primary should respond with a WALRange message containing the given
>> range of WAL data.
>>
>> StartReplication <begin>
>> Primary should send all already-generated WAL beginning from <begin>,
>> and then keep sending as it's generated.
>
> Can you give a quick example of how these would be used?
>
> Fujii-san and others considered that having replication start early was
> an important requirement. If we do these operations serially on the same
> connection
> * copy all bulk data
> * start streaming
> then there is a considerable delay before replication can begin. In the
> case of some large sites, perhaps as long as 18-24 hours.

Agreed. In a very busy system, if those operations are performed serially, we might not be able to start streaming at all: the rate at which WAL is generated might be higher than the rate at which we can copy it.

>> ReplicatedUpTo <end>
>> Acknowledge that all WAL up to <end> has been successfully received and
>> written to disk and/or fsync'd (depending on the replication mode in
>> use). The primary can use this information to acknowledge a transaction
>> as committed to the client in case of synchronous replication.
>
> +1

Yes.

>> Primary -> standby
>>
>> WALRange <begin> <end> <data>
>> Response to RequestWAL or StartReplication message. After receiving a
>> StartReplication message, the primary can send these messages when it
>> feels like it. In synchronous mode, that would be at least at each
>> commit. The standby should respond with a ReplicatedUpTo message to
>> each WALRange message.
>
> +1

Currently, <begin> is not sent because it can be calculated from <end> and the data length. This decreases network traffic to some degree.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
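To illustrate that last point: with only <end> and the payload on the wire, the receiver can reconstruct <begin> itself. The flat 64-bit position below is a simplification (8.4's XLogRecPtr is split into xlogid/xrecoff, so real code would have to handle log-file boundaries), and the function name is made up for the sketch.

#include <stdint.h>
#include <assert.h>

typedef uint64_t XLogPtr;   /* simplified flat WAL position */

/* Reconstruct the start of a WALRange from its end and payload length. */
static XLogPtr
wal_range_begin(XLogPtr end, uint32_t datalen)
{
    assert(end >= datalen);      /* the range cannot start before position 0 */
    return end - datalen;        /* <begin> is implied, so it need not be sent */
}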
Simon Riggs wrote:
> On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:
>> (later) OldestXmin <xid>
>> When a hot standby server is running read-only queries, indicates the
>> current OldestXmin on the standby. The primary can refrain from
>> vacuuming tuples still required by the slave using this value, if so
>> configured.
>
> This is all reading like you are relaying someone else's thoughts, or
> those of a committee.

No, I can assure you all the confusing words are from my head only :-).

> The above is the exact opposite of your position on 11 Sep, where you
> said having a matching xmin between primary and standby "makes an awful
> solution for high availability", which Richard, Greg and Robert, at
> least, agreed with explicitly.

It does, for high availability. There are other use cases where it might be desired (spreading the load of read-only queries across servers). And a softer version, where the master only respects the slave's OldestXmin up to a point, is a good compromise for high-availability setups too. I haven't seen any one-size-fits-all solution to this issue, so we have to cater for many.

Note that I proposed this exact scheme, where the slave sends its OldestXmin to the master, at the bottom of that same email.

>> That will ensure that the standby doesn't need to stall WAL
>> application because of read-only queries.
>
> It doesn't need to. That is already optional.

Oh, right. I should've added, "without having to kill queries".

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2008-12-29 at 13:02 +0200, Heikki Linnakangas wrote:

> I haven't seen any one-size-fits-all solution to this issue, so we
> have to cater for many.

Very much agree. I've had the chance to speak to many people about the way they would like this to work and there is definitely no consensus among those users. So a variety of approaches is appropriate.

> Note that I proposed this exact scheme, where the slave sends its
> OldestXmin to the master, at the bottom of that same email.

Anyway, as long as it is optional, I see no problem in including it, since we have other mechanisms to choose from and nobody is forced to use this one. The design/implementation for this is fairly easy, I think.

The difficulty is arriving at an easy-to-use control mechanism that is also secure.

The options for handling a conflict are these:
1. Ignore the conflict (and allow silent wrong answers)
2. Allow the conflicting query to progress until it sees changed data
3. Cancel the query
4. Prevent applying WAL
5. Feed OldestXmin back to primary to prevent conflicting WAL

The current mechanism is (4) for up to max_standby_delay, then (3).

(4) and (5) are both system-wide effects: (4) a system-wide effect on the standby and (5) a system-wide effect on the primary. In both of those cases the option should be superuser-only controlled. I would be unhappy to think that a normal standby user could create difficult-to-diagnose problems on the primary.

So I see a problem in making (5) optional and superuser-controlled.

One way around this is to have the option turned on|off via a function, which can then be granted to other users.

That for me is beginning to sound fairly ugly: difficult to understand and difficult to use. But I see some people might want that in certain circumstances, so I guess we should build it. Any good ideas for the control mechanism?

I now think we should provide (2) as well, in addition to this.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> The difficulty is arriving at an easy-to-use control mechanism that is
> also secure.
>
> The options for handling a conflict are these:
> 1. Ignore the conflict (and allow silent wrong answers)
> 2. Allow the conflicting query to progress until it sees changed data
> 3. Cancel the query
> 4. Prevent applying WAL
> 5. Feed OldestXmin back to primary to prevent conflicting WAL
>
> The current mechanism is (4) for up to max_standby_delay, then (3).
>
> (4) and (5) are both system-wide effects: (4) a system-wide effect on
> the standby and (5) a system-wide effect on the primary. In both of
> those cases the option should be superuser-only controlled. I would be
> unhappy to think that a normal standby user could create
> difficult-to-diagnose problems on the primary.
>
> So I see a problem in making (5) optional and superuser-controlled.
>
> One way around this is to have the option turned on|off via a function,
> which can then be granted to other users.
>
> That for me is beginning to sound fairly ugly: difficult to understand
> and difficult to use. But I see some people might want that in certain
> circumstances, so I guess we should build it. Any good ideas for the
> control mechanism?

Using functions seems overly complicated. Since xids are system-wide, I don't see much value in specifying them at any finer level, or in allowing them for non-superusers. GUC seems like the natural choice.

I think the options you have in the patch now, and max_standby_delay to control it, are enough for this release.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, 2008-12-30 at 14:40 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>>
>> That for me is beginning to sound fairly ugly: difficult to understand
>> and difficult to use. But I see some people might want that in certain
>> circumstances, so I guess we should build it. Any good ideas for the
>> control mechanism?
>
> Using functions seems overly complicated.

We agree on that.

> Since xids are system-wide, I don't see much value in specifying them
> at any finer level, or in allowing them for non-superusers. GUC seems
> like the natural choice.

Well, GUCs have security implications that I'm not happy about. I will relent if you will vouch for that decision.

"standby_xmin_on_primary" (boolean) - a USERSET GUC that only has meaning during standby query execution. <name> specifies whether the current standby session's xmin is included in the calculation of OldestXmin on the *primary* node. If this parameter is true then the standby query will never be cancelled because of conflicts between the activity of the primary and the standby (see discussion in chapter XXXX). The downside of using this parameter is that standby queries can cause table bloat on the primary (see the Data Maintenance chapter for more detail).

"standby_xmin_on_primary" - new name sought. I think it should begin with "standby_" to remind us that it only affects standby query processing.

Implementation: WALReceiver will send a message back to WALSender. WALSender will update a single 4-byte value, RemoteXmin, that is read during GetSnapshotData(). Updating the value will not hold a lock, just as an xid is not locked when a new value is set.

We add a boolean to each proc: SendRemoteXmin. When we run GetSnapshotData(), if our own proc has SendRemoteXmin set, then we calculate RemoteXmin from the minimum of any proc with SendRemoteXmin set. When we release our snapshot we re-calculate RemoteXmin, so that the primary node suffers as little delay as possible in receiving updates to xmin.

I'll begin work on this once sync rep is committed. It's about 3-5 days of work, but there's no point in writing it yet because the sand will shift underneath it too much in the next few weeks.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
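A standalone sketch of that description, keeping the names RemoteXmin and SendRemoteXmin from the mail but none of the real PostgreSQL structures: procs and xids are plain integers, and xid wraparound and locking are ignored. The first function is the standby-side recalculation done when snapshots are taken or released; the second is the primary-side use of the fed-back value when computing its own OldestXmin. It is an illustration of the idea, not the eventual patch.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define MAX_PROCS 8

typedef struct
{
    TransactionId xmin;             /* session's snapshot xmin, 0 = no snapshot */
    bool          send_remote_xmin; /* the proposed standby_xmin_on_primary, per session */
} FakeProc;

static FakeProc      procs[MAX_PROCS];   /* stand-in for the standby's proc array */
static TransactionId RemoteXmin = 0;     /* value walreceiver feeds back to the walsender */

/* Standby side: recompute RemoteXmin as the minimum xmin of the opted-in
 * sessions, e.g. whenever a snapshot is taken or released. */
static void
recompute_remote_xmin(void)
{
    TransactionId result = 0;

    for (int i = 0; i < MAX_PROCS; i++)
    {
        if (procs[i].send_remote_xmin && procs[i].xmin != 0 &&
            (result == 0 || procs[i].xmin < result))
            result = procs[i].xmin;
    }
    RemoteXmin = result;    /* 0 means "no standby constraint on the primary" */
}

/* Primary side: clamp the locally computed OldestXmin with the fed-back value,
 * so VACUUM keeps tuples the standby's queries still need. */
static TransactionId
clamp_oldest_xmin(TransactionId local_oldest_xmin)
{
    if (RemoteXmin != 0 && RemoteXmin < local_oldest_xmin)
        return RemoteXmin;
    return local_oldest_xmin;
}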
On Mon, 2008-12-29 at 13:02 +0200, Heikki Linnakangas wrote:
>>> That will ensure that the standby doesn't need to stall WAL
>>> application because of read-only queries.
>>
>> It doesn't need to. That is already optional.
>
> Oh, right. I should've added, "without having to kill queries".

Even killing queries is optional, though it will need help from an external filesystem-level snapshot feature.

--------------
Hannu