Thread: Synch Rep for CommitFest 2009-07
Hi,

On Fri, Jul 3, 2009 at 1:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> This patch no longer applies cleanly. Can you rebase and resubmit it
>> for the upcoming CommitFest? It might also be good to go through and
>> clean up the various places where you have trailing whitespace and/or
>> spaces preceding tabs.
>
> Sure. I'll resubmit the patch after fixing some bugs and finishing
> the documents.

Here is the updated version of the Synch Rep patch. I adjusted the patch
against CVS HEAD, fixed some bugs and updated the documents.

The attached tarball contains several patches, split up so that they are
easy to review. A description of each patch, a brief procedure for setting
up Synch Rep and a functional overview are on the wiki:
http://wiki.postgresql.org/wiki/NTT's_Development_Projects

If you notice anything, please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Fri, Jul 3, 2009 at 1:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> This patch no longer applies cleanly. Can you rebase and resubmit it
>>> for the upcoming CommitFest? It might also be good to go through and
>>> clean up the various places where you have trailing whitespace and/or
>>> spaces preceding tabs.
>> Sure. I'll resubmit the patch after fixing some bugs and finishing
>> the documents.
>
> Here is the updated version of the Synch Rep patch. I adjusted the patch
> against CVS HEAD, fixed some bugs and updated the documents.
>
> The attached tarball contains several patches, split up so that they are
> easy to review. A description of each patch, a brief procedure for setting
> up Synch Rep and a functional overview are on the wiki:
> http://wiki.postgresql.org/wiki/NTT's_Development_Projects
>
> If you notice anything, please feel free to comment!

Here's one little thing in addition to all the stuff already discussed:
the only caller that doesn't pass XLogSyncReplication as the new 'mode'
argument to XLogFlush is this one in CreateCheckPoint:

***************
*** 6569,6575 ****
                          XLOG_CHECKPOINT_ONLINE,
                          &rdata);

!     XLogFlush(recptr);

      /*
       * We mustn't write any new WAL after a shutdown checkpoint, or it will
--- 7667,7677 ----
                          XLOG_CHECKPOINT_ONLINE,
                          &rdata);

!     /*
!      * Don't shutdown until all outstanding xlog records are replicated and
!      * fsynced on the standby, regardless of synchronization mode.
!      */
!     XLogFlush(recptr, shutdown ? REPLICATION_MODE_FSYNC : XLogSyncReplication);

      /*
       * We mustn't write any new WAL after a shutdown checkpoint, or it will

If that's the only such caller, let's introduce a new function for that
and keep the XLogFlush() API unchanged.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Hi,

On Wed, Jul 15, 2009 at 3:56 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's one little thing in addition to all the stuff already discussed:

Thanks for the comment!

> If that's the only such caller, let's introduce a new function for that
> and keep the XLogFlush() API unchanged.

OK. How about the following function?

------------------
/*
 * Ensure that shutdown-related XLOG data through the given position is
 * flushed to local disk, and also flushed to the disk in the standby
 * if replication is in progress.
 */
void
XLogShutdownFlush(XLogRecPtr record)
{
    int     save_mode = XLogSyncReplication;

    XLogSyncReplication = REPLICATION_MODE_FSYNC;
    XLogFlush(record);
    XLogSyncReplication = save_mode;
}
------------------

In the shutdown checkpoint case, CreateCheckPoint calls XLogShutdownFlush,
otherwise XLogFlush. And XLogFlush uses XLogSyncReplication directly
instead of the obsolete 'mode' argument.

If the above is OK, should I update the patch ASAP, or suspend the update
until other comments arrive? I'm concerned that frequent small updates
will interfere with review.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
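For reference, the resulting call site in CreateCheckPoint would then
presumably look roughly like this; a minimal sketch of the dispatch
described above, not the actual patch:

------------------
/*
 * Sketch only: choose between the ordinary flush and the proposed
 * shutdown variant.  Assumes the 'shutdown' flag and 'recptr' already
 * present in CreateCheckPoint.
 */
if (shutdown)
    XLogShutdownFlush(recptr);  /* forces REPLICATION_MODE_FSYNC */
else
    XLogFlush(recptr);          /* uses XLogSyncReplication directly */
------------------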
On Wed, Jul 15, 2009 at 12:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> If the above is OK, should I update the patch ASAP, or suspend the update
> until other comments arrive? I'm concerned that frequent small updates
> will interfere with review.

I decided (perhaps foolishly) to assign reviewers for the two smaller
patches that you extracted from this first, and to hold off on assigning
a reviewer for the main patch until those reviews were completed:

http://archives.postgresql.org/message-id/3f0b79eb0907022341m1d36a841x19c3e2a5a6906b5b@mail.gmail.com
http://archives.postgresql.org/message-id/3f0b79eb0907030037g515f3337o9092279c62348dc@mail.gmail.com

So I think you should update ASAP in this case. As soon as we get some
reviewers freed up from the initial reviewing round, I will assign one or
more reviewers to the main Sync Rep patch.

...Robert
Hi,

On Wed, Jul 15, 2009 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So I think you should update ASAP in this case.

I updated the patch as described in
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00865.php

All the other parts are still the same.

> As soon as we get some reviewers freed up from the initial reviewing
> round, I will assign one or more reviewers to the main Sync Rep patch.

Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Wed, Jul 15, 2009 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So I think you should update ASAP in this case.
>
> I updated the patch as described in
> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00865.php
>
> All the other parts are still the same.
>
>> As soon as we get some reviewers freed up from the initial reviewing
>> round, I will assign one or more reviewers to the main Sync Rep patch.
>
> Thanks!

I don't think there's much point assigning more reviewers to Synch Rep at
this point. I believe we have consensus on four major changes:

1. Change the way synchronization is done when the standby connects to the
primary. After authentication, the standby should send a message to the
primary, stating the <begin> point (where <begin> is an XLogRecPtr, not a
WAL segment name). The primary starts streaming WAL from that point, and
keeps streaming forever. pg_read_xlogfile() needs to be removed.

2. The primary should have no business reading back from the archive. The
standby can read from the archive, as it can today.

3. Need to support multiple WALSenders. While multiple-slave support isn't
1st priority right now, it's not acceptable that a new WALSender can't
connect while one is active already. That can cause trouble in case of
network problems etc.

4. It is not acceptable that normal backends have to wait for walsender to
send data. That means that connecting a standby behind a slow connection
to the primary can grind the primary to a halt. walsender needs to be able
to read data from disk, not just from shared memory. (I raised this back
in December:
http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

Those 4 things are big enough changes that I don't think there's much left
to review that won't be affected by those changes.

As a hint, I think you'll find it a lot easier if you implement only
asynchronous replication at first. That reduces the amount of
inter-process communication a lot. You can then add synchronous capability
in a later commitfest. I would also suggest that for point 4, you
implement the WAL sender so that it *only* reads from disk at first, and
only add the capability to send from wal_buffers later on, and only if
performance testing shows that it's needed.

I'll move this to the "returned with feedback" section, but if you get
those things done quickly we can still give it another round of review in
this commitfest.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
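To make point 1 concrete, the handshake could be as small as a single
message carrying the standby's starting position. The struct below is a
hypothetical sketch; the type and field names are illustrative, not taken
from the patch:

------------------
#include <stdint.h>

/*
 * Hypothetical start-streaming request for point 1: after
 * authentication the standby sends this once, and the primary streams
 * WAL from <begin> onwards, forever.
 */
typedef struct StartStreamingRequest
{
    uint32_t    xlogid;     /* high half of the XLogRecPtr <begin> */
    uint32_t    xrecoff;    /* low half: byte offset within the log */
} StartStreamingRequest;
------------------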
On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

Sorry to insist, but I'm not sold on your consensus here, yet:
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php

There's a true need for the solution to be simple to install, and
providing a side channel for the standby to go read the archives itself
isn't it.

Furthermore, the counter-argument against having the primary able to send
data from the archives to some standby is that it should still work when
the primary is dead. But as this is only done in the setup phase, I don't
see that being able to continue preparing a not-yet-ready standby against
a dead primary is buying us anything.

Now, I tried proposing to implement an archive server as a postmaster
child, to have a reference implementation of an archive command for
"basic" cases and provide the ability to give data from the archive to
slave(s). But this is getting too much into the implementation details for
my current understanding of them :)

Regards,
--
dim
Dimitri Fontaine wrote:
> On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
>> 2. The primary should have no business reading back from the archive.
>> The standby can read from the archive, as it can today.
>
> Sorry to insist, but I'm not sold on your consensus here, yet:
> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php
>
> There's a true need for the solution to be simple to install, and
> providing a side channel for the standby to go read the archives itself
> isn't it.

I think a better way to address that need is to provide a built-in
mechanism for the standby to request a base backup and have it sent over
the wire. That makes the initial setup very easy.

> Furthermore, the counter-argument against having the primary able to
> send data from the archives to some standby is that it should still work
> when the primary is dead. But as this is only done in the setup phase, I
> don't see that being able to continue preparing a not-yet-ready standby
> against a dead primary is buying us anything.

The situation arises also when the standby falls badly behind. A simple
solution to that is to add a switch in the master to specify "always keep
X MB of WAL in pg_xlog". The standby will then still find the WAL it needs
in pg_xlog, making it harder for a standby to fall so far behind that it
can't find that WAL in the primary anymore. Tom suggested that we can just
give up and re-sync with a new base backup, but that really requires a
built-in base backup capability, and is only practical for small
databases.

I think we should definitely have both those features, but it's not
urgent. The replication works without them, although it requires that you
set up traditional archiving as well.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
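As a sketch of the "always keep X MB of WAL in pg_xlog" switch, the
recycling decision reduces to a simple window check. The names below are
hypothetical, and the real wiring into checkpoint recycling would differ:

------------------
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical retention check: a WAL segment may be recycled only
 * once its end has fallen more than keep_mb megabytes behind the
 * current insert position.
 */
static bool
CanRecycleSegment(uint64_t seg_end_byte, uint64_t insert_byte,
                  uint64_t keep_mb)
{
    uint64_t keep_bytes = keep_mb * 1024 * 1024;

    return insert_byte > seg_end_byte &&
           insert_byte - seg_end_byte > keep_bytes;
}
------------------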
Hi,

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I think a better way to address that need is to provide a built-in
> mechanism for the standby to request a base backup and have it sent over
> the wire. That makes the initial setup very easy.

Great idea :)

So I'll reproduce the sketch I did in this other mail, adding the 'base'
state where the prerequisite base backup is handled, which will help
clarify the next points:

0. base: slave asks the master for a base backup; at the end of this it
   reaches the base-lsn

1. init: slave asks the master for the current LSN and starts streaming
   WAL

2. setup: slave asks the master for the missing WALs from its base-lsn to
   the LSN it just got, and applies them all to reach the initial LSN
   (this happens in parallel to 1.)

3. catchup: slave has replayed the missing WALs and is now replaying the
   stream it received in parallel, which applies from the init LSN (just
   reached)

4. sync: slave is applying the stream as it gets it, either as part of
   the master transaction or not, depending on the GUC settings

> The situation arises also when the standby falls badly behind. A simple
> solution to that is to add a switch in the master to specify "always
> keep X MB of WAL in pg_xlog". The standby will then still find the WAL
> it needs in pg_xlog, making it harder for a standby to fall so far
> behind that it can't find that WAL in the primary anymore. Tom suggested
> that we can just give up and re-sync with a new base backup, but that
> really requires a built-in base backup capability, and is only practical
> for small databases.

I think that when the standby is back in business after a connection
glitch (or any other transient error), its current internal state is
still 'sync' and walreceiver asks for the next LSN (RedoPTR?). Now, 2
cases are possible:

a. the primary still has it handy, so the standby is still in sync but
   lagging behind (and the primary knows by how much)

b. the primary is not able to provide the requested WAL entry, so the
   slave is back in the 'setup' state, with base-lsn the point reached
   just before losing sync (the one walreceiver just asked for).

Now, a standby in the 'setup' state isn't ready (yet), and for example
synchronous replication won't be possible in this state: we can't ask the
primary to refuse to COMMIT any transaction (holding it, e.g.) while a
standby hasn't reached the 'sync' state.

The way you're talking about the issue makes me think there's a mix-up
between how to handle a lagging standby and an out-of-sync standby. For
clarity, I think we should have very distinct states and responses. And
yes, as Tom and you keep saying, a synced standby by definition should
not need any access to its primary's archives. So if it does, it's no
longer in sync.

> I think we should definitely have both those features, but it's not
> urgent. The replication works without them, although it requires that
> you set up traditional archiving as well.

Agreed, it's not essential for the feature as far as hackers are
concerned.

Regards,
--
dim
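The five states above map naturally onto a small state machine; as an
illustrative encoding (the names are hypothetical, not from any patch):

------------------
/* Illustrative encoding of the standby states sketched above. */
typedef enum StandbyState
{
    STANDBY_BASE,       /* 0. taking the base backup, up to base-lsn */
    STANDBY_INIT,       /* 1. got the current LSN, streaming started */
    STANDBY_SETUP,      /* 2. replaying archived WAL: base-lsn to init LSN */
    STANDBY_CATCHUP,    /* 3. replaying the buffered stream from init LSN */
    STANDBY_SYNC        /* 4. applying the stream as it arrives */
} StandbyState;
------------------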
Hi,

On Thu, Jul 16, 2009 at 6:03 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I don't think there's much point assigning more reviewers to Synch Rep
> at this point. I believe we have consensus on four major changes:

Thanks for clarifying the issues! Okay, I'll rework the patch.

> 1. Change the way synchronization is done when the standby connects to
> the primary. After authentication, the standby should send a message to
> the primary, stating the <begin> point (where <begin> is an XLogRecPtr,
> not a WAL segment name). The primary starts streaming WAL from that
> point, and keeps streaming forever. pg_read_xlogfile() needs to be
> removed.

I assume that <begin> should indicate the location of the last valid
record. In other words, at first the standby tries to recover by using
only the XLOG files which exist in its archive or pg_xlog. When it has
reached the last valid record, it requests from the primary the XLOG
records which follow <begin>. Is my understanding OK?

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
As I described before, the XLOG file which the standby creates should be
recoverable. So, when <begin> indicates the middle of an XLOG file,
should the primary start sending the records from the head of the file
containing <begin>?

Or should the primary start from <begin>? In this case, since we can
expect that the incomplete file containing <begin> also exists on the
standby, the records following <begin> would need to be appended to it.
And if that incomplete file is one restored from the archive, it would
need to be renamed from its temporary name before being appended to.

A timeline/backup history file is also required for recovery, but it's
not found on the standby. So, these need to be shipped from the primary,
and this capability is provided by pg_read_xlogfile(). If we remove the
function, how should we transfer those history files? Is a function
similar to pg_read_xlogfile(), where the filename needs to be specified,
still necessary?

> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

In this case, a backup history file should be kept in pg_xlog for a
while, because it might be requested by the standby. So far,
pg_start_backup() has removed the previous backup history file
immediately. Should we introduce a new GUC parameter to determine how
many backup history files may exist in pg_xlog?

CHECKPOINT should not recycle the XLOG files following the file which is
currently requested by the standby. So, we need to tweak the recycling
policy.

> 3. Need to support multiple WALSenders. While multiple-slave support
> isn't 1st priority right now, it's not acceptable that a new WALSender
> can't connect while one is active already. That can cause trouble in
> case of network problems etc.

Sorry, I didn't get your point. You think multiple-slave support isn't
1st priority, and yet why is a multiple-walsender mechanism necessary?
Can you describe the problem cases in more detail?

> 4. It is not acceptable that normal backends have to wait for walsender
> to send data.

Umm... this is true in the asynchronous replication case, and also while
the standby is catching up with the primary. After those servers get into
synchronization, the backend should wait for walsender to send data (and
also for walreceiver to write/fsync data) before returning "success" for
COMMIT to the client. Is my understanding right?

In the current Synch Rep, the backend basically doesn't wait for
walsender in asynchronous mode. But when wal_buffers is filled with
unsent data, the backend waits for walsender to send data because there
is no room to insert new data. Are you suggesting only that this problem
case should be solved?

> That means that connecting a standby behind a slow
> connection to the primary can grind the primary to a halt.

That is the fate of *synchronous* replication, isn't it? If a user wants
to get around such a problem, asynchronous mode should be chosen, I
think.

> walsender needs to be able to read data from disk, not just from shared
> memory. (I raised this back in December:
> http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

OK, I'll try it.

> As a hint, I think you'll find it a lot easier if you implement only
> asynchronous replication at first. That reduces the amount of
> inter-process communication a lot. You can then add synchronous
> capability in a later commitfest. I would also suggest that for point 4,
> you implement the WAL sender so that it *only* reads from disk at first,
> and only add the capability to send from wal_buffers later on, and only
> if performance testing shows that it's needed.

Sounds good. I'll advance the development in stages as you suggested.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Thu, Jul 16, 2009 at 6:03 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> 1. Change the way synchronization is done when the standby connects to
>> the primary. After authentication, the standby should send a message to
>> the primary, stating the <begin> point (where <begin> is an XLogRecPtr,
>> not a WAL segment name). The primary starts streaming WAL from that
>> point, and keeps streaming forever. pg_read_xlogfile() needs to be
>> removed.
>
> I assume that <begin> should indicate the location of the last valid
> record. In other words, at first the standby tries to recover by using
> only the XLOG files which exist in its archive or pg_xlog. When it has
> reached the last valid record, it requests from the primary the XLOG
> records which follow <begin>. Is my understanding OK?

Yes.

> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
> As I described before, the XLOG file which the standby creates should be
> recoverable. So, when <begin> indicates the middle of an XLOG file,
> should the primary start sending the records from the head of the file
> containing <begin>?
>
> Or should the primary start from <begin>? In this case, since we can
> expect that the incomplete file containing <begin> also exists on the
> standby, the records following <begin> would need to be appended to it.

I would expect the standby to append to the partial XLOG file.

> And if that incomplete file is one restored from the archive, it would
> need to be renamed from its temporary name before being appended to.

The archive should not normally contain partial XLOG files; it will only
do so if you manually copy one there after the primary has crashed. So I
don't think that's something we need to support.

> A timeline/backup history file is also required for recovery, but it's
> not found on the standby. So, these need to be shipped from the primary,
> and this capability is provided by pg_read_xlogfile(). If we remove the
> function, how should we transfer those history files? Is a function
> similar to pg_read_xlogfile(), where the filename needs to be specified,
> still necessary?

Hmm. You only need the timeline history file if the base backup was taken
in an earlier timeline. That situation would only arise if you (manually)
take a base backup, restore it to a server (which creates a new
timeline), and then create a slave against that server. At least in the
1st phase, I think we can assume that the standby has access to the same
archive, and will find the history file from there. If not, throw an
error. We can add more bells and whistles later.

> CHECKPOINT should not recycle the XLOG files following the file which is
> currently requested by the standby. So, we need to tweak the recycling
> policy.

Yep.

>> 3. Need to support multiple WALSenders. While multiple-slave support
>> isn't 1st priority right now, it's not acceptable that a new WALSender
>> can't connect while one is active already. That can cause trouble in
>> case of network problems etc.
>
> Sorry, I didn't get your point. You think multiple-slave support isn't
> 1st priority, and yet why is a multiple-walsender mechanism necessary?
> Can you describe the problem cases in more detail?

As the patch stands, new walsender connections are refused when one is
active already. What if the walsender connection is in a zombie state?
For example, it's trying to send WAL to the slave, but the network
connection is down, and the packets are going to a black hole. It will
take a while for the TCP layer to declare the connection dead and close
the socket. During that time, you can't connect a new slave to the
master, or reconnect the same slave using a better network connection.

The most robust way to fix that is to support multiple walsenders. The
zombie walsender can take its time to die, while a new walsender serves
the new connection. You could tweak SO_TIMEOUTs and stuff, but even then
the standby process could be in some weird hung state.

And of course, when we get around to adding support for multiple slaves,
we'll have to do that anyway. Better to get it right to begin with.

>> 4. It is not acceptable that normal backends have to wait for walsender
>> to send data.
>
> Umm... this is true in the asynchronous replication case, and also while
> the standby is catching up with the primary. After those servers get
> into synchronization, the backend should wait for walsender to send data
> (and also for walreceiver to write/fsync data) before returning
> "success" for COMMIT to the client. Is my understanding right?

Even in synchronous replication, a backend should only have to wait when
it commits. You would only see the difference with very large
transactions that write more WAL than fits in wal_buffers, though, like
data loading.

> In the current Synch Rep, the backend basically doesn't wait for
> walsender in asynchronous mode. But when wal_buffers is filled with
> unsent data, the backend waits for walsender to send data because there
> is no room to insert new data. Are you suggesting only that this problem
> case should be solved?

Right, that is the problem.

>> That means that connecting a standby behind a slow
>> connection to the primary can grind the primary to a halt.
>
> That is the fate of *synchronous* replication, isn't it? If a user wants
> to get around such a problem, asynchronous mode should be chosen, I
> think.

Right. But as the patch stands, asynchronous mode has the same problem,
which is not acceptable.

> Sounds good. I'll advance the development in stages as you suggested.

Thanks!

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
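The commit-only wait described above can be pictured as follows. This is
a schematic sketch with hypothetical names; the shared-memory plumbing
and proper wakeups are elided, and it is not the patch's actual
mechanism:

------------------
#include <stdint.h>

/*
 * Byte position known to be flushed on the standby, maintained from
 * walreceiver acknowledgements (synchronization elided in this sketch).
 */
extern volatile uint64_t standby_flush_pos;

extern void pg_usleep(long microsec);   /* PostgreSQL sleep primitive */

/*
 * Hypothetical commit-time wait: only here does a backend block on
 * replication, and only until the standby has flushed past the commit
 * record.  Ordinary WAL inserts never wait for the walsender.
 */
static void
WaitForStandbyFlush(uint64_t commit_pos)
{
    while (standby_flush_pos < commit_pos)
        pg_usleep(1000L);       /* poll; a real patch would use signaling */
}
------------------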
On Jul 16, 2009, at 12:07 AM, Heikki Linnakangas wrote:
> Dimitri Fontaine wrote:
>> On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
>> Furthermore, the counter-argument against having the primary able to
>> send data from the archives to some standby is that it should still
>> work when the primary is dead. But as this is only done in the setup
>> phase, I don't see that being able to continue preparing a
>> not-yet-ready standby against a dead primary is buying us anything.
>
> The situation arises also when the standby falls badly behind. A simple
> solution to that is to add a switch in the master to specify "always
> keep X MB of WAL in pg_xlog". The standby will then still find the WAL
> it needs in pg_xlog, making it harder for a standby to fall so far
> behind that it can't find that WAL in the primary anymore. Tom suggested
> that we can just give up and re-sync with a new base backup, but that
> really requires a built-in base backup capability, and is only practical
> for small databases.

If you use an rsync-like algorithm for doing the base backups, wouldn't
that increase the size of the database for which it would still be
practical to just re-sync? Couldn't you in fact sync a very large
database if the amount of actual change in the files was a small
percentage of the total size?
Rick Gigger wrote:
> If you use an rsync-like algorithm for doing the base backups, wouldn't
> that increase the size of the database for which it would still be
> practical to just re-sync? Couldn't you in fact sync a very large
> database if the amount of actual change in the files was a small
> percentage of the total size?

It would certainly help to reduce the network traffic, though you'd still
have to scan all the data to see what has changed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, Jul 16, 2009 at 4:41 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Rick Gigger wrote:
>> If you use an rsync-like algorithm for doing the base backups, wouldn't
>> that increase the size of the database for which it would still be
>> practical to just re-sync? Couldn't you in fact sync a very large
>> database if the amount of actual change in the files was a small
>> percentage of the total size?
>
> It would certainly help to reduce the network traffic, though you'd
> still have to scan all the data to see what has changed.

The fundamental problem with pushing users to start over with a new base
backup is that there's no relationship between the size of the WAL and
the size of the database.

You can plausibly have a system with an extremely high transaction rate
generating WAL very quickly, but where the whole database fits in a few
hundred megabytes. In that case you could be behind by only a few minutes
and have it be faster to take a new base backup.

Or you could have a petabyte database which is rarely updated, in which
case it might be faster to apply weeks' worth of logs than to try to take
a base backup.

Only the sysadmin is actually going to know which makes more sense,
unless we start tying WAL parameters to the database size or something
like that.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Jul 16, 2009, at 11:09 AM, Greg Stark wrote:
> On Thu, Jul 16, 2009 at 4:41 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Rick Gigger wrote:
>>> If you use an rsync-like algorithm for doing the base backups,
>>> wouldn't that increase the size of the database for which it would
>>> still be practical to just re-sync? Couldn't you in fact sync a very
>>> large database if the amount of actual change in the files was a small
>>> percentage of the total size?
>>
>> It would certainly help to reduce the network traffic, though you'd
>> still have to scan all the data to see what has changed.
>
> The fundamental problem with pushing users to start over with a new base
> backup is that there's no relationship between the size of the WAL and
> the size of the database.
>
> You can plausibly have a system with an extremely high transaction rate
> generating WAL very quickly, but where the whole database fits in a few
> hundred megabytes. In that case you could be behind by only a few
> minutes and have it be faster to take a new base backup.
>
> Or you could have a petabyte database which is rarely updated, in which
> case it might be faster to apply weeks' worth of logs than to try to
> take a base backup.
>
> Only the sysadmin is actually going to know which makes more sense,
> unless we start tying WAL parameters to the database size or something
> like that.

Once again, wouldn't an rsync-like algorithm help here? Couldn't you have
the default be to just create a new base backup for them, but then allow
you to specify an existing base backup if you've already got one?
On Thu, Jul 16, 2009 at 1:09 PM, Greg Stark <gsstark@mit.edu> wrote:
> The fundamental problem with pushing users to start over with a new base
> backup is that there's no relationship between the size of the WAL and
> the size of the database.
>
> You can plausibly have a system with an extremely high transaction rate
> generating WAL very quickly, but where the whole database fits in a few
> hundred megabytes. In that case you could be behind by only a few
> minutes and have it be faster to take a new base backup.
>
> Or you could have a petabyte database which is rarely updated, in which
> case it might be faster to apply weeks' worth of logs than to try to
> take a base backup.
>
> Only the sysadmin is actually going to know which makes more sense,
> unless we start tying WAL parameters to the database size or something
> like that.

I think we need a way for the master to know who its slaves are, and to
keep any given bit of WAL available until all slaves have successfully
read it, just as we keep each WAL file until we successfully copy it to
the archive. Otherwise, there's no way to be sure that a connection break
won't result in the need for a new base backup. (In a way, a slave is
very similar to an additional archive.)

...Robert
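Robert's rule amounts to retaining WAL up to the minimum confirmed
position across all consumers, treating the archive as one more consumer.
A minimal sketch with hypothetical names:

------------------
#include <stdint.h>

/*
 * Hypothetical retention rule: WAL before the returned position has
 * been read by every slave and copied to the archive, so only that
 * much may be recycled.
 */
static uint64_t
OldestNeededWal(const uint64_t *slave_confirmed, int nslaves,
                uint64_t archive_confirmed)
{
    uint64_t oldest = archive_confirmed;

    for (int i = 0; i < nslaves; i++)
        if (slave_confirmed[i] < oldest)
            oldest = slave_confirmed[i];

    return oldest;
}
------------------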
Hi,

On Thu, Jul 16, 2009 at 6:00 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The archive should not normally contain partial XLOG files; it will only
> do so if you manually copy one there after the primary has crashed. So I
> don't think that's something we need to support.

You are right. And if the last valid record exists in the middle of the
restored file (e.g. because of an XLOG_SWITCH record), <begin> should
indicate the head of the next file.

> Hmm. You only need the timeline history file if the base backup was
> taken in an earlier timeline. That situation would only arise if you
> (manually) take a base backup, restore it to a server (which creates a
> new timeline), and then create a slave against that server. At least in
> the 1st phase, I think we can assume that the standby has access to the
> same archive, and will find the history file from there. If not, throw
> an error. We can add more bells and whistles later.

Okay, I'll put the history file problem aside for possible later
consideration.

> As the patch stands, new walsender connections are refused when one is
> active already. What if the walsender connection is in a zombie state?
> For example, it's trying to send WAL to the slave, but the network
> connection is down, and the packets are going to a black hole. It will
> take a while for the TCP layer to declare the connection dead and close
> the socket. During that time, you can't connect a new slave to the
> master, or reconnect the same slave using a better network connection.
>
> The most robust way to fix that is to support multiple walsenders. The
> zombie walsender can take its time to die, while a new walsender serves
> the new connection. You could tweak SO_TIMEOUTs and stuff, but even then
> the standby process could be in some weird hung state.
>
> And of course, when we get around to adding support for multiple slaves,
> we'll have to do that anyway. Better to get it right to begin with.

Thanks for the detailed description! I was thinking that a new GUC
replication_timeout and some keepalive parameters would be enough to help
with such trouble. But I agree that support for multiple walsenders is
the better solution, so I'll work on that problem.

> Even in synchronous replication, a backend should only have to wait when
> it commits. You would only see the difference with very large
> transactions that write more WAL than fits in wal_buffers, though, like
> data loading.

That's right.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Fri, Jul 17, 2009 at 2:09 AM, Greg Stark <gsstark@mit.edu> wrote:
> Only the sysadmin is actually going to know which makes more sense,
> unless we start tying WAL parameters to the database size or something
> like that.

Agreed. And if a user doesn't want to make a new base backup because of a
large database, s/he can manually copy the archived WAL files to the
standby before starting it, and make it use them for its recovery.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center