Thread: Sync Rep: First Thoughts on Code
Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.

Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't. Examples:

"For example, to make a single multi-statement transaction replication asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction."

Do we mean synchronous_replication in this sentence? I think you've copied the text and not changed all of the necessary parts - please re-read the whole section (probably the whole Wiki, actually).

"wal_writer_delay" - do we mean wal_sender_delay? Is there some ability to measure the amount of data to be sent and avoid the delay altogether, when the server is sufficiently busy?

The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby. I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing. How can we restart the walsender if it shuts down? Do we want a maximum wait for a transaction and a maximum wait for the server?

Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.

How do we specify the user we use to connect to the primary?

Definitely need more explanatory comments/README-style docs.

For example, 03_libpq seems simple and self-contained. I'm not sure why we have a state called PGASYNC_REPLICATION; I was hoping that would be dynamic, but I'm not sure where to look for that. It would be useful to have a very long comment within the code to explain how the replication messages work, and a note on each function saying who the intended client and server is.

02_pqcomm: What does HAVE_POLL mean? Do we need to worry about periodic renegotiation of keys in be-secure.c? Not sure I understand why there are so many new functions in there.

04_recovery_conf is a change I agree with, though I think it may not work with EXEC_BACKEND for Windows.

05... I need some commentary to explain this better.

06 and 07 are large and will take substantial review time. So we must get the overall architecture done first and then check the code that implements that.

08 - I think I get this, but some docs will help to confirm.

09 pg_standby changes: so more changes are coming there? OK. Can we refer to those two options as failover and switchover? There's no need to change definitions that many Postgres people already use. This change can be done without making any change to server behaviour, so it can benefit 8.2 and 8.3 people also.

01_signal_handling: I've looked at the LWLock acquires and releases in the patch and am fairly happy, except for the ProcArrayLock acquire during this sub-patch. Do we really need to do things this way? Is the actual state important? Could we just do this with a counter which cycles? So callers increment the counter atomically and the reader just polls to see if anybody has incremented? Or could we protect that part of the proc with a different lock? Touching ProcArrayLock is bad news.

Anyway, feeling very positive about this. Hope we can get this reviewed and committed in the next 3-4 weeks.
I have many clues as to how to structure my own work also. Thanks.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
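A minimal sketch of the cycling-counter alternative floated above, using plain C11 atomics to stand in for the backend's own primitives; every name below is invented for illustration, nothing is taken from the patch:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_uint signal_counter;  /* wraps around; only changes matter */

    /* signalling side: announce that new state is available */
    static void
    announce_signal(void)
    {
        atomic_fetch_add_explicit(&signal_counter, 1, memory_order_release);
    }

    /* reading side: poll, remembering the last value acted upon */
    static bool
    signal_pending(unsigned *last_seen)
    {
        unsigned cur = atomic_load_explicit(&signal_counter,
                                            memory_order_acquire);

        if (cur == *last_seen)
            return false;
        *last_seen = cur;
        return true;
    }

Because the reader only tests for inequality, counter wraparound is harmless; the point is simply that signallers never need to touch ProcArrayLock.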
Hi, Simon. Thanks for taking many hours to review the code!!

On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't.

Documentation is ongoing. Sorry for my slow progress. BTW, I'm going to add and change the sgml files listed on the wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

> Examples: "For example, to make a single multi-statement transaction replication asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction."
> Do we mean synchronous_replication in this sentence? I think you've copied the text and not changed all of the necessary parts - please re-read the whole section (probably the whole Wiki, actually).

Oops! It's just a typo. Sorry for the confusion. I will revise this section.

> "wal_writer_delay" - do we mean wal_sender_delay?

Yes. I will fix it.

> Is there some ability to measure the amount of data to be sent and avoid the delay altogether, when the server is sufficiently busy?

Why is the former ability required? The latter is possible, I think. We can guarantee that the WAL is sent (in more detail, that send(2) is called) at least once per wal_sender_delay. Of course, this depends on the kernel's scheduler.

> The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.

OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.

> I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.

Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.

> How can we restart the walsender if it shuts down?

Just restart the standby (with walreceiver). The standby connects to the postmaster on the primary, and the postmaster then forks a new walsender.

> Do we want a maximum wait for a transaction and a maximum wait for the server?

ISTM that these features are too much.

> Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.

Sounds good. How about log_min_duration_replication? If the rep time is greater than it, we produce a warning (or log entry), like log_min_duration_xx.

> How do we specify the user we use to connect to the primary?

Yes, I need to add a new option to specify the user name in recovery.conf. Thanks for reminding me!

> Definitely need more explanatory comments/README-style docs.

Completely agreed ;-) I will write a README together with the other documents.

> For example, 03_libpq seems simple and self-contained. I'm not sure why we have a state called PGASYNC_REPLICATION; I was hoping that would be dynamic, but I'm not sure where to look for that. It would be useful to have a very long comment within the code to explain how the replication messages work, and a note on each function saying who the intended client and server is.

OK.
I will reconsider whether PGASYNC_REPLICATION is removable, and write a comment about it.

> 02_pqcomm: What does HAVE_POLL mean?

It identifies whether poll(2) is available on the platform. We use poll(2) if it's defined, otherwise select(2). There is similar code at pqSocketPoll() in fe-misc.c.

> Do we need to worry about periodic renegotiation of keys in be-secure.c?

What are the "keys" you mean?

> Not sure I understand why there are so many new functions in there.

It's because walsender waits for the reply from the standby and the request from the backend concurrently. So we need poll(2) or select(2) to make walsender wait for them, and some functions for non-blocking receiving.

> 04_recovery_conf is a change I agree with, though I think it may not work with EXEC_BACKEND for Windows.

OK. I will examine and fix it.

> 05... I need some commentary to explain this better.
>
> 06 and 07 are large and will take substantial review time. So we must get the overall architecture done first and then check the code that implements that.
>
> 08 - I think I get this, but some docs will help to confirm.

Yes. I need more documentation.

> 09 pg_standby changes: so more changes are coming there? OK. Can we refer to those two options as failover and switchover?

You mean a failover trigger and a switchover one? ISTM that those names and features might not suit. Naming always bothers me, and the current name "commit/abort trigger" might tend to cause confusion. Is there any other suitable name?

> There's no need to change definitions that many Postgres people already use. This change can be done without making any change to server behaviour, so it can benefit 8.2 and 8.3 people also.

Agreed.

> 01_signal_handling: I've looked at the LWLock acquires and releases in the patch and am fairly happy, except for the ProcArrayLock acquire during this sub-patch. Do we really need to do things this way? Is the actual state important? Could we just do this with a counter which cycles? So callers increment the counter atomically and the reader just polls to see if anybody has incremented? Or could we protect that part of the proc with a different lock? Touching ProcArrayLock is bad news.

Agreed. I will add a new lock for proc.signalFlags.

> Anyway, feeling very positive about this. Hope we can get this reviewed and committed in the next 3-4 weeks.
>
> I have many clues as to how to structure my own work also. Thanks.

Thanks again!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
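For illustration, a simplified sketch of the HAVE_POLL fallback just described, modelled loosely on pqSocketPoll() in fe-misc.c; the function name is invented here and error handling is elided:

    #ifdef HAVE_POLL
    #include <poll.h>
    #else
    #include <sys/select.h>
    #endif

    /*
     * Wait until fd is readable or timeout_ms elapses.
     * Returns >0 if ready, 0 on timeout, <0 on error.
     */
    static int
    wait_readable(int fd, int timeout_ms)
    {
    #ifdef HAVE_POLL
        struct pollfd pfd;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        return poll(&pfd, 1, timeout_ms);
    #else
        fd_set          rfds;
        struct timeval  tv;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        return select(fd + 1, &rfds, NULL, NULL, &tv);
    #endif
    }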
On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:
> Thanks for taking many hours to review the code!!
>
> On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't.
>
> Documentation is ongoing. Sorry for my slow progress.
>
> BTW, I'm going to add and change the sgml files listed on the wiki.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first. This was just a first review; I expect to spend more time on it yet.

> > The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.
>
> OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.
>
> > I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.
>
> Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.

Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

> > Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.
>
> Sounds good. How about log_min_duration_replication? If the rep time is greater than it, we produce a warning (or log entry), like log_min_duration_xx.

Maybe. Let's put in something that logs if >50% (?) of the timeout. Make that configurable with a #define and see if we need that to be configurable with a GUC later.

> > Do we need to worry about periodic renegotiation of keys in be-secure.c?
>
> What are the "keys" you mean?

See the notes in that file for explanation. I wondered whether it might be a perf problem for us?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
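A sketch of that check, assuming it would run where the backend finishes waiting for walsender; the function name and the 50% figure are the placeholders from the paragraph above, and only elog() is a real backend call:

    #include "postgres.h"       /* elog(), WARNING */

    #define REPLICATION_WARN_PERCENT 50    /* #define first; GUC later if needed */

    static void
    check_replication_duration(long elapsed_ms, long timeout_ms)
    {
        /* warn when one round trip used more than half the timeout budget */
        if (timeout_ms > 0 &&
            elapsed_ms * 100 > timeout_ms * REPLICATION_WARN_PERCENT)
            elog(WARNING,
                 "replication took %ld ms, over %d%% of replication_timeout (%ld ms)",
                 elapsed_ms, REPLICATION_WARN_PERCENT, timeout_ms);
    }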
On Tue, 2008-12-02 at 11:08 -0800, Jeff Davis wrote:
> On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
> >
> > The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.
> >
> > Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.
>
> What is the "it" in "it is possible"? It seems like there's still a problem window in there.

Marking a transaction aborted after we have written a commit record, but before we have removed it from the proc array and marked it in clog. We'd need a special kind of WAL record to do that.

> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

True, but I did suggest having two timeouts. There is considerable reason to reduce the timeout as well as reason to increase it - at the same time. Anyway, let's wait for some user experience following commit.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
>
> The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.
>
> Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

What is the "it" in "it is possible"? It seems like there's still a problem window in there.

Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating. If the timeout is exceeded, it seems more reasonable to abandon the slave until you can re-sync it, and continue processing as normal. As you pointed out, that's not necessarily an expensive operation because you can use something like rsync. The process of re-syncing might be made easier (or perhaps less costly), of course.

If we want to still allow processing to happen after a timeout, it seems reasonable to have a configurable option to allow/disallow non-read-only transactions when out of sync.

Regards,
Jeff Davis
> Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.

Fujii-san, could you break the patch up into several parts? We have quite a few junior reviewers who are idle right now.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Jeff,

> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

Hmmm. I'd suggest that if we keep getting timeouts in a row for more than 10x the timeout value, replication stops. Unfortunately, we should probably make that *another* configuration setting.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Hi,

On Wed, Dec 3, 2008 at 6:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
> > Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.
>
> Fujii-san, could you break the patch up into several parts? We have quite a few junior reviewers who are idle right now.

Yes, I divided the patch into 9 pieces. Do I need to divide it further?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii-san,

> Yes, I divided the patch into 9 pieces. Do I need to divide it further?

That's plenty. Where do reviewers find the 9 pieces?

-- 
Josh Berkus
PostgreSQL
San Francisco
Hi,

On Wed, Dec 3, 2008 at 3:21 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Fujii-san,
>
> > Yes, I divided the patch into 9 pieces. Do I need to divide it further?
>
> That's plenty. Where do reviewers find the 9 pieces?

The latest patch set (v4) is on the wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Patch_set

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hello,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.
> >
> > OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.
> >
> > > I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.
> >
> > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
>
> The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks.

Are you assuming only the async case? In the sync case, since walsender is awoken by the signal from the backend, we don't need to keep the delay so small. And wal_sender_delay has no relation to the unexpected termination of replication.

> It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.

OK, I will look for his comments. Please let me know which thread has the comments if you know.

> Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

If the transaction is aborted while the backend is waiting for replication, the transaction commit command returns a "false" indication to the client. But the transaction commit record might be written on both the primary and standby. As you say, it may not be dangerous as long as the primary is alive. But when we recover the failed primary, the clog of the transaction is marked "success" because of the commit record. Is it safe? And in that case, the transaction is treated as "success" on the standby, and is visible to read-only queries. On the other hand, it's invisible on the primary. Isn't it dangerous?

> > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> >
> > What are the "keys" you mean?
>
> See the notes in that file for explanation.

Thanks! I will check it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Wed, Dec 3, 2008 at 4:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

If walsender detects a real network failure, the transaction doesn't need to wait for the timeout. Configuring keepalive options would help walsender detect it. Of course, keepalive on Linux might not work as expected.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
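A sketch of the kind of keepalive configuration meant here; SO_KEEPALIVE is portable, while the tuning knobs shown are Linux-specific and guarded accordingly. The helper name is invented for illustration:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int
    enable_keepalive(int sock)
    {
        int on = 1;

        /* turn on keepalive probing for this connection */
        if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;

    #ifdef TCP_KEEPIDLE                 /* Linux-specific tuning knobs */
        {
            int idle = 60;              /* seconds of idle before probing starts */
            int interval = 10;          /* seconds between probes */
            int count = 3;              /* failed probes before the connection drops */

            setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
            setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
            setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
        }
    #endif
        return 0;
    }

With settings like these, a dead peer is noticed in roughly idle + interval * count seconds, rather than only when replication_timeout fires.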
Hi,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:
> > Thanks for taking many hours to review the code!!
> >
> > Documentation is ongoing. Sorry for my slow progress.
> >
> > BTW, I'm going to add and change the sgml files listed on the wiki.
> > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan
>
> I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first.

Since I thought that the figures would be more intelligible for some people than my poor English, I illustrated the architecture first.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Are there any other parts which should be illustrated for review?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-03 at 21:37 +0900, Fujii Masao wrote:
> Since I thought that the figures would be more intelligible for some people than my poor English, I illustrated the architecture first.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design
>
> Are there any other parts which should be illustrated for review?

Those are very useful, thanks.

Some questions to check my understanding (expected answers in brackets):

* Diagram on p.2 has two Archives. We have just one (yes)
* We send data continuously, whether or not we are in sync/async? (yes) So the only difference between sync/async is whether we wait when we flush the commit? (yes)
* If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)
* If two transactions commit almost simultaneously and one is sync and the other async, then only the sync backend will wait? (yes)

Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again? Seems like we can streamline that part in many (all?) cases.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Wed, Dec 3, 2008 at 11:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first.

I wrote up the features and restrictions of Synch Rep. Please also check it together with the figures of the architecture.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#User_Overview

> Some questions to check my understanding (expected answers in brackets)
>
> * Diagram on p.2 has two Archives. We have just one (yes)

No, we need an archive on both the primary and standby. The primary needs an archive because a base backup is required when starting the standby. Meanwhile, the standby needs an archive for cooperating with pg_standby.

If the directory that pg_standby checks were the same as the directory where walreceiver writes the WAL, a halfway-written WAL file might be restored by pg_standby, and continuous recovery would fail. So we have to separate the directories, and I assigned pg_xlog and archive to them.

Another idea: walreceiver writes the WAL to a file with a temporary name, and renames it to the final name when it fills, so pg_standby never restores a halfway-written WAL file (see the sketch below). But it's more difficult to perform a failover because the unrenamed WAL file remains. Do you have any other good idea?

> * We send data continuously, whether or not we are in sync/async? (yes)

Yes.

> So the only difference between sync/async is whether we wait when we flush the commit? (yes)

Yes. And, in the async case, the backend basically doesn't send the wakeup signal to walsender.

> * If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)

No, we can configure them independently. synchronous_commit covers only local writing of the WAL. If synch_*commit* should cover both local writing and replication, I'd like to add a new GUC which covers only local writing (synchronous_local_write?).

> * If two transactions commit almost simultaneously and one is sync and the other async, then only the sync backend will wait? (yes)

Yes.

> Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again?

Yes, because of cooperating with pg_standby.

> Seems like we can streamline that part in many (all?) cases.

Agreed. But I thought that such streamlining was a TODO for next time.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
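A sketch of the temporary-name alternative mentioned above: walreceiver fills "<segment>.partial" and renames it into place only once the segment is complete, so pg_standby never sees a half-written file. All names here are illustrative, not from the patch:

    #include <stdio.h>

    /* Called by walreceiver once a WAL segment has been fully written. */
    static int
    finalize_segment(const char *xlogdir, const char *segname)
    {
        char tmppath[1024];
        char finalpath[1024];

        snprintf(tmppath, sizeof(tmppath), "%s/%s.partial", xlogdir, segname);
        snprintf(finalpath, sizeof(finalpath), "%s/%s", xlogdir, segname);

        /*
         * rename() is atomic within a filesystem, so a restorer such as
         * pg_standby sees either no file or a complete one.
         */
        return rename(tmppath, finalpath);
    }

The drawback Fujii-san notes follows directly from this design: after a crash, the ".partial" file holds the newest WAL but is invisible to the normal restore path, so failover must handle it specially.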
Hi,

On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> > >
> > > What are the "keys" you mean?
> >
> > See the notes in that file for explanation.
>
> Thanks! I will check it.

The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key. Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:
> On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > > > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> > > >
> > > > What are the "keys" you mean?
> > >
> > > See the notes in that file for explanation.
> >
> > Thanks! I will check it.
>
> The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key.

Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-12-04 at 16:10 +0900, Fujii Masao wrote:
> > * Diagram on p.2 has two Archives. We have just one (yes)
>
> No, we need an archive on both the primary and standby. The primary needs an archive because a base backup is required when starting the standby. Meanwhile, the standby needs an archive for cooperating with pg_standby.
>
> If the directory that pg_standby checks were the same as the directory where walreceiver writes the WAL, a halfway-written WAL file might be restored by pg_standby, and continuous recovery would fail. So we have to separate the directories, and I assigned pg_xlog and archive to them.
>
> Another idea: walreceiver writes the WAL to a file with a temporary name, and renames it to the final name when it fills, so pg_standby never restores a halfway-written WAL file. But it's more difficult to perform a failover because the unrenamed WAL file remains.

WAL sending is either via the archiver or via streaming. We must switch cleanly from one mode to the other and not halfway through a WAL file. When WAL sending is about to begin, issue an xlog switch. Then tell the archiver to shut down once it has got to the last file. All files after that point are streamed. So there need be no conflict in filenames.

We must avoid having two archives, because people will configure this incorrectly.

> > * If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)
>
> No, we can configure them independently. synchronous_commit covers only local writing of the WAL. If synch_*commit* should cover both local writing and replication, I'd like to add a new GUC which covers only local writing (synchronous_local_write?).

The only sensible settings are:

synchronous_commit = on, synchronous_replication = on
synchronous_commit = on, synchronous_replication = off
synchronous_commit = off, synchronous_replication = off

This doesn't make any sense: (does it??)

synchronous_commit = off, synchronous_replication = on

> > Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again?
>
> Yes, because of cooperating with pg_standby.

It seems very easy to make this happen the way we want. We could make pg_standby look into pg_xlog also, for example.

I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.

We need the existing mechanisms for the start of replication (base backup etc.) but we don't need them after that point.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:
> > The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key.
>
> Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.

It will certainly add an extra piece of delay. But if you are worried about performance for it, you are likely not running SSL. Plus, if you don't renegotiate the key, you gamble with security.

If it does have a negative effect on the robustness of the replication, we should just recommend against using it - or refuse to use it - not disable renegotiation.

/Magnus
On Thu, 2008-12-04 at 12:41 +0100, Magnus Hagander wrote:
> > Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.
>
> It will certainly add an extra piece of delay. But if you are worried about performance for it, you are likely not running SSL. Plus, if you don't renegotiate the key, you gamble with security.
>
> If it does have a negative effect on the robustness of the replication, we should just recommend against using it - or refuse to use it - not disable renegotiation.

I didn't mean to imply renegotiation might be optional. I just wanted to check whether there is anything to worry about as a result of it; there may not be. *If* it took a long time, I would not want sync commits to wait for it.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Thu, Dec 4, 2008 at 6:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The only sensible settings are:
>
> synchronous_commit = on, synchronous_replication = on
> synchronous_commit = on, synchronous_replication = off
> synchronous_commit = off, synchronous_replication = off
>
> This doesn't make any sense: (does it??)
>
> synchronous_commit = off, synchronous_replication = on

If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible. IIRC, MySQL Cluster might use that strategy.

> I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.

You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hello,

On Fri, Dec 5, 2008 at 12:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.
>
> You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.

Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.

Development plan:
- Share the end-of-WAL address via shared memory <--- Done! (see the sketch below)
- Change ReadRecord() to wait for the next WAL *record* to be available.
- Change ReadRecord() to restore the WAL from the archive by using pg_standby before reaching the replication starting position, then read the half-streamed WAL from pg_xlog.
- Add a new trigger for promoting the standby to the primary. As the trigger, when fast shutdown (SIGINT) is requested during recovery, the standby would recover the WAL up to the end and become the primary.

What system call does walreceiver have to call on the WAL before the startup process reads it? Probably we need to call write(2), and don't need fsync(2) on Linux. How about other platforms?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
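As one possible shape for the first item above, a minimal sketch of sharing the received-WAL end point between walreceiver and the startup process; the struct and function names are invented here and need not match the patch:

    /* Backend-style sketch; assumes the usual PostgreSQL headers. */
    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "storage/spin.h"       /* slock_t, SpinLockAcquire/Release */

    typedef struct
    {
        slock_t     mutex;          /* protects receivedUpto */
        XLogRecPtr  receivedUpto;   /* end of WAL written (via write(2)) by walreceiver */
    } WalRcvShmemData;

    static WalRcvShmemData *WalRcv; /* set up at shmem initialization */

    /* walreceiver: publish a new end point after writing more WAL */
    static void
    SetReceivedUpto(XLogRecPtr ptr)
    {
        SpinLockAcquire(&WalRcv->mutex);
        WalRcv->receivedUpto = ptr;
        SpinLockRelease(&WalRcv->mutex);
    }

    /* startup process: never replay past the value returned here */
    static XLogRecPtr
    GetReceivedUpto(void)
    {
        XLogRecPtr  ptr;

        SpinLockAcquire(&WalRcv->mutex);
        ptr = WalRcv->receivedUpto;
        SpinLockRelease(&WalRcv->mutex);
        return ptr;
    }

A spinlock suffices because the critical section is a single assignment; ReadRecord() in the startup process would simply poll this value instead of waiting for a whole file to arrive.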
Hi, sorry for my consecutive posting.

On Fri, Dec 5, 2008 at 4:00 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.
>
> Development plan:
> - Share the end-of-WAL address via shared memory <--- Done!
> - Change ReadRecord() to wait for the next WAL *record* to be available.
> - Change ReadRecord() to restore the WAL from the archive by using pg_standby before reaching the replication starting position, then read the half-streamed WAL from pg_xlog.
> - Add a new trigger for promoting the standby to the primary. As the trigger, when fast shutdown (SIGINT) is requested during recovery, the standby would recover the WAL up to the end and become the primary.
>
> What system call does walreceiver have to call on the WAL before the startup process reads it? Probably we need to call write(2), and don't need fsync(2) on Linux. How about other platforms?

I added the figures about the latest architecture into the PDF file. Please check p.6 and p.7. Is this architecture close to your image?
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2008-12-05 at 16:00 +0900, Fujii Masao wrote:
> > You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.
>
> Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.

Sounds good. Perhaps you can share what changed your mind in those 4 hours...

Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.

My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.

Let's look at all of the corner cases also:
* standby keeps pace with primary (desired state)
* standby falls behind primary
* standby restarts to change shmem settings
etc

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-05 at 12:09 +0900, Fujii Masao wrote:
> On Thu, Dec 4, 2008 at 6:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The only sensible settings are:
> >
> > synchronous_commit = on, synchronous_replication = on
> > synchronous_commit = on, synchronous_replication = off
> > synchronous_commit = off, synchronous_replication = off
> >
> > This doesn't make any sense: (does it??)
> >
> > synchronous_commit = off, synchronous_replication = on
>
> If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible.

Do you think it likely that your replication time is consistently and noticeably less than your time-to-disk? If not, you'll wait just as long but be less robust. I guess it's possible.

On a related thought: presumably we force a sync rep if forceSyncCommit is set?

> IIRC, MySQL Cluster might use that strategy.

Not the most convincing argument I've heard.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Fri, Dec 5, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible.
>
> Do you think it likely that your replication time is consistently and noticeably less than your time-to-disk?

It depends on the system environment:
- How many miles between the two servers? Same rack? Separate continents?
- Does the system have high-end storage? A cheap one?
... etc.

> On a related thought: presumably we force a sync rep if forceSyncCommit is set?

Yes! Please see RecordTransactionCommit() in xact.c (in my patch).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
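The interaction being confirmed here can be summarised in two predicates; this is a sketch only, with illustrative names (forceSyncCommit itself is real in xact.c and is set by operations that must never be lost, such as CREATE DATABASE):

    #include <stdbool.h>

    /* Does commit wait for the local WAL flush? */
    static bool
    commit_waits_for_flush(bool synchronous_commit, bool force_sync_commit)
    {
        /* forceSyncCommit overrides asynchronous commit... */
        return synchronous_commit || force_sync_commit;
    }

    /* Does commit wait for the standby to acknowledge? */
    static bool
    commit_waits_for_standby(bool synchronous_replication,
                             bool force_sync_commit)
    {
        /* ...and, per the answer above, forces sync replication too */
        return synchronous_replication || force_sync_commit;
    }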
Greetings!

On Fri, Dec 5, 2008 at 6:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.
>
> Sounds good. Perhaps you can share what changed your mind in those 4 hours...

Yeah, it was my imagining the real situation after the 8.4 release, especially the future conjugal life of Synch Rep and Hot Standby ;) Waiting to redo until the file fills might lead to marital breakdown.

> Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.
>
> My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.

Yes, I also understand it so. Updated sequence pictures are on the wiki, as per usual. Please see p.3 and p.4.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

> Let's look at all of the corner cases also:
> * standby keeps pace with primary (desired state)
> * standby falls behind primary
> * standby restarts to change shmem settings
> etc

Yes, I will examine such cases!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, 2008-12-06 at 17:55 +0900, Fujii Masao wrote:
> Yeah, it was my imagining the real situation after the 8.4 release, especially the future conjugal life of Synch Rep and Hot Standby ;) Waiting to redo until the file fills might lead to marital breakdown.

You're obviously working with some comedians now. ;-)

> > Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.
> >
> > My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.
>
> Yes, I also understand it so. Updated sequence pictures are on the wiki, as per usual. Please see p.3 and p.4.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

p.6 looks good.

But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?

What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter? Or to reboot the system it is on. Does the primary switch back to writing files to the archive?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi, thanks for the comment!

On Mon, Dec 8, 2008 at 11:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> p.6 looks good.
>
> But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?

p.7 shows one of the system configuration examples. Some people who don't want to share an archive between the two servers would probably choose this configuration, I think.

If the archive is not shared, some WAL files from before replication starts would not be copied automatically from the primary to the standby. So we have to copy them by hand, or using clusterware, etc. This is what p.7 shows. If the archive is shared, the archiver on the primary would copy them automatically (p.6).

> What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter?

Stop postgres using immediate shutdown, and start postgres from the existing database cluster directory. When restarting postgres, if there are one or more archives, we also need to copy the WAL files written after replication stopped, before restarting replication.

> Or to reboot the system it is on. Does the primary switch back to writing files to the archive?

I assume that the primary always writes files to the archive; that is, basically the primary doesn't switch to non-archiving mode. Of course, if archiving is disabled on the primary for any reason when restarting the standby, the primary needs to switch back.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2008-12-09 at 17:15 +0900, Fujii Masao wrote:
> > But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?
>
> p.7 shows one of the system configuration examples. Some people who don't want to share an archive between the two servers would probably choose this configuration, I think.
>
> If the archive is not shared, some WAL files from before replication starts would not be copied automatically from the primary to the standby. So we have to copy them by hand, or using clusterware, etc. This is what p.7 shows. If the archive is shared, the archiver on the primary would copy them automatically (p.6).

I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??

> > What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter?
>
> Stop postgres using immediate shutdown, and start postgres from the existing database cluster directory. When restarting postgres, if there are one or more archives, we also need to copy the WAL files written after replication stopped, before restarting replication.
>
> > Or to reboot the system it is on. Does the primary switch back to writing files to the archive?
>
> I assume that the primary always writes files to the archive; that is, basically the primary doesn't switch to non-archiving mode.

OK, I think that clears up what I was seeing in the code, i.e. I didn't understand the modes of operation.

I really like most of what you've done, though you must forgive me for saying I still don't like this. I really am with you on how tiresome that sounds.

For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.

The cleanest way I can see is to have two modes of operation:
* First mode is file-based log shipping (FLS) (i.e. "warm standby")
* Second mode is streaming log shipping (SLS) (wal sender to wal receiver)

When we start, we are in FLS mode; then we catch up to the cross-over point and we switch to SLS mode. If streaming stops, we just switch back to FLS mode. If they reconnect, we follow the same procedure again. So the two modes are compatible, but are never simultaneously active except for a short period when we switch modes. If SLS mode is active then the archiver doesn't send files. If FLS mode is active, we send files. All of the places in the code that are currently not optimised when XLogArchivingActive() must remain unoptimised for either FLS or SLS mode, so we need a new name for that.

This makes the least number of changes to the existing architecture. People currently use FLS mode and understand it (!); they just add an understanding of SLS mode. It's also a very straightforward architecture, which means fewer code paths and less weird bugs. (There's been enough already, as you know.)

So just for clarity, let me rephrase it: We set up FLS mode as we do currently. Then we initiate SLS mode. At the end of the next WAL file on the primary we archive it, then turn off archiving on the primary. (So for up to one WAL file we operate the two modes together.) If SLS mode ends, we send the next WAL file via the archiver. Some part of that file has already been streamed across, but that doesn't matter. (If SLS mode ends because the primary is down, we obviously do nothing. If we have a split-brain situation then we rely on clusterware to kill us (STONITH).)
So AFAICS p.6 of the architecture is all we really need. Nice, simple.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
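A compact sketch of the two-mode scheme just described; the enum and the transition notes are illustrative only, not code from the patch:

    typedef enum
    {
        SHIPPING_FLS,   /* file-based log shipping: the archiver ships
                         * completed WAL files ("warm standby") */
        SHIPPING_SLS    /* streaming log shipping: walsender streams WAL
                         * records to walreceiver */
    } ShippingMode;

    /*
     * Transitions, per the description above:
     *
     *   FLS -> SLS: the standby catches up; the primary archives the
     *   current segment (forcing an xlog switch), then stops sending
     *   files -- the two modes overlap for at most one WAL file.
     *
     *   SLS -> FLS: the stream drops; the next completed WAL file goes
     *   out via the archiver again.  Part of that file was already
     *   streamed, but resending it inside the file is harmless.
     */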
Simon Riggs wrote:
> For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.

What if you want to run archiving for backup purposes, and also have a standby server?

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-09 at 14:42 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.
>
> What if you want to run archiving for backup purposes, and also have a standby server?

If we want to include that as an option, yes. If it is "always on" then no; not everybody wants that.

The best way to implement that is to archive from the standby, not to send the data twice. By definition the archive is more closely associated with the standby node than the primary.

Maybe I misunderstood the diagrams? The additional flows to the archive are actually all optional?

Anyway, I enclose a slightly simplified version of p.6 to allow us to see the progression of file mode through to streaming mode. This is an in-my-understanding version.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
[Attachment: simplified version of the p.6 architecture diagram]
Hi,

Thanks for explaining the architecture in detail!

> If we want to include that as an option, yes. If it is "always on" then no; not everybody wants that.

Yes. I also think that archiving should be optional on each server.

> The best way to implement that is to archive from the standby, not to send the data twice. By definition the archive is more closely associated with the standby node than the primary.
>
> Maybe I misunderstood the diagrams? The additional flows to the archive are actually all optional?
>
> Anyway, I enclose a slightly simplified version of p.6 to allow us to see the progression of file mode through to streaming mode. This is an in-my-understanding version.

Yes, I basically agree with you! The only difference between us is whether the primary also has to switch between the two modes (FLS <-> SLS). I think that the primary doesn't need to stop archiving forcibly when replication starts; that should be optional for the user. The user who doesn't want to archive can disable archiving using the existing mechanism (change archive_command & pg_ctl reload). It's more complicated to switch the modes on each server.

For clarity: the user can choose a strategy of archiving from the following.

1) both primary and standby archive
2) only the primary archives
3) only the standby archives
4) no server archives

The user who doesn't want to share an archive would choose 1). The user who wants to share an archive and cannot accept any increase of bandwidth would choose 4). On the other hand, the user who can accept it would choose 2) or 3). I prefer 2) to 3), for multiple standbys in the future. And, if 3) is adopted, I wonder if we can get a base backup. Can we get it from the standby during recovery?

> I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??

First of all, I'd not like to buy a machine only for an archive, other than the primary and standby. Meanwhile, if an archive is located on either the primary or the standby (which should we locate it on?), post-failure processing is complicated.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-10 at 14:51 +0900, Fujii Masao wrote:
> Yes, I basically agree with you! The only difference between us is whether the primary also has to switch between the two modes (FLS <-> SLS). I think that the primary doesn't need to stop archiving forcibly when replication starts; that should be optional for the user. The user who doesn't want to archive can disable archiving using the existing mechanism (change archive_command & pg_ctl reload). It's more complicated to switch the modes on each server.

Yes, I see that a manual change of parameter is possible. But it is difficult to get the timing of the manual change correct, and yet important not to get that wrong. I don't want to spend the next year answering questions on list about how that works and agreeing that it isn't ideal. We should have an optional mechanism that will turn archiving on the primary off *automatically* when the mode changes. Maybe a third mode on archive_mode to cater for this, but other ways are possible also.

> For clarity: the user can choose a strategy of archiving from the following.
>
> 1) both primary and standby archive
> 2) only the primary archives
> 3) only the standby archives
> 4) no server archives

Those are all possible, but they aren't all equally usable as it stands.

In my experience most people do things very simply, so (4) is the common use case. So it needs to Just Work. We need to cater for a range of use cases, from simple implementations through to complex multi-node cases. I don't think it's right to assume that everybody is implementing a complex use case and so we mostly cater for that.

> The user who doesn't want to share an archive would choose 1).

If we include a feature you need to explain why it's there. Asking the question doesn't mean that I'm opposed, just that I'm checking why you think it's important to have that option. So, why would you want to run with multiple archives?

> The user who wants to share an archive and cannot accept any increase of bandwidth would choose 4). On the other hand, the user who can accept it would choose 2) or 3). I prefer 2) to 3), for multiple standbys in the future. And, if 3) is adopted, I wonder if we can get a base backup. Can we get it from the standby during recovery?

That's an important feature, so we should make it "yes". (Can't understand why you've built this with the archiver active on the standby node if this isn't possible.) People I talk to consider "low impact on primary" to be an important aspect of this feature. Though if you forced me to prioritise, I would say making (4) automatic is more important than (3).

> > I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??
>
> First of all, I'd not like to buy a machine only for an archive, other than the primary and standby. Meanwhile, if an archive is located on either the primary or the standby (which should we locate it on?), post-failure processing is complicated.

Are you saying that putting the archive on the primary is an option? What is complicated about having the archive on the standby server?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Wed, 2008-12-10 at 14:51 +0900, Fujii Masao wrote:
>> For clarity: the user can choose an archiving strategy from the following.
>>
>> 1) each primary and standby archives
>> 2) only primary archives
>> 3) only standby archives
>> 4) no server archives
>
> Those are all possible, but they aren't all equally usable as it stands.
>
> In my experience most people do things very simply, so (4) is the common > use case. So it needs to Just Work.

Agreed.

All this talk about archiving and streaming working at the same time is very confusing. AFAICS, the patch as submitted doesn't work if archiving is disabled in the primary, which means that strategies (3) and (4) in your list are not possible. The standby relies on the archiving and file-based log shipping to work correctly. The streaming is just an extra thing, shortcutting the normal file-based log shipping path to keep the latest WAL segment up-to-date in the standby.

In the current form, is there any reason why walreceiver needs to be an integrated server process? Couldn't it just be a stand-alone program that connects to the primary and writes the received records to the right WAL file? The only reason I can see is to reliably kill it when the standby server is promoted to primary.

For a solution that doesn't depend on the file-based log shipping, I think we'll need a way for the standby to request a certain starting point for the streaming when it connects. When the standby starts, it would first recover all the log segments it can obtain using recovery_command, and then connect to the primary and request to start streaming from where recovery_command stopped.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:

> For a solution that doesn't depend on the file-based log shipping, I > think we'll need a way for the standby to request a certain starting > point for the streaming when it connects. When the standby starts, it > would first recover all the log segments it can obtain using > recovery_command, and then connect to the primary and request to start > streaming from where recovery_command stopped.

That was already suggested and rejected because it introduces a potentially unacceptable delay in the start of synch replication - for large databases this could be hours. (I should add it was suggested by me and I now accept that it should be rejected.)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:

> In the current form, is there any reason why walreceiver needs to be > an integrated server process? Couldn't it just be a stand-alone > program that connects to the primary and writes the received records > to the right WAL file? The only reason I can see is to reliably kill > it when the standby server is promoted to primary.

Reasons:

* integration: we have one service we stop and start, not two. We want one log, one set of commands, one set of parameters etc.

* cooperation: if walreceiver is a server process we can reasonably communicate the current WAL limit via shared memory. That gives us a smooth flow of WAL between receiver and replay (startup process) rather than a burst of activity each time a file arrives. That helps smooth performance and minimises failover time. Without this we would need to retain the concept of archive_timeout on the primary even when streaming, which is fairly strange.

* code management

Other than that there isn't that much in it...

We've all read the stuff about how other RDBMSs come with integrated replication. We *can* make this integrated, robust and very very easy to use, yet with flexibility for a variety of purposes.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:
>
>> For a solution that doesn't depend on the file-based log shipping, I >> think we'll need a way for the standby to request a certain starting >> point for the streaming when it connects. When the standby starts, it >> would first recover all the log segments it can obtain using >> recovery_command, and then connect to the primary and request to start >> streaming from where recovery_command stopped.
>
> That was already suggested and rejected because it introduces a > potentially unacceptable delay in the start of synch replication - for > large databases this could be hours. (I should add it was suggested by > me and I now accept that it should be rejected.)

I don't understand that argument. If the standby is missing say 100 log files, it's not up-to-date with the primary until it has somehow obtained and replayed all those log files. It doesn't make any difference whether it obtains them over the wire via walreceiver, or via an archive. Until it has obtained and replayed all those files, it's not up-to-date, and a failover would lead to data loss.

Or did I misunderstand what "start of synch replication" means? Got a pointer to the previous discussion?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:

> * cooperation: if walreceiver is a server process we can reasonably > communicate the current WAL limit via shared memory. That gives us a > smooth flow of WAL between receiver and replay (startup process) rather > than a burst of activity each time a file arrives. That helps smooth > performance and minimises failover time. Without this we would need to > retain the concept of archive_timeout on the primary even when > streaming, which is fairly strange.

Does it actually do that? I can see comments suggesting that in walreceiver, but I can't find the place in xlog.c where the startup process does the waiting.

> * code management
>
> Other than that there isn't that much in it...

Ok, just making sure I wasn't missing something crucial. I agree it should be integrated. What I'm actually worried about is that this system isn't integrated enough, and having to set up the archiving, pg_standby, and the synchronous replication itself, correctly, makes it too complex to be practical.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-10 at 20:52 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:
> >
> >> For a solution that doesn't depend on the file-based log shipping, I >> think we'll need a way for the standby to request a certain starting >> point for the streaming when it connects. When the standby starts, it >> would first recover all the log segments it can obtain using >> recovery_command, and then connect to the primary and request to start >> streaming from where recovery_command stopped.
> >
> > That was already suggested and rejected because it introduces a > > potentially unacceptable delay in the start of synch replication - for > > large databases this could be hours. (I should add it was suggested by > > me and I now accept that it should be rejected.)
>
> I don't understand that argument. If the standby is missing say 100 log > files, it's not up-to-date with the primary until it has somehow > obtained and replayed all those log files. It doesn't make any difference > whether it obtains them over the wire via walreceiver, or via an > archive. Until it has obtained and replayed all those files, it's not > up-to-date, and a failover would lead to data loss.
>
> Or did I misunderstand what "start of synch replication" means? Got a > pointer to the previous discussion?

I think you just went down the same path I did before. (That's a good sign.)

When the WAL starts streaming the *primary* can immediately perform synchronous replication, i.e. commit waits for transfer. The *standby* has an initial lag before it catches up, whatever we do (as you say).

I suggested that way initially because it simplifies the mode change. The mode change isn't really that complex, so I agreed we should change it.

The two ways of doing this are/were:

1. (Initial suggestion)
* allow standby to catch up
* then connect and allow sync rep

2. Preferred Choice
* connect to primary and allow sync rep
* catch up

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 11:52 -0800, Jeff Davis wrote:

> On Wed, 2008-12-10 at 09:48 +0000, Simon Riggs wrote:
> > What is complicated about having the archive on the standby server?
>
> If the storage on the standby fails, you would lose the archive, right?

As well as the standby itself presumably. Either way you need to restart from a base backup.

> I think there's a use case for having two identical servers, and just > setting them up to replicate synchronously. Many of these use-cases > might not even care much about write performance or the duplication of > maintaining two copies of the archive.

Yes, that's what I've said also.

> They might care a lot about PITR > though, and that would be impossible if you lose the archive.

Agreed, yes we need it as an option.

> Do you see a cost to allowing all of the options listed by Fujii Masao?

I haven't argued in favour of removing any options, so I'm not sure what you mean. I have asked for an explanation of why certain features are needed, so we can judge whether there is a simpler way of providing everything required. It may not exist.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [081210 14:58]:

> I think you just went down the same path I did before. (That's a good > sign.)
>
> When the WAL starts streaming the *primary* can immediately perform > synchronous replication, i.e. commit waits for transfer. The *standby* > has an initial lag before it catches up, whatever we do (as you say).
>
> I suggested that way initially because it simplifies the mode change. > The mode change isn't really that complex, so I agreed we should change > it.
>
> The two ways of doing this are/were:
>
> 1. (Initial suggestion)
> * allow standby to catch up
> * then connect and allow sync rep
>
> 2. Preferred Choice
> * connect to primary and allow sync rep
> * catch up

Call me thick, but I'm confused... In sync rep, there *can't be* any catching up to do... i.e. if the "slave" isn't accepting the WAL the master "stops" doing *anything*...

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, 2008-12-10 at 09:48 +0000, Simon Riggs wrote:

> What is complicated about having the archive on the standby server?

If the storage on the standby fails, you would lose the archive, right?

I think there's a use case for having two identical servers, and just setting them up to replicate synchronously. Many of these use-cases might not even care much about write performance or the duplication of maintaining two copies of the archive. They might care a lot about PITR though, and that would be impossible if you lose the archive.

Do you see a cost to allowing all of the options listed by Fujii Masao?

Regards, Jeff Davis
On Wed, 2008-12-10 at 21:02 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > * cooperation: if walreceiver is a server process we can reasonably > > communicate the current WAL limit via shared memory. That gives us a > > smooth flow of WAL between receiver and replay (startup process) rather > > than a burst of activity each time a file arrives. That helps smooth > > performance and minimises failover time. Without this we would need to > > retain the concept of archive_timeout on the primary even when > > streaming, which is fairly strange.
>
> Does it actually do that? I can see comments suggesting that in > walreceiver, but I can't find the place in xlog.c where the startup > process does the waiting.

Not yet... we agreed it would do that a few days ago. This thread, Fri 5 Dec.

> > * code management
> >
> > Other than that there isn't that much in it...
>
> Ok, just making sure I wasn't missing something crucial. I agree it > should be integrated. What I'm actually worried about is that this > system isn't integrated enough, and having to set up the archiving, > pg_standby, and the synchronous replication itself, correctly, makes it > too complex to be practical.

I'm worried about the complexity also. If we didn't use the existing archiving mechanism we'd need to invent something that looks just like it. If I could get rid of pg_standby as well, I would. I've got no qualms about chopping stuff I wrote, as long as we do it for a good reason. Keeping the parts of the old model that make sense means less code and less process change for existing users.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 20:04 +0000, Simon Riggs wrote:

> > They might care a lot about PITR > > though, and that would be impossible if you lose the archive.
>
> Agreed, yes we need it as an option.
>
> > Do you see a cost to allowing all of the options listed by Fujii Masao?
>
> I haven't argued in favour of removing any options, so I'm not sure what you > mean. I have asked for an explanation of why certain features are needed, > so we can judge whether there is a simpler way of providing everything > required. It may not exist.

I was trying to provide a use-case for maintaining the archive on both primary and standby, i.e. option (1). My understanding was that you were asking for such a use case with this question: "So, why would you want to run with multiple archives?"

Regards, Jeff Davis
Simon Riggs wrote:

> When the WAL starts streaming the *primary* can immediately perform > synchronous replication, i.e. commit waits for transfer.

Until the standby has obtained all the missing log files, it's not up-to-date, and there's no guarantee that it can finish the replay. For example, imagine that your archive_command is an scp from the primary to the standby. If lightning strikes the primary before some WAL file has been copied over to the archive directory in the standby, the standby can't catch up. In the primary then, what's the point of having a commit wait for transfer, if the reply from the standby doesn't guarantee that the transaction is safe in the standby?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > When the WAL starts streaming the *primary* can immediately perform > > synchronous replication, i.e. commit waits for transfer.
>
> Until the standby has obtained all the missing log files, it's not > up-to-date, and there's no guarantee that it can finish the replay. For > example, imagine that your archive_command is an scp from the primary to > the standby. If lightning strikes the primary before some WAL file has > been copied over to the archive directory in the standby, the standby > can't catch up. In the primary then, what's the point of having a commit > wait for transfer, if the reply from the standby doesn't guarantee that > the transaction is safe in the standby?

The WAL files will have already left the primary.

The timeline is this, in my understanding:

1 [Primary] Set up continuous archiving
2 [Primary] Take base backup
3 [Standby] Connect to primary to initiate streaming
4 [Primary] Log switch and, optionally, turn off archiving
5 [Standby] Begin replaying files, initially from archive
6 [Standby] Switch to replaying WAL records immediately after streaming

So sync rep would turn on after step 4, so that all intermediate WAL files have been sent to the archive. If we lose the Primary after this point then all transactions are accessible to the standby. If we lose the Standby or Archive, then we need to replace them and re-run the above.

The above was outlined on the thread "Synchronous Log Shipping Replication" and pretty much all agreed on 18 Sep.

Recent changes I have requested in the architecture are:

* making archiving optional on the primary, so we don't need to send WAL data *twice*
* allowing streaming/startup process to work together via shared memory, to reduce average replication delay and improve performance
* skipping the archiving/de-archiving step on the standby because it's superfluous

(all on this thread)

All of those are fairly minor code changes, but they reduce the complexity of the solution and significantly reduce the amount of copying of WAL files (3 copy actions to/from the archive removed without loss of robustness). I would have made the suggestions earlier but it wasn't until I saw the architecture diagrams that I understood the intention of the code.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
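In configuration terms, steps 1 and 4 of this timeline might look like the minimal sketch below (commands illustrative; the reload trick in step 4 is the existing mechanism Fujii-san mentioned, since archive_mode itself cannot be changed without a restart):

    # step 1, primary postgresql.conf: continuous archiving on
    archive_mode = on
    archive_command = 'cp %p /archive/%f'

    # step 4, primary (optional): stop shipping to the archive once streaming runs
    archive_command = '/bin/true'    # no-op: segments are marked archived and recycled
    # then: pg_ctl reload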
Simon Riggs wrote:

> On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
>> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
>
> The WAL files will have already left the primary.
>
> The timeline is this, in my understanding:
> 1 [Primary] Set up continuous archiving
> 2 [Primary] Take base backup
> 3 [Standby] Connect to primary to initiate streaming
> 4 [Primary] Log switch and, optionally, turn off archiving
> 5 [Standby] Begin replaying files, initially from archive
> 6 [Standby] Switch to replaying WAL records immediately after streaming
>
> So sync rep would turn on after step 4, so that all intermediate WAL > files have been sent to the archive. If we lose the Primary after this > point then all transactions are accessible to the standby. If we lose the > Standby or Archive, then we need to replace them and re-run the above.

Between steps 4 and 5, there's no guarantee that all WAL files generated after step 3 and the start of streaming have already been archived. There's a delay between writing a WAL file and when the file has been safely archived. If you lose the primary during that window, the standby will have old WAL files in the archive, and the most recent ones received by walreceiver, but it's missing the WAL files generated just before the switch to streaming mode.

> Recent changes I have requested in the architecture are:
> * making archiving optional on the primary, so we don't need to send WAL > data *twice*

Agreed. I'm not so much worried about the bandwidth, but it's a lot of extra work from an administration point of view. It's very hard to get it right, so that you eliminate windows like the above.

As the patch stands, if you turn off archiving in the primary, and the standby ever disconnects, even for only a few seconds, the standby will miss any WAL generated until it reconnects, and without archiving there's no way for the standby to get hold of the missed WAL.

> * allowing streaming/startup process to work together via shared memory, > to reduce average replication delay and improve performance
> * skipping the archiving/de-archiving step on the standby because it's superfluous
>
> (all on this thread)
>
> All of those are fairly minor code changes, but they reduce the complexity of > the solution and significantly reduce the amount of copying of WAL files (3 > copy actions to/from the archive removed without loss of robustness). I > would have made the suggestions earlier but it wasn't until I saw the > architecture diagrams that I understood the intention of the code.

To make archiving optional in the primary, I don't see any other choice than adding the capability for the standby to request arbitrary WAL files from the primary, over the wire. That seems like a pretty significant change to walsender: it needs to be able to read WAL not only from wal_buffers, but from files.
That would be a good idea for performance reasons, too: currently if there's a network glitch and the primary doesn't get acknowledgements from the standby for a short while, XLogInserts in the primary will block waiting for the standby after wal_buffers fills up. That's not a big deal for synchronous replication, but in asynchronous mode you don't want network glitches like that to stall the primary. And of course it means changes in the startup code as well. And we'll need bookkeeping in the primary of what WAL the standby has already received, so that it doesn't recycle the WAL segments until they've been sent to the standby. Or alternatively, the primary needs to be able to retrieve segments from the archive, but then we're dependent on archiving again. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 11:29 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
> >> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
> >
> > The WAL files will have already left the primary.
> >
> > The timeline is this, in my understanding:
> > 1 [Primary] Set up continuous archiving
> > 2 [Primary] Take base backup
> > 3 [Standby] Connect to primary to initiate streaming
> > 4 [Primary] Log switch and, optionally, turn off archiving
> > 5 [Standby] Begin replaying files, initially from archive
> > 6 [Standby] Switch to replaying WAL records immediately after streaming
> >
> > So sync rep would turn on after step 4, so that all intermediate WAL > > files have been sent to the archive. If we lose the Primary after this > > point then all transactions are accessible to the standby. If we lose the > > Standby or Archive, then we need to replace them and re-run the above.
>
> Between steps 4 and 5, there's no guarantee that all WAL files generated > after step 3 and the start of streaming have already been archived. > There's a delay between writing a WAL file and when the file has been > safely archived. If you lose the primary during that window, the standby > will have old WAL files in the archive, and the most recent ones received > by walreceiver, but it's missing the WAL files generated just before the > switch to streaming mode.

I was presuming that the synchronisation was clear, but I'm sorry it wasn't. Sync rep would begin only *after* the last WAL file was archived.

> > Recent changes I have requested in the architecture are:
> > * making archiving optional on the primary, so we don't need to send WAL > > data *twice*
>
> Agreed. I'm not so much worried about the bandwidth, but it's a lot of > extra work from an administration point of view. It's very hard to get it > right, so that you eliminate windows like the above.
>
> As the patch stands, if you turn off archiving in the primary, and the > standby ever disconnects, even for only a few seconds, the standby will > miss any WAL generated until it reconnects, and without archiving > there's no way for the standby to get hold of the missed WAL.

I described earlier that archiving would turn back on again if the replication ever failed (with correct synchronisation).

All I've asked for is the ability to turn archiving off and back on again, yes, with synchronisation so it's safe.

Personally, I think people will laugh if we tell them we decided to ship all the data twice and couldn't see another way. That's the kind of thing people give presentations at PGCon about...
> > * allowing streaming/startup process to work together via shared memory, > > to reduce average replication delay and improve performance
> > * skipping the archiving/de-archiving step on the standby because it's superfluous
> >
> > (all on this thread)
> >
> > All of those are fairly minor code changes, but they reduce the complexity of > > the solution and significantly reduce the amount of copying of WAL files (3 > > copy actions to/from the archive removed without loss of robustness). I > > would have made the suggestions earlier but it wasn't until I saw the > > architecture diagrams that I understood the intention of the code.
>
> To make archiving optional in the primary, I don't see any other choice > than adding the capability for the standby to request arbitrary WAL > files from the primary, over the wire.

I don't think that's the only or even a desirable way. We cannot allow a build-up of WAL files to occur on the primary. Making archiving optional isn't the big deal you're saying it is.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi,

On Thu, Dec 11, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> > Recent changes I have requested in the architecture are:
>> > * making archiving optional on the primary, so we don't need to send WAL >> > data *twice*
>>
>> Agreed. I'm not so much worried about the bandwidth, but it's a lot of >> extra work from an administration point of view. It's very hard to get it >> right, so that you eliminate windows like the above.
>>
>> As the patch stands, if you turn off archiving in the primary, and the >> standby ever disconnects, even for only a few seconds, the standby will >> miss any WAL generated until it reconnects, and without archiving >> there's no way for the standby to get hold of the missed WAL.
>
> I described earlier that archiving would turn back on again if the > replication ever failed (with correct synchronisation).
>
> All I've asked for is the ability to turn archiving off and back on again, > yes, with synchronisation so it's safe.
>
> Personally, I think people will laugh if we tell them we decided to ship > all the data twice and couldn't see another way. That's the kind of > thing people give presentations at PGCon about...

OK, I will add such an archiving feature. My new design of archiving is as follows.

Primary
----------

I extend archive_mode as follows, so the user can choose the archiving strategy on the primary.

- always
The primary always archives the WAL. This is compatible with the current (<=8.3) archive_mode = on.

- none
The primary never archives the WAL. This is compatible with the current archive_mode = off.

- standalone
The primary doesn't archive the WAL only while replication is in progress; if replication is not in progress, the primary archives the WAL. That is, the primary switches modes whenever replication starts / ends.

[FLS->SLS]
When replication starts, the primary disables archiving *after* the switched WAL file is archived. WAL streaming doesn't need to wait for archiving to be disabled, so processing on the primary isn't blocked by the start of replication. But both WAL streaming and archiving would be in progress for a while (until the switched WAL file is archived) after replication starts.

[SLS->FLS]
When replication ends, the primary restarts archiving immediately. This also doesn't block processing on the primary. But it might cause the loss of some files from the archive if archiving is slow on the standby. Should the primary look for the last file archived (by the standby) in the archive and restart archiving from the subsequent file? Of course, the primary cannot archive a file that has already been removed on the primary.

Standby
-----------

I would add a new option for archiving during recovery into recovery.conf (recovery_archive_mode). Though this option is similar to archive_mode, merging them would confuse the user more, I think. Or should I merge them? And do you want to be able to configure the archive command only for recovery? If so, I would add a new option to specify the archive command during recovery (recovery_archive_command).

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
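To make the proposal concrete, the settings described above might look like this sketch; note that the "standalone" value and the two recovery_* options are only proposed here and exist in no released server:

    # primary postgresql.conf (proposed)
    archive_mode = standalone                # archive only while replication is not running
    archive_command = 'cp %p /archive/%f'    # illustrative

    # standby recovery.conf (proposed)
    recovery_archive_mode = on
    recovery_archive_command = 'cp %p /archive/%f'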
Hi,

On Thu, Dec 11, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2008-12-11 at 11:29 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>> > On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
>> >> Simon Riggs wrote:
>> >>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
>> >> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
>> >
>> > The WAL files will have already left the primary.
>> >
>> > The timeline is this, in my understanding:
>> > 1 [Primary] Set up continuous archiving
>> > 2 [Primary] Take base backup
>> > 3 [Standby] Connect to primary to initiate streaming
>> > 4 [Primary] Log switch and, optionally, turn off archiving
>> > 5 [Standby] Begin replaying files, initially from archive
>> > 6 [Standby] Switch to replaying WAL records immediately after streaming
>> >
>> > So sync rep would turn on after step 4, so that all intermediate WAL >> > files have been sent to the archive. If we lose the Primary after this >> > point then all transactions are accessible to the standby. If we lose the >> > Standby or Archive, then we need to replace them and re-run the above.
>>
>> Between steps 4 and 5, there's no guarantee that all WAL files generated >> after step 3 and the start of streaming have already been archived. >> There's a delay between writing a WAL file and when the file has been >> safely archived. If you lose the primary during that window, the standby >> will have old WAL files in the archive, and the most recent ones received >> by walreceiver, but it's missing the WAL files generated just before the >> switch to streaming mode.

Yes, since such a standby is unsafe, the user must not promote it to the primary. Then, the user has to stop the standby (without completing recovery), restart the primary, and restart the standby.

> I was presuming that the synchronisation was clear, but I'm sorry it > wasn't. Sync rep would begin only *after* the last WAL file was > archived.

Agreed. In order for the user to confirm whether replication began or not, we might need to log the name of the switched WAL file.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-12-10 at 15:06 -0500, Aidan Van Dyk wrote:

> Call me thick, but I'm confused... In sync rep, there *can't be* any > catching up to do... i.e. if the "slave" isn't accepting the WAL the > master "stops" doing *anything*...

In normal/steady state, yes, you are correct. But there is more...

The simplest way to configure a standby would be to freeze the primary while we set up the standby and then go straight into normal/steady state. That could mean hours of downtime for large databases, which is unacceptable in a feature aimed at increasing availability. So we need to allow the primary to continue working while the standby is set up. That then creates a log gap between the LSN of the primary and the LSN of the standby, which must be resolved.

So the catch-up occurs during the transient initial phase when the standby is catching up with the primary, before they continue together in normal/steady state.

Most of the architectural discussion over the last few months has been about the need for the initial state and how to handle it. Most of the code complexity also.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 19:19 +0900, Fujii Masao wrote:

> > All I've asked for is the ability to turn archiving off and back on again, > > yes, with synchronisation so it's safe.
> (snip)
> OK, I will add such an archiving feature. My new design of archiving is as follows.
>
> Primary
> ----------
> I extend archive_mode as follows, so the user can choose the > archiving strategy on the primary.
>
> - always
> The primary always archives the WAL. This is compatible with the current (<=8.3) > archive_mode = on.
>
> - none
> The primary never archives the WAL. This is compatible with the current > archive_mode = off.
>
> - standalone
> The primary doesn't archive the WAL only while replication is in progress; if > replication is not in progress, the primary archives the WAL. That is, the > primary switches modes whenever replication starts / ends.
>
> [FLS->SLS]
> When replication starts, the primary disables archiving *after* the switched > WAL file is archived. WAL streaming doesn't need to wait for archiving to be > disabled, so processing on the primary isn't blocked by the start of > replication. But both WAL streaming and archiving would be in progress > for a while (until the switched WAL file is archived) after > replication starts.

I'm OK with that, but that is slightly different from what Heikki had said in relation to the point at which sync rep begins on the primary, so he may have a different view.

synchronous_replication means "if a standby server has connected to us we will wait for all WAL associated with a transaction to be transferred prior to commit". So there is never a 100% guarantee that the transaction is safe, just an "if possible, 100%". So this implements the equivalent of DRBD Protocol A and B. Do we have an option to allow the WALreceiver to fsync the WAL file after a commit is received, which would make it equivalent to Protocol C? If we don't, I'm OK with that, since it reduces performance so much it isn't a practical option in many cases. http://www.drbd.org/users-guide/s-replication-protocols.html

> [SLS->FLS]
> When replication ends, the primary restarts archiving immediately. This > also doesn't block processing on the primary. But it might cause the > loss of some files from the archive if archiving is slow on the standby. > Should the primary look for the last file archived (by the standby) in > the archive and restart archiving from the subsequent file? Of course, > the primary cannot archive a file that has already been removed on the primary.

The standby will always have kept enough files to allow it to restart from the last restartpoint, so a gap in the file sequence is unlikely. As long as we archive the WAL file that contains the last LSN we transferred before streaming failed. That conceivably might mean we need to write a .ready message after a WAL file is filled, which might mean we have problems if the replication timeout is longer than the checkpoint timeout, but that seems an unlikely configuration. And if anybody has a problem with that we just recommend they use the "always" mode.

> Standby
> -----------
> I would add a new option for archiving during recovery into recovery.conf > (recovery_archive_mode). Though this option is similar to archive_mode, > merging them would confuse the user more, I think. Or should I merge them? > And do you want to be able to configure the archive command only for recovery? > If so, I would add a new option to specify the archive command during > recovery (recovery_archive_command).
I think if you really want two archives, or archiving during recovery, then this is desirable to avoid confusion. Explaining all this in the docs will be fun. :-)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
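For reference, the DRBD analogy above maps onto the settings roughly as follows (a sketch; every name here other than synchronous_replication is hypothetical):

    synchronous_replication = off    # fully asynchronous; loosely DRBD protocol A
    synchronous_replication = on     # commit waits for transfer to the standby: protocol B
    # protocol C would additionally need the walreceiver to fsync before
    # acknowledging, e.g. a hypothetical walreceiver_fsync = on, which as
    # noted above may cost too much performance to be practical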
* Simon Riggs <simon@2ndQuadrant.com> [081211 05:45]:

> On Wed, 2008-12-10 at 15:06 -0500, Aidan Van Dyk wrote:
>
> > Call me thick, but I'm confused... In sync rep, there *can't be* any > > catching up to do... i.e. if the "slave" isn't accepting the WAL the > > master "stops" doing *anything*...
>
> In normal/steady state, yes, you are correct. But there is more...
>
> The simplest way to configure a standby would be to freeze the primary > while we set up the standby and then go straight into normal/steady > state. That could mean hours of downtime for large databases, which is > unacceptable in a feature aimed at increasing availability. So we need > to allow the primary to continue working while the standby is set up. > That then creates a log gap between the LSN of the primary and the LSN > of the standby, which must be resolved.
>
> So the catch-up occurs during the transient initial phase when the standby is > catching up with the primary, before they continue together in normal/steady > state.

But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep". So, if I start PostgreSQL in sync rep mode, without any capable clients to rep with....

But I'd rather be buggered there than find out tonight at 3am that it was in sync rep mode but wasn't really doing sync rep, because I'd messed up something somewhere (firewall, config, password, anything) and there was no "caught up" client at the time, and I've just lost a day's worth of my $$$$$ transactions...

> Most of the architectural discussion over the last few months has been about > the need for the initial state and how to handle it. Most of the code > complexity also.

Well, for me, I'm quite happy with a "restart/stop&start" being a necessary "downtime" to move to synchronous replication. This way, I could see a "setup" routine that looks like:

1) Current "production" DB does normal backups/PITR/WAL archiving
2) I set up a new "slave", which involves:
   - restore from backup + WAL recovery (pg_standby type)
   - could take days+++ - oh well....
3) Stop production
4) So, now the slave is caught up...
5) Start "production" now in sync rep mode as master
6) Start the slave in sync-rep mode as slave...

So downtime would be limited to the time from the old postmaster shutdown to the time the slave has replayed the last WAL and connected to the restarted postmaster as a sync rep slave...

Or am I way too naive to think that a small downtime to "switch" from non-sync-rep to sync-rep is acceptable...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
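Step 2's catch-up phase in this routine is the standard warm-standby recipe; a minimal sketch of the slave's recovery.conf during that phase (the archive path is illustrative):

    # slave recovery.conf while catching up from the WAL archive
    restore_command = 'pg_standby /path/to/archive %f %p'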
* Fujii Masao <masao.fujii@gmail.com> [081211 05:25]:

> - standalone
> The primary doesn't archive the WAL only while replication is in progress; if > replication is not in progress, the primary archives the WAL. That is, the > primary switches modes whenever replication starts / ends.

That scares the heebie-jeebies out of me... I'm doing sync-rep because I *really* *want* *my* *data* .... *always* ... I want sync-rep because I'm going to get even *stronger* guarantees on my data (and, if hot-standby works out, load balancing too, but that's not *my* primary desire for sync-rep)...

But I'm sure as hell *not* going to throw all my eggs into that slave's basket and do away with my WAL archive... Would anyone actually use that "standalone" mode, and if not, why complicate the code for it?

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:

> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".

Not true. Please reread the thread where Heikki questions that and I reply. This was Fujii-san's idea, which I now agree with.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 09:37 -0500, Aidan Van Dyk wrote:

> * Fujii Masao <masao.fujii@gmail.com> [081211 05:25]:
>
> > - standalone
> > The primary doesn't archive the WAL only while replication is in progress; if > > replication is not in progress, the primary archives the WAL. That is, the > > primary switches modes whenever replication starts / ends.
>
> But I'm sure as hell *not* going to throw all my eggs into that slave's > basket and do away with my WAL archive... Would anyone actually use > that "standalone" mode, and if not, why complicate the code for it?

Sending data twice is not a requirement I ever heard expressed, nor has the lack of ability to send it twice been voiced as a criticism for any form of replication I'm familiar with. Ask the DRBD guys if sending data twice is necessary or required to make replication work.

If multiple people think it's a good idea then I respect your choice of option.

But I also think that many or perhaps most people will choose not to send data twice, and I respect that choice of option also.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [081211 10:03]:

> Sending data twice is not a requirement I ever heard expressed, nor has > the lack of ability to send it twice been voiced as a criticism for any > form of replication I'm familiar with. Ask the DRBD guys if sending data > twice is necessary or required to make replication work.
>
> If multiple people think it's a good idea then I respect your choice of > option.
>
> But I also think that many or perhaps most people will choose not to > send data twice, and I respect that choice of option also.

Well, PostgreSQL has WAL, so we've already accepted the notion of "send data twice" being useful sometimes... But I would note that the "archive" and "streaming" are both sending the data to *different* places... or at least, in my case they would be...

And, also, I know WAL archiving isn't necessary for replication to work, but it's necessary for me to sleep comfortably at night ;-)

I'm just surprised that people are willing to throw away their backup/PITR archiving once they have a single "live slave" up.

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [081211 10:09]:

> Simon Riggs wrote:
>> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>>
>>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>>
>> Not true. Please reread the thread where Heikki questions that and I >> reply. This was Fujii-san's idea, which I now agree with.
>
> I think the confusion here is about what exactly "sync rep" means in > this situation. It's true that you can start streaming the WAL before > the standby has fully caught up. But from the client's point of view, > there's not much point in streaming the log *synchronously* and making > the client wait for the acknowledgment from the standby, if the > acknowledgment from the standby that WAL has been streamed up to point X > doesn't actually guarantee that the slave can recover all the way to > that point.

Quite possibly a terminology problem.. In my case I said "sync rep" meaning the mode such that the transaction doesn't commit successfully for my PG client until the xlog record has been "streamed" to the slave... and I understand from Fujii-san's presentation at PGCon that there could be possible variants on when the "streamed" is considered done, based on network, slave RAM, disk, application, etc.

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 2008-12-11 at 17:07 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
> >
> >> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
> >
> > Not true. Please reread the thread where Heikki questions that and I > > reply. This was Fujii-san's idea, which I now agree with.
>
> I think the confusion here is about what exactly "sync rep" means in > this situation. It's true that you can start streaming the WAL before > the standby has fully caught up.

Yep.

> But from the client's point of view, > there's not much point in streaming the log *synchronously* and making > the client wait for the acknowledgment from the standby, if the > acknowledgment from the standby that WAL has been streamed up to point X > doesn't actually guarantee that the slave can recover all the way to > that point.

I disagree. This morning I showed it was possible, given the synchronisation I outlined. There is a slight relaxation of that in the current proposal, so you need to take that up if you see any problem there.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>
>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>
> Not true. Please reread the thread where Heikki questions that and I > reply. This was Fujii-san's idea, which I now agree with.

I think the confusion here is about what exactly "sync rep" means in this situation. It's true that you can start streaming the WAL before the standby has fully caught up. But from the client's point of view, there's not much point in streaming the log *synchronously* and making the client wait for the acknowledgment from the standby, if the acknowledgment from the standby that WAL has been streamed up to point X doesn't actually guarantee that the slave can recover all the way to that point.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 19:19 +0900, Fujii Masao wrote:

> My new design of archiving is as follows.

So far I haven't asked about running multiple standby servers, and I don't recall seeing it mentioned anywhere. Forgive me if it was.

The idea is that we would be able to have multiple standby servers connecting to one primary, yes? It would be useful to have sync replication work such that it must get an acknowledgement from at least one standby before it continues.

Or do you think we would stream to just one standby, then use the archiver (primary or standby) to keep sending files to allow multiple additional standby nodes?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi,

On Fri, Dec 12, 2008 at 12:15 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [081211 10:09]:
>> Simon Riggs wrote:
>>> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>>>
>>>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>>>
>>> Not true. Please reread the thread where Heikki questions that and I >>> reply. This was Fujii-san's idea, which I now agree with.
>>
>> I think the confusion here is about what exactly "sync rep" means in >> this situation. It's true that you can start streaming the WAL before >> the standby has fully caught up. But from the client's point of view, >> there's not much point in streaming the log *synchronously* and making >> the client wait for the acknowledgment from the standby, if the >> acknowledgment from the standby that WAL has been streamed up to point X >> doesn't actually guarantee that the slave can recover all the way to >> that point.
>
> Quite possibly a terminology problem.. In my case I said "sync rep" > meaning the mode such that the transaction doesn't commit successfully > for my PG client until the xlog record has been "streamed" to the > slave... and I understand from Fujii-san's presentation at PGCon that > there could be possible variants on when the "streamed" is considered > done, based on network, slave RAM, disk, application, etc.

I'd like to define the meaning of "synch rep" again. "synch rep" means:

(1) Transaction commit waits for WAL records to be replicated to the standby before the command returns a "success" indication to the client.
(2) The standby has (can read) all WAL files indispensable for recovery.

If both are true, your system is in "synch rep"; you can perform failover safely, without any transaction loss, whenever the primary falls down. On the other hand, if either is false, your system is not in "synch rep" but "standalone"; the failure of the primary might cause a certain transaction loss.

Starting the standby doesn't mean "synch rep" directly. We have to wait for (1) *and* (2) after starting the standby. (1) is reported as a server log message, so we can wait for (1). (2) is somewhat complicated: if an archive is shared, the server log message for archiving indicates (2); otherwise, the copy operation (copying the indispensable WAL files from the primary to the standby) by the user or clusterware indicates (2). But, as Simon pointed out, since many people share an archive, they should monitor only the server log messages. Or, should I create a feature for the user to confirm whether it's in "synch rep" via SQL?

Since there is a little delay between (1) and (2), we could do WAL streaming asynchronously during only that delay, as Heikki pointed out. But I'm not sure it's worth trying.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi,

> The idea is that we would be able to have multiple standby servers > connecting to one primary, yes? It would be useful to have sync > replication work such that it must get an acknowledgement from at least one > standby before it continues.

No, in my current patch, only one standby can perform WAL streaming. Of course, yes in the future (8.5?).

> Or do you think we would stream to just one standby, then use the > archiver (primary or standby) to keep sending files to allow multiple > additional standby nodes?

Interesting! And yes, we can.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
* Fujii Masao <masao.fujii@gmail.com> [081211 23:00]:

> Hi,
> Or, should I > create a feature for the user to confirm whether it's in "synch rep" via SQL?

I don't need a way to check via SQL, but I'd love a postgresql.conf option that, when set, would make sure that all connections pretty much just hang until a slave has connected and everything is set up for "sync rep". I think I saw that you're using "normal" connection setup to start the WAL streaming to the slave, so you have to allow connections, but I'd really not want any of my pg-clients able to do anything if sync-rep isn't happening...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Hi,

On Fri, Dec 12, 2008 at 1:34 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> * Fujii Masao <masao.fujii@gmail.com> [081211 23:00]:
>> Hi,
>
>> Or, should I >> create a feature for the user to confirm whether it's in "synch rep" via SQL?
>
> I don't need a way to check via SQL, but I'd love a postgresql.conf > option that, when set, would make sure that all connections pretty much > just hang until a slave has connected and everything is set up for "sync > rep". I think I saw that you're using "normal" connection setup to start > the WAL streaming to the slave, so you have to allow connections, but > I'd really not want any of my pg-clients able to do anything if > sync-rep isn't happening...

How about stopping the requests / connections from clients in front of postgres (e.g. in connection pooling software)? Or, we should first develop a feature like Oracle's OFFLINE, apart from Synch Rep.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2008-12-12 at 12:53 +0900, Fujii Masao wrote:

> > Quite possibly a terminology problem.. In my case I said "sync rep" > > meaning the mode such that the transaction doesn't commit successfully > > for my PG client until the xlog record has been "streamed" to the > > slave... and I understand from Fujii-san's presentation at PGCon that > > there could be possible variants on when the "streamed" is considered > > done, based on network, slave RAM, disk, application, etc.
>
> I'd like to define the meaning of "synch rep" again. "synch rep" means:
>
> (1) Transaction commit waits for WAL records to be replicated to the standby > before the command returns a "success" indication to the client.
> (2) The standby has (can read) all WAL files indispensable for recovery.

I would change "can read" in (2) to "has access to". "Can read" implies we have read all files and checked the CRCs of individual records.

The crux of this is what we mean by "synchronous_replication = on". There are two possible meanings:

1. Commit will wait only if streaming is available and has waited for all necessary startup conditions. This provides "Highest Availability".

2. Commit will wait *until* full sync rep is available. So we don't allow commits before the standby is available, and we also don't allow them if the standby goes down. This provides "Highest Transaction Durability", though it is fairly fragile. Other systems recommend use of multiple standby nodes if this option is selected.

Perhaps we should add this as a third option to synchronous_replication, so we have off, on, or only.

So far I realise I've been talking exclusively about (1). In that mode synchronous_replication = on would wait for streaming to complete even if the last WAL file has not been fully transferred.

For (2) we need a full interlock. Given that we don't currently support multiple streamed standby servers, there seems not much point in implementing the interlock (2) would require. Should we leave that part for 8.5, or do it now?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
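As a sketch of the three-valued setting proposed above (the "only" value is the new proposal here; nothing beyond off/on exists in the patch as described):

    synchronous_replication = off     # never wait for the standby
    synchronous_replication = on      # wait if streaming is available: "Highest Availability"
    synchronous_replication = only    # refuse commits unless full sync rep is in effect:
                                      # "Highest Transaction Durability"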
* Simon Riggs <simon@2ndQuadrant.com> [081212 08:20]:

> 2. Commit will wait *until* full sync rep is available. So we don't > allow commits before the standby is available, and we also don't allow > them if the standby goes down.
> This provides "Highest Transaction Durability", though it is fairly > fragile. Other systems recommend use of multiple standby nodes if this > option is selected.

yes please!

> Perhaps we should add this as a third option to synchronous_replication, > so we have off, on, or only.
>
> So far I realise I've been talking exclusively about (1). In that mode > synchronous_replication = on would wait for streaming to complete even > if the last WAL file has not been fully transferred.

Seems reasonable...

> For (2) we need a full interlock. Given that we don't currently support > multiple streamed standby servers, there seems not much point in > implementing the interlock (2) would require. Should we leave that part > for 8.5, or do it now?

Ugh... If all sync-rep is going to give is "if it's working, the commit made it to the slaves, but it might not be working [anymore|yet], but you (the app using pg) have no way of knowing...", that sort of defeats the point ;-)

I'd love multiple slaves, but I understand that's not in the current work, and I understand that it might be hard with the accept & become walsender approach. It should be very easy to make a walsender handle "multiple" slaves, with voting on quorum/etc. as "successfully on slave", except that we need to get the multiple connections to the walsender backend...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2008-12-12 at 12:53 +0900, Fujii Masao wrote:

> Or, should I create a feature for the user to confirm whether it's in > "synch rep" via SQL?

I think this would be useful.

Regards, Jeff Davis
On Fri, 2008-12-12 at 08:57 -0500, Aidan Van Dyk wrote:

> > For (2) we need a full interlock. Given that we don't currently support > > multiple streamed standby servers, there seems not much point in > > implementing the interlock (2) would require. Should we leave that part > > for 8.5, or do it now?
>
> Ugh... If all sync-rep is going to give is "if it's working, the commit > made it to the slaves, but it might not be working [anymore|yet], but you > (the app using pg) have no way of knowing...", that sort of defeats the > point ;-)

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00865.php

Fujii Masao offers to provide a SQL function that will tell you definitively whether you are in full sync rep or some degraded mode. I assume that there will also be server log messages to identify whether you ever left sync rep mode.

Regards, Jeff Davis
* Jeff Davis <pgsql@j-davis.com> [081212 13:41]:

> On Fri, 2008-12-12 at 08:57 -0500, Aidan Van Dyk wrote:
> > > For (2) we need a full interlock. Given that we don't currently support > > > multiple streamed standby servers, there seems not much point in > > > implementing the interlock (2) would require. Should we leave that part > > > for 8.5, or do it now?
> >
> > Ugh... If all sync-rep is going to give is "if it's working, the commit > > made it to the slaves, but it might not be working [anymore|yet], but you > > (the app using pg) have no way of knowing...", that sort of defeats the > > point ;-)
>
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00865.php
>
> Fujii Masao offers to provide a SQL function that will tell you > definitively whether you are in full sync rep or some degraded mode. I > assume that there will also be server log messages to identify whether > you ever left sync rep mode.

So when would I have to call that function? Before begin, after begin, before commit, or all of them, to guarantee that I know that my application is supposed to "delay" calling commit until sync mode is actually synchronous? And then afterwards, I have to call it again to make sure it didn't fall "out of" mode between my previous call and the commit actually working?

Bugger it, then I'll have to patch every single app/query that writes transactions to the database to be "sync rep" aware... And if I miss one...

Some might say that if the data's that important, that auditing/patching to be "sync rep" aware is worth it, but then I guess they'd say that then you might as well do application-level replication as well ;-)

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2008-12-12 at 14:23 -0500, Aidan Van Dyk wrote: > So when would I have to call that function? Before begin, after begin, > before commit, or all of them, to guarantee that I know my application is > supposed to "delay" calling commit until sync-mode is actually > synchronous? And then afterwards, I have to call it again to make sure > it didn't fall "out of" mode between my previous call and the commit > actually working? I'm not suggesting that applications call the function. It's a way for a monitoring system to know that you're in a degraded state and notify you. I'm not sure I entirely understand the use case you're advocating: Let's say the standby has a major failure. Now you have a single point of failure (the primary), so _all_ of your transactions are in jeopardy anyway -- at least until you get back into sync rep. Rejecting new transactions won't save your old ones. The only time it helps is when the failure is temporary, i.e. you didn't really lose the storage on the standby. But you would need to rely on some guarantee that the storage is still intact on the standby system even though the standby is unresponsive. Is that the use case? Regards, Jeff Davis
Hi, Fujii Masao wrote: > I'd like to define the meaning of "synch rep" again. "synch rep" means: > > (1) Transaction commit waits for WAL records to be replicated to the standby > before the command returns a "success" indication to the client. > > (2) The standby has (can read) all WAL files indispensable for recovery. Let me point out that - very much like the original Postgres-R algorithm - this guarantees committed transactions to be durable and consistent (no late aborts of conflicting transactions), but it does not guarantee that a transaction committed on one node is immediately visible on the other node. In that sense, it is not synchronous as commonly understood, because it does not "operate with all their parts in synchrony" [1], as implied by the term "synchronous". This might lead (and often has led in the past) to confusion. It's certainly enough of a reason for me to rather use the term "eager replication". See [2] for a more in-depth explanation. I might also point out that Jan Wieck called this very same approach "an asynchronous replication system by all means" [3]. Regards Markus Wanner [1]: Wikipedia on Synchronization http://en.wikipedia.org/wiki/Synchronization [2]: Postgres-R general mailing list, by Markus Wanner, subject: terms for database replication: synchronous vs eager http://lists.pgfoundry.org/pipermail/postgres-r-general/2008-September/000014.html [3]: Postgres General mailing list, by Jan Wieck, subject: terms for database replication: synchronous vs eager http://archives.postgresql.org/pgsql-hackers/2007-09/msg00631.php
On Sat, 2008-12-13 at 00:00 +0100, Markus Wanner wrote: > Hi, > > Fujii Masao wrote: > > I'd like to define the meaning of "synch rep" again. "synch rep" means: > > > > (1) Transaction commit waits for WAL records to be replicated to the standby > > before the command returns a "success" indication to the client. > > > > (2) The standby has (can read) all WAL files indispensable for recovery. > > Let me point out that - very much like the original Postgres-R algorithm > - this guarantees committed transactions to be durable and consistent > (no late aborts of conflicting transactions), but it does not guarantee > that a transaction committed on one node is immediately visible on the > other node. In that sense, it is not synchronous as commonly understood, > because it does not "operate with all their parts in synchrony" [1], as > implied by the term "synchronous". This might lead (and often has led in > the past) to confusion. You're right that neither the data transfer nor data availability is entirely synchronous, but the data transfer is synchronous at the time of *commit*: it is recorded on multiple nodes at the same time. The term "synchronous replication" is already well used in the industry to mean synchronous commit, so I don't think we should change the name now. The project here is also known to everybody as "synch rep". * Oracle Data Guard calls it "synchronous redo transport" * MS Exchange calls it "synchronous replication" * MS SQL Server has "Database Mirroring", "Log Shipping" and "Replication". "Database Mirroring" provides the synchronous mechanism, with "Replication" meaning data transfer to other databases, publish & subscribe. * DB2 HADR provides "synchronous replication" * MySQL calls it "synchronous replication" What is confusing is that "replication" itself is a much abused term and is used to describe technologies for HA, DR and data movement. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > You're right that neither the data transfer nor data availability is > entirely synchronous, but the data transfer is synchronous at the time of > *commit*: it is recorded on multiple nodes at the same time. I'm unsure what you mean by a "data transfer being synchronous". To what other process or state should the data transfer be synchronous? > The term "synchronous replication" is already well used in the industry > to mean synchronous commit, so I don't think we should change the name > now. The project here is also known to everybody as "synch rep". I understand very well that you don't want to change the name. I've been hesitant to "relabel" Postgres-R from synchronous to asynchronous to eager. However, that is a marketing decision [1], which should not be mixed with the technical discussion here. Speaking of a "synchronous commit" is utterly misleading, because the commit itself is exactly the thing that's *not* synchronous. It *is* an optimization over fully synchronous replication to defer commit on the "slave" and only make sure that the transaction *can* be applied at some time in the future. However, this *does* have the drawback of transactions not being immediately visible on the slave. Often enough, this is acceptable. But it certainly matters to some application developers. > What is confusing is that "replication" itself is a much abused term and > is used to describe technologies for HA, DR and data movement. I absolutely agree with that. And I'm thus recommending to at least be consistent and honest with the term "synchronous": point out that WAL writing is synchronous for the log shipping approach here (AFAIK), but that the commit is asynchronous for performance reasons. In other words: this approach is certainly (and hopefully, for performance reasons) different from a fully synchronous approach. Even for marketing reasons, it might make sense to point out that difference (.. "no, we are faster than fully sync rep."). Regards Markus Wanner [1]: Some people like the term "virtually synchronous" for marketing purposes. That's at least halfway technically correct.
On 2008-12-13, at 13:07, Markus Wanner wrote: > > > However, that is a marketing decision [1], which should not be mixed > with the technical discussion here. Speaking of a "synchronous commit" > is utterly misleading, because the commit itself is exactly the thing > that's *not* synchronous. > > [1]: Some people like the term "virtually synchronous" for marketing > purposes. That's at least half-ways technically correct. Marketing people are virtually trustworthy, from my life experience. If you ask me, this is just preposterous.
On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: > Speaking of a "synchronous commit" > is utterly misleading, because the commit itself is exactly the thing > that's *not* synchronous. Not really sure where you're going here. "synchronous replication" is used exactly as described in the Wikipedia entry here: http://en.wikipedia.org/wiki/Database_replication No two-word phrase is going to accurately sum up the complexity and potential for data loss in these situations. DRBD saw that too and just called them A, B and C and then described them more accurately. But I don't think we should say "PostgreSQL just implemented algorithm B", which is just unhelpful. I don't think it's "marketing" to refer to it by the phrase most commonly used for the technology we are building. Nobody suggested we call it "wizrep" or suchlike... The docs can contain the exact description of data loss and timing windows. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: >> Speaking of a "synchronous commit" >> is utterly misleading, because the commit itself is exactly the thing >> that's *not* synchronous. > > Not really sure where you're going here. I'm pointing to a potential misunderstanding, trying to help prevent you from running into the same issues and discussions as I did. I've learned the hard way that the Postgres-R algorithm is not fully synchronous (in the strict sense). This caused confusion for people who take the word "synchronous" by its original meaning. The algorithm proposed here seems similar enough to potentially cause the same confusion. As I see it now, I think it's well worth pointing out the difference, from both the technical and the marketing perspective. The former for better understanding, the latter to prevent users from thinking it must be slow by definition. Arguing that your approach is not fully synchronous definitely helps defending that concern. However, I'm just now realizing that the difference is only relevant as soon as you begin to allow read-only access on the slave. AFAIK that's among the goals of this effort, no? > "synchronous replication" is > used exactly as described in the Wikipedia entry here: > http://en.wikipedia.org/wiki/Database_replication That article describes pretty much all variants of replication; what exactly are you referring to? Under "Database Replication > Multi-Master replication" it describes eager vs lazy variants, which is IMO a more appropriate and useful distinction than sync vs async. (But that's admittedly a sentence I've contributed myself, IIRC.) Under "Storage Replication > Synchronous Replication" one can read: "Write is not considered complete until acknowledgement by both local and remote storage." For the proposed approach this might hold true for WAL writing. However, the user certainly doesn't care how synchronously the log is shipped or written, as long as she doesn't see the changes on the slave. That's the difference between fully synchronous and eager (or virtually or approximately synchronous) algorithms. You seem to refer to both as "synchronous". Phrases like "synchronous commit" or "synchronous data transfer" do not help me to understand what exactly you are talking about. Explaining that the slave commits (and therefore makes the transactions visible) asynchronously would help. And it would prevent disappointment for users who expect changes to be immediately visible on the slave. > No two-word phrase is going to accurately sum up the complexity and > potential for data loss in these situations. DRBD saw that too and just > called them A, B and C and then described them more accurately. Agreed. I've chosen lazy, eager and sync, so far. I'm open to better terms, and I leave it up to you to call your variants whatever you like. But to understand what you are talking about, I'd prefer to get these distinctions crisp and clear. > But I don't think we should say "PostgreSQL just implemented algorithm > B", which is just unhelpful. I don't think it's "marketing" to refer to it > by the phrase most commonly used for the technology we are building. I certainly agree to using such terms. Unfortunately, in my experience, synchronous replication is commonly used to mean that transactions are guaranteed to be immediately visible on remote nodes after the client got commit acknowledgment. That's the cause for confusion I'm envisioning.
I'm hoping to be somewhat helpful to this effort of getting a log shipping replication variant into Postgres. It can only be beneficial for Postgres-R in that we gain field experience with ..uhm.. this special kind of replication, however we name it. I'm already on xmas vacation, so I won't bother you any further on this issue. Have fun coding and make sure to enjoy this time of the year. All the best. Markus Wanner
> I certainly agree to using such terms. Unfortunately, in my experience, > synchronous replication is commonly used to mean that transactions are > guaranteed to be immediately visible on remote nodes after the client > got commit acknowledgment. That's the cause for confusion I'm envisioning. I think that's a very important point. It's very possible that 8.4 may support both this feature and Hot Standby (although the latter seems to have stalled a bit...). That makes me think "oh, great, I can offload any subset of my read-only queries to the standby". Not so fast. I think we need to reserve the term "synchronous replication" for a system where transactions that begin at the same time on the primary and standby see the same tuples. Clearly that is "more" synchronous than what is being proposed here; if we call this "synchronous replication", what will we call that? "Really Synchronous, Honest, No Kidding"? Admittedly, we may never implement that feature, but that seems irrelevant. It would be useful to have names for all the different possibilities. Random ideas:

Log Shipping. After each log switch, the previous WAL log is copied to the standby in its entirety.

WAL Streaming - Asynchronous. The WAL log is streamed from master to standby as it is written, but transactions on the master never wait.

WAL Streaming - Synchronous Receive. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges receipt of the WAL.

WAL Streaming - Synchronous Write. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that the WAL has been written to disk.

WAL Streaming - Synchronous Apply. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that WAL has been written to disk and applied.

...Robert
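One way to picture these levels is as values of a single setting that controls how long commit waits. The setting name and its values below are assumptions made for the sketch only; the patch under review defines none of them:

    -- Hypothetical sketch: synchronous_replication and its values are
    -- assumed names, mapping one-to-one onto the taxonomy above.
    SET synchronous_replication = 'async';  -- streaming, no commit wait
    SET synchronous_replication = 'recv';   -- wait until standby received WAL
    SET synchronous_replication = 'fsync';  -- wait until standby wrote WAL to disk
    SET synchronous_replication = 'apply';  -- wait until standby applied WAL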
"Robert Haas" <robertmhaas@gmail.com> writes: > I think we need to reserve the term "synchronous replication" for a > system where transactions that begin at the same time on the primary > and standby see the same tuples. Clearly that is "more" synchronous > than what is being proposed here; if we call this "synchronous > replication", what will we call that? "Really Synchronous, Honest, No > Kidding"? Admittedly, we may never implement that feature, but that > seems irrelevant. We won't call it anything, because we never will or can implement that. See the theory of relativity: the notion of exactly simultaneous events at distinct locations isn't even well-defined, because observers at yet other locations will disagree about what is "simultaneous". And I'm not just making a joke here --- speed-of-light delays in a WAN are meaningful compared to current computer speeds. In practice, the slave and the master will never commit at exactly the same time. I agree with the point made upthread that we should use the term "synchronous replication" the way it's commonly used in the industry. Inventing our own terminology might be fun but it's not really going to result in less confusion. regards, tom lane
Synchronous replication, "sync rep", is *not* interested in the "slave's visibility of the commit", because PostgreSQL doesn't "serve" requests when in recovery (wal receiving) mode *now*. This sync rep patch/proposal/discussion is *strictly* (at this point, though hot standby may eventually or hopefully soon change that) the means to get the data "safely in 2 separate places", before the COMMIT returns, by means of wal streaming. That "safely in 2 places" can have various implementation options (like received, on disk, or applied), and Fujii-san explained some of the options as to what to consider "safe" and their trade-offs in his presentation last year. Once both sync-rep (the wal-streaming getting changes into two places) and hot-standby (run queries while WAL is being applied) are available in PostgreSQL, at that point we might need to start considering "other client visibility", but even then, we still don't need to worry about multi-master options... a. * Markus Wanner <markus@bluegap.ch> [081213 12:17]: > Hi, > > Simon Riggs wrote: > > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: > >> Speaking of a "synchronous commit" > >> is utterly misleading, because the commit itself is exactly the thing > >> that's *not* synchronous. > > > > Not really sure where you're going here. > > I'm pointing to a potential misunderstanding, trying to help prevent > you from running into the same issues and discussions as I did. > > I've learned the hard way that the Postgres-R algorithm is not fully > synchronous (in the strict sense). This caused confusion for people who > take the word "synchronous" by its original meaning. The algorithm > proposed here seems similar enough to potentially cause the same confusion. > > As I see it now, I think it's well worth pointing out the difference, > from both the technical and the marketing perspective. The > former for better understanding, the latter to prevent users from > thinking it must be slow by definition. Arguing that your approach is > not fully synchronous definitely helps defending that concern. > > However, I'm just now realizing that the difference is only relevant as > soon as you begin to allow read-only access on the slave. AFAIK that's > among the goals of this effort, no? > > > "synchronous replication" is > > used exactly as described in the Wikipedia entry here: > > http://en.wikipedia.org/wiki/Database_replication > > That article describes pretty much all variants of replication; what > exactly are you referring to? > > Under "Database Replication > Multi-Master replication" it describes > eager vs lazy variants, which is IMO a more appropriate and useful > distinction than sync vs async. (But that's admittedly a sentence I've > contributed myself, IIRC.) > > Under "Storage Replication > Synchronous Replication" one can read: > "Write is not considered complete until acknowledgement by both local > and remote storage." For the proposed approach this might hold true for > WAL writing. However, the user certainly doesn't care how synchronously > the log is shipped or written, as long as she doesn't see the > changes on the slave. > > That's the difference between fully synchronous and eager (or virtually > or approximately synchronous) algorithms. You seem to refer to both as > "synchronous". Phrases like "synchronous commit" or "synchronous data > transfer" do not help me to understand what exactly you are talking about.
> > Explaining that the slave commits (and therefore makes the transactions > visible) asynchronously would help. And it would prevent disappointment > for users who expect changes to be immediately visible on the slave. > > > No two-word phrase is going to accurately sum up the complexity and > > potential for data loss in these situations. DRBD saw that too and just > > called them A, B and C and then described them more accurately. > > Agreed. I've chosen lazy, eager and sync, so far. I'm open to better > terms, and I leave it up to you to call your variants whatever you like. > But to understand what you are talking about, I'd prefer to get > these distinctions crisp and clear. > > > But I don't think we should say "PostgreSQL just implemented algorithm > > B", which is just unhelpful. I don't think it's "marketing" to refer to it > > by the phrase most commonly used for the technology we are building. > > I certainly agree to using such terms. Unfortunately, in my experience, > synchronous replication is commonly used to mean that transactions are > guaranteed to be immediately visible on remote nodes after the client > got commit acknowledgment. That's the cause for confusion I'm envisioning. > > > I'm hoping to be somewhat helpful to this effort of getting a log > shipping replication variant into Postgres. It can only be beneficial > for Postgres-R in that we gain field experience with ..uhm.. this > special kind of replication, however we name it. > > I'm already on xmas vacation, so I won't bother you any further on this > issue. Have fun coding and make sure to enjoy this time of the year. > > All the best. > > Markus Wanner -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote: > Hot Standby (although the latter > seems to have stalled a bit...) It's just being worked on asynchronously. ;-) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote: > > I certainly agree to using such terms. Unfortunately, in my experience, > > synchronous replication is commonly used to mean that transactions are > > guaranteed to be immediately visible on remote nodes after the client > > got commit acknowledgment. That's the cause for confusion I'm envisioning. > > I think that's a very important point. It's very possible that 8.4 > may support both this feature and Hot Standby (although the latter > seems to have stalled a bit...). That makes me think "oh, great, I > can offload any subset of my read-only queries to the standby". Not > so fast. > > I think we need to reserve the term "synchronous replication" for a > system where transactions that begin at the same time on the primary > and standby see the same tuples. Define "same time". You can have a variant of sync rep + hot standby where the master does not return committed before the slave has both synced the data and applied the transaction so that it is visible on the slave, but in that case you may have a use case where it is actually visible on the slave _before_ it is visible on the master. Actually, you can't have that "same time" guarantee even on a single system; that is, if you start two transactions on two connections "at the same time", you still can't be sure there is no third transaction which has committed between those two and which makes the visible data on those two different. > Clearly that is "more" synchronous > than what is being proposed here; if we call this "synchronous > replication", what will we call that? "Really Synchronous, Honest, No > Kidding"? Admittedly, we may never implement that feature, but that > seems irrelevant. > > It would be useful to have names for all the different possibilities. > Random ideas: > > Log Shipping. After each log switch, the previous WAL log is copied > to the standby in its entirety. > > WAL Streaming - Asynchronous. The WAL log is streamed from master to > standby as it is written, but transactions on the master never wait. > > WAL Streaming - Synchronous Receive. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges receipt of the WAL. > > WAL Streaming - Synchronous Write. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges that the WAL has been written to > disk. > > WAL Streaming - Synchronous Apply. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges that WAL has been written to disk > and applied. We could still call the Sync Rep feature "synchronous replication" on the basis that "WAL Streaming - Synchronous Write" is the highest security level achievable using the feature. And maybe have Sync Hot Standby as a feature on top of that which provides "WAL Streaming - Synchronous Apply". ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
On Sat, 2008-12-13 at 21:35 +0200, Hannu Krosing wrote: > We could still call the Sync Rep feature "synchronous replication" on > the basis that "WAL Streaming - Synchronous Write" is the highest security > level achievable using the feature. > > And maybe have Sync Hot Standby as a feature on top of that which > provides "WAL Streaming - Synchronous Apply" Or maybe better to call it Serializable Hot Standby, as the actual guarantee that can be achieved is this: when one client does something on the master and, after committing on the master, starts another transaction on the slave, then the effects of the query on the master are visible on the slave. -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hi, Tom Lane wrote: > We won't call it anything, because we never will or can implement that. > See the theory of relativity: the notion of exactly simultaneous events > at distinct locations isn't even well-defined That has never been the point of the discussion. It's rather about the question whether changes from transactions are guaranteed to be visible on remote nodes immediately after commit acknowledgment. Whether or not this is guaranteed, in both cases the term "synchronous replication" is commonly used, which is causing confusion. Regards Markus Wanner
Hi, Simon Riggs wrote: >> Hot Standby (although the latter >> seems to have stalled a bit...) > > It's just being worked on asynchronously. ;-) LOL, thanks for bringing humor into this discussion :-) Regards Markus Wanner
Hi, Hannu Krosing wrote: > You can have a variant of sync rep + hot standby where the master does > not return committed before the slave has both synced the data and > applied the transaction so that it is visible on the slave, but in that case > you may have a use case where it is actually visible on the slave _before_ > it is visible on the master. As long as it's not visible *before* the client requests a COMMIT, that certainly doesn't matter (because the application cannot check that). What matters is that an application might expect a node to show the changes of a transaction which has previously (seen from the application itself) been committed and acknowledged by another node. AFAICT the common understanding of synchronous replication is that all nodes confirm to have committed the changes of a transaction *before* acknowledging COMMIT to the application (and obviously only *after* the application requested to COMMIT the transaction, so the guarantee is that all nodes commit *sometime* within that time frame, which is certainly possible to guarantee; see 2PC approaches). This guarantee is not provided by the Postgres-R algorithm, nor by the approach presented. Both only guarantee that the transaction *will* get committed (and thus get visible) on all nodes *sometime* *after* the application requested to commit it (even in case of various failures, that is) [1]. As cited before, that has been enough of a reason for Jan Wieck to call Postgres-R asynchronous, and I certainly see his point. Note that the amount of time that passes between the commit acknowledgment and the actual commit on remote nodes may theoretically be infinitely long. And in practice certainly long enough for an application to notice the difference. However, it still is a practical optimization, because most applications should cope with it just fine. But not all... Do you consider the proposed log shipping approach to be synchronous? How about the Postgres-R algorithm? Regards Markus Wanner [1]: of course these approaches also guarantee that the transaction is committed on the local node *before* acknowledging commit, so that subsequent (seen from the application) queries are guaranteed to see the changes. But that guarantee only holds true for the local node.
* Markus Wanner <markus@bluegap.ch> [081213 16:33]: > Hi, > > Hannu Krosing wrote: > > You can have a variant of sync rep + hot standby where the master does > > not return committed before the slave has both synced the data and > > applied the transaction so that it is visible on the slave, but in that case > > you may have a use case where it is actually visible on the slave _before_ > > it is visible on the master. > > As long as it's not visible *before* the client requests a COMMIT, that > certainly doesn't matter (because the application cannot check that). Well, I think the PG MVCC (which wal-streaming just ships across somewhere else) will cover that. So with hot-standby you could have another client see the result *after* the COMMIT has been requested, but *before* the COMMIT returns... But we have this situation in a single current PG instance anyways, so it's nothing new.... But with hot-standby, I could also see that it could be done such that the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but because of a currently running query, application of it is delayed... But this is hot-standby's problem of describing itself, not sync-rep's. IMHO, sync-rep is about getting the change "durably to a slave" before acknowledging the COMMIT. That slave could be any number of things:
- A "WAL archive" type system having the ability to be used for recovery
- A PG with a special "recovery mode" that reads the stream and applies it
- A full hot-standby recovery
I could see any and all of those (and probably others) being useful and used. But the current patch focuses on the streaming (sending), and a receiver "recovery" mode that can accept/apply it, again, without worrying about actually running queries (yet)... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Markus Wanner wrote:
> Tom Lane wrote:
>> We won't call it anything, because we never will or can implement that.
>> See the theory of relativity: the notion of exactly simultaneous events
>> at distinct locations isn't even well-defined
> That has never been the point of the discussion. It's rather about the
> question whether changes from transactions are guaranteed to be visible
> on remote nodes immediately after commit acknowledgment. Whether or not
> this is guaranteed, in both cases the term "synchronous replication" is
> commonly used, which is causing confusion.

Might it not be true that anybody unfamiliar would be confused and that this is a bit of a straw man?

I don't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it, any more than committing to the same machine (no replication) guarantees that another session sees the change. Synchronous replication only means that I can be assured that my change has been saved permanently by the time my commit completes. It doesn't mean anybody else can see my change or is guaranteed to see my change if they query from another session.

If my application assumes that it can commit to one server, and then read back the commit from another server, and my application breaks as a result, it's because I didn't understand the problem. Even if PostgreSQL didn't use the word "synchronous replication", I could still be confused. I need to understand the problem no matter what words are used.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Hi, Aidan Van Dyk wrote: > Well, I think the PG MVCC (which wal-streaming just ships across > somewhere else) will cover that. So with hot-standby you could have > another client see the result *after* the COMMIT has been > requested, but *before* the COMMIT returns... But we have this > situation in a single current PG instance anyways, so it's nothing > new.... AFAIU the proposed algorithm only waits until WAL is written on the slave before acknowledging COMMIT. Application of the changes may be deferred, so it's not necessarily immediately visible on the slave. > But with hot-standby, I could also see that it could be done such that > the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but > because of a currently running query, application of it is delayed... But > this is hot-standby's problem of describing itself, not sync-rep's. I'm thinking of the overall system and don't care much if it's hot-standby's or sync-rep's problem. But it's certainly the master which needs to await certain acknowledgments from the slaves. That has so far been discussed within this sync-rep thread. > IMHO, sync-rep is about getting the change "durably to a slave" before > acknowledging the COMMIT. That slave could be any number of things: > - A "WAL archive" type system having the ability to be used for > recovery > - A PG with a special "recovery mode" that reads the stream and applies it > - A full hot-standby recovery > > I could see any and all of those (and probably others) being useful and > used. I certainly agree with that. Regards Markus Wanner
Hi, Mark Mielke wrote: > Might it not be true that anybody unfamiliar would be confused and that > this is a bit of a straw man? Might be. I've neglected the issue myself for a while. > I don't think synchronous replication guarantees that it will be > immediately visible. Even if it did push the change to the other > machine, and the other machine had committed it, that doesn't guarantee > that any reader sees it, any more than committing to the same machine > (no replication) guarantees that another session sees the change. AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? > Synchronous replication only means that I can be assured that > my change has been saved permanently by the time my commit completes. It > doesn't mean anybody else can see my change or is guaranteed to see my > change if they query from another session. So you wouldn't be surprised if a transaction from two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? > If my application assumes that it can commit to one server, and then > read back the commit from another server, and my application breaks as a > result, it's because I didn't understand the problem. Well, yeah, it depends on user expectations. I'm surprised to hear that you have that understanding of synchronous replication. > Even if PostgreSQL > didn't use the word "synchronous replication", I could still be > confused. I need to understand the problem no matter what words are used. As said, it depends on what the common understanding of "synchronous replication" is. I've so far been under the impression that these potential lags are unexpected and confusing. Several people pointed me at that problem and I've thus "relabeled" Postgres-R as not being synchronous. I'm at least surprised to suddenly get pushed in the other direction. :-) However, I absolutely agree that it's not that important how we name it. What is important is that users and developers understand the difference. Regards Markus Wanner
Markus Wanner wrote:<br /><blockquote cite="mid:494436AE.2080207@bluegap.ch" type="cite"><blockquote type="cite"><pre wrap="">Idon't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it any more than if I commit to the same machine (no replication), I am guaranteed to see the change from another session. </pre></blockquote><pre wrap=""> AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? </pre></blockquote><br /> Yes - but that's only really true while the sessioncontinues. From another session? I've never assumed that I could reconnect and be guaranteed to get the latest snapshotthat includes absolutely everything that has been committed.<br /><br /> Any system that guaranteed this even wheninvolving multiple machines would be guaranteed to be inefficient and difficult to scale in my opinion. How could anysystem promise to have reasonable commit times while also guaranteeing that once a commit completes, any session to anyother server will be able to see the commit? I think this forces some sort of serialization between multiple machinesand defeats the purpose of having multiple machines. Where before it was indeterminate to know when the commit wouldtake effect at each replica, it's not indeterminate when my commit will succeed. That is, my commit cannot succeed untilevery single server acknowledge that it is has fully received and committed my transaction. What happens if there arenetwork problems, or what happens if I am replicating over a slower link? What if I am committing to 100 servers? Is itreasonable to expect 100 server negotiations to complete in full before my own commit will return?<br /><br /><blockquotecite="mid:494436AE.2080207@bluegap.ch" type="cite"><blockquote type="cite"><pre wrap="">Synchronous replicationonly means that I can be assured that my change has been saved permanently by the time my commit completes. It doesn't mean anybody else can see my change or is guaranteed to see my change if the query from another session. </pre></blockquote><pre wrap="">So you wouldn't be surprised if a transactionfrom two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? </pre></blockquote><br /> Any system that is two hours behind shouldfall out of the pool used to satisfy reads from. So, if there was a surprise, it would be this. I don't believe ACIDrequires that a commit on one server is immediately visible on another server. Any work I do on the "behind" server wouldstill be safe from a transaction and referential integrity perspective. However, if I executed 'commit' on this "behind"server, I would expect the commit to wait until it catches up, or in the case of a 2 hour behind, I would expectthe commit to fail. Look at the alternative - all commits to any server in the pool would be locked up waiting forthis one machine to catch up on 2 hours of transaction. 
This emphasizes that the problem is that a server two hours out of date is still in the pool, rather than the problem being keeping things up-to-date.

>> If my application assumes that it can commit to one server, and then
>> read back the commit from another server, and my application breaks as a
>> result, it's because I didn't understand the problem.
> Well, yeah, it depends on user expectations. I'm surprised to hear that
> you have that understanding of synchronous replication.

I've seen people face it in the past. Most recently we had a presentation from the developer of digg.com, and he described how he had this problem with MySQL and that he had to work around it.

On a smaller scale and slightly unrelated, I had this problem frequently between memcache and PostgreSQL. That is, memcache would always be latest, but PostgreSQL might not be latest, because the commit had not occurred.

It seems like a standard enough problem to me. I don't expect Postgres-R to do the impossible. As with my previous paragraph, I don't expect Postgres-R to wait 2 hours to commit just because one server is falling behind.

>> Even if PostgreSQL didn't use the word "synchronous replication", I
>> could still be confused. I need to understand the problem no matter what
>> words are used.
> As said, it depends on what the common understanding of "synchronous
> replication" is. I've so far been under the impression that these
> potential lags are unexpected and confusing. Several people pointed me at
> that problem and I've thus "relabeled" Postgres-R as not being
> synchronous. I'm at least surprised to suddenly get pushed in the other
> direction. :-)
> However, I absolutely agree that it's not that important how we name it.
> What is important is that users and developers understand the difference.

I agree they are unexpected and confusing. I don't agree that they are unexpected or confusing to those knowledgeable in the domain. So, the question becomes - whose expectation is wrong? Should the user learn more? Or should we push for a change in terminology? Does it make sense for Postgres-R (which looks excellent to me BTW, at least in principle) to be marketed differently, because a few users tie "synchronous replication" to "serialized access"?

Because that's really what we're talking about - we're talking about transactions in all sessions being serialized between machines to provide less surprise to users who don't understand the complexity of having multiple replicas.

Forget replication - even for the exact same server - I don't expect that if I commit from one session, I will be able to see the change immediately from my other session or a new session that I just opened. Perhaps this is often stable to rely on, and it is useful for the database server to minimize the window during which the commit becomes visible to others, but I think it's a false expectation from the start that it absolutely will be immediately visible to another session. I'm thinking of situations where some part of the table is in cache.
The only way the commit can communicate that the new transaction is available is by communication between the processes or threads, or between the multiple CPUs on the machine. Do I want every commit to force each session to become fully in alignment before my commit completes? Does PostgreSQL make this guarantee today? I bet it doesn't if you look far enough into the guts. It might be very fast - I don't think it is infinitely fast.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: >> I think we need to reserve the term "synchronous replication" for a >> system where transactions that begin at the same time on the primary >> and standby see the same tuples. Clearly that is "more" synchronous > > We won't call it anything, because we never will or can implement that. > See the theory of relativity: the notion of exactly simultaneous events OK, fine. I'll be more precise. I think we need to reserve the term "synchronous replication" for a system where transactions that begin on the standby after the transaction has committed on the master see the effects of the committed transaction. > at distinct locations isn't even well-defined, because observers at yet > other locations will disagree about what is "simultaneous". And I'm > not just making a joke here --- speed-of-light delays in a WAN are > meaningful compared to current computer speeds. In practice, the > slave and the master will never commit at exactly the same time. > > I agree with the point made upthread that we should use the term > "synchronous replication" the way it's commonly used in the industry. > Inventing our own terminology might be fun but it's not really going > to result in less confusion. I just googled "synchronous replication" and read through the first page of hits. Most of them do not address the question of whether synchronous replication can be said to have completed when WAL has been received by the standby but not yet applied. One of the ones that does is: http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign ...which refers to what we're proposing to call "Synchronous Replication" as "Semi-Synchronous Replication" (or 2-safe replication) specifically to distinguish it. The other is: http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf ...which doesn't specifically examine the issue but seems to take the opposite position, namely that the server on which the transaction is executed needs to wait only for one server to apply the changes to the database (the others need only to know that they need to commit it; they don't actually need to have done it). However, that same paper refers to two-phase commit as a synchronous replication algorithm, and Wikipedia's discussion of two-phase commit: http://en.wikipedia.org/wiki/Two-phase_commit_protocol ...clearly implies that the transaction must be applied everywhere before it can be said to have committed. The second page of Google results is mostly a further discussion of the MySQL solution, which is mostly described as "semi-synchronous replication". Simon Riggs said upthread that Oracle calls this "synchronous redo transport". That is obviously much closer to what we are doing than "synchronous replication". ...Robert
On Sat, 2008-12-13 at 21:35 -0500, Robert Haas wrote: > On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Robert Haas" <robertmhaas@gmail.com> writes: > >> I think we need to reserve the term "synchronous replication" for a > >> system where transactions that begin at the same time on the primary > >> and standby see the same tuples. Clearly that is "more" synchronous > > > > We won't call it anything, because we never will or can implement that. > > See the theory of relativity: the notion of exactly simultaneous events > > OK, fine. I'll be more precise. I think we need to reserve the term > "synchronous replication" for a system where transactions that begin > on the standby after the transaction has committed on the master see > the effects of the committed transaction. > If it's guaranteed to be visible on the standby after it's committed on the master, and you don't have any way to make it actually simultaneous, then that implies that it's visible on the slave for some brief period of time before it's committed on the master. That situation is still asymmetric, so why is that a better use of the term "synchronous"? Regards, Jeff Davis
> If it's guaranteed to be visible on the standby after it's committed on > the master, and you don't have any way to make it actually simultaneous, > then that implies that it's visible on the slave for some brief period > of time before it's committed on the master. > > That situation is still asymmetric, so why is that a better use of the > term "synchronous"? Because that happens anyway. If I request a commit on a single, unreplicated server, the server makes the commit visible to new transactions and then sends me a message informing me that the commit has completed. Since the message takes some finite time to reach me, there is a window of time after the commit has completed and before I know that the commit has been completed. Suppose for the sake of argument that the single, unreplicated server did these two tasks in the opposite order - namely, first, it sent a message to the process requesting the commit stating that the commit had completed, and only then made the transaction visible. This would create a race condition: the process requesting the commit might receive the acknowledgment and begin a new transaction before the previous transaction had been made visible, and would therefore not be able to see the results of its own previous actions. I think it's fair to say that this behavior would be judged totally intolerable. Therefore, there can't possibly be any applications out there which are depending on the fact that commits don't become visible until they are acknowledged, but there very well could be some applications which depend on the fact that once commits are acknowledged, they are visible. If replication is synchronous in this sense, then I can open a connection to the master, write some data, close the connection, open a new connection to the master or the slave (not caring which), and read back the data that I just wrote (assuming no one else has modified it in the meantime). If it isn't, then I can't. Some people will not care about this, but some will. The point here is that synchronous replication, at least to some people, is going to imply that the user-visible states of the two copies are consistent. To other people, it is going to imply that committed transactions will never be lost even in the event of a catastrophic loss of the primary 1 picosecond after the commit is acknowledged. We need to choose some word that implies that we are guaranteeing the latter of these two things but not the former. Otherwise, we will have confused users, and terminological confusion when and if we ever implement the former as well. ...Robert
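To put the read-back scenario in concrete terms, here is a minimal sketch. The table is hypothetical, and the behavior notes describe the receive/write vs apply variants under discussion, not anything the patch promises:

    -- Session A, connected to the primary:
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (42, 100);  -- hypothetical table
    COMMIT;  -- under sync rep, returns only after the standby has the WAL

    -- Session B, connected to the standby, opened after A's COMMIT returned:
    SELECT balance FROM accounts WHERE id = 42;
    -- Under "synchronous receive/write", this may find no row yet: the WAL
    -- is safely on the standby, but not necessarily applied (visible).
    -- Under "synchronous apply", the row must be visible here.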
> Might it not be true that anybody unfamiliar would be confused and that this > is a bit of a straw man? [...] > If my application assumes that it can commit to one server, and then read > back the commit from another server, and my application breaks as a result, > it's because I didn't understand the problem. Even if PostgreSQL didn't use > the word "synchronous replication", I could still be confused. I need to > understand the problem no matter what words are used. That is certainly true. But there is value in choosing words which elucidate the situation as much as possible. ...Robert
On Sat, 2008-12-13 at 22:23 -0500, Robert Haas wrote: > > If it's guaranteed to be visible on the standby after it's committed on > > the master, and you don't have any way to make it actually simultaneous, > > then that implies that it's visible on the slave for some brief period > > of time before it's committed on the master. > > > > That situation is still asymmetric, so why is that a better use of the > > term "synchronous"? > > Because that happens anyway. If I request a commit on a single, > unreplicated server, the server makes the commit visible to new > transactions and then sends me a message informing me that the commit > has completed. Since the message takes some finite time to reach me, > there is a window of time after the commit has completed and before I > know that the commit has been completed. > Oh, I see the distinction now. Thanks for the detailed reply. Regards, Jeff Davis
> The point here is that synchronous replication, at least to some > people, is going to imply that the user-visible states of the two > copies are consistent. To other people, it is going to imply that > committed transactions will never be lost even in the event of a > catastrophic loss of the primary 1 picosecond after the commit is > acknowledged. We need to choose some word that implies that we are > guaranteeing the latter of these two things but not the former. > Otherwise, we will have confused users, and terminological confusion > when and if we ever implement the former as well. Right. Before watching this thread, I had thought that the log shipping sync replication behaved like the former (and I had said so to people in Japan who are interested in 8.4 development; of course, this is my fault). Now I understand the log shipping sync replication does not behave the same as other "sync replications" such as pgpool and PGCluster (there may be more, but I don't know of them). -- Tatsuo Ishii SRA OSS, Inc. Japan
> The point here is that synchronous replication, at least to some > people, is going to imply that the user-visible states of the two > copies are consistent. To other people, it is going to imply that > committed transactions will never be lost even in the event of a > catastrophic loss of the primary 1 picosecond after the commit is > acknowledged. We need to choose some word that implies that we are > guaranteeing the latter of these two things but not the former. > Otherwise, we will have confused users, and terminological confusion > when and if we ever implement the former as well. With apologies for replying to my own post: It's also important to understand that these two invariants are completely separate and it is possible to guarantee either without the other. If you want (1), the standby needs to apply the WAL before sending an acknowledgment to the primary but does not necessarily need to write it to disk (of course, it will have to be written to disk before the modified buffers are written to disk, but that's a separate issue). If you want (2), the standby needs to write the WAL to disk before sending the acknowledgment but does not necessarily need to apply it. If you want both, then you need to wait for both (and it's worth noting that your performance will probably be nothing to write home about). I also did some research on terminology that has been used in the literature. As Jim Gray describes it:

1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage.

Group-safe replication. Transaction is committed when WAL has been received by all remote servers, but not necessarily written to durable storage.

Group-safe & 1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage and WAL has been received by all remote servers.

2-safe replication. Transaction is committed when it has been written to durable storage on both local and remote servers.

Very safe replication. As 2-safe, but fails any read-write transaction if the secondary is down.

(Actually, it appears that "Transaction Processing" by Jim Gray and Andreas Reuter, 1993, uses 2-safe to refer to either 2-safe or group-safe; the distinction between the two is a subsequent development. See e.g. Advances in Database Technology - EDBT 2004 by Elisa Bertino.) The term of art for making sure that transactions committed on the primary are visible on the secondary seems to be "one-copy serializability" (see, for example, a Google Books search on that term). ...Robert
Robert Haas wrote:
> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We won't call it anything, because we never will or can implement that.
>> See the theory of relativity: the notion of exactly simultaneous events
> OK, fine. I'll be more precise. I think we need to reserve the term
> "synchronous replication" for a system where transactions that begin
> on the standby after the transaction has committed on the master see
> the effects of the committed transaction.

Wouldn't this be serialized transactions?

I'd like to see proof of some sort that PostgreSQL guarantees that the instant a 'commit' returns, any transactions already open with the appropriate transaction isolation level, or any new sessions, *will* see the results of the commit.

I know that most of the time this happens - but what process synchronization steps occur to *guarantee* that this happens?

> I just googled "synchronous replication" and read through the first
> page of hits. Most of them do not address the question of whether
> synchronous replication can be said to have completed when WAL has been
> received by the standby but not yet applied. One of the ones that does is:
> http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign
> ...which refers to what we're proposing to call "Synchronous
> Replication" as "Semi-Synchronous Replication" (or 2-safe replication)
> specifically to distinguish it. The other is:
> http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf
> ...which doesn't specifically examine the issue but seems to take the
> opposite position, namely that the server on which the transaction is
> executed needs to wait only for one server to apply the changes to the
> database (the others need only to know that they need to commit it;
> they don't actually need to have done it). However, that same paper
> refers to two-phase commit as a synchronous replication algorithm, and
> Wikipedia's discussion of two-phase commit:
> http://en.wikipedia.org/wiki/Two-phase_commit_protocol
> ...clearly implies that the transaction must be applied everywhere
> before it can be said to have committed.
> The second page of Google results is mostly a further discussion of
> the MySQL solution, which is mostly described as "semi-synchronous
> replication". Simon Riggs said upthread that Oracle calls this
> "synchronous redo transport". That is obviously much closer to what we
> are doing than "synchronous replication".

Two phase commit doesn't imply that the transaction is guaranteed to be immediately visible. See my previous paragraph.
Unless transactions are locked from starting until they are able to prove that they have the latest commit (a feat which I'm going to theorize is impossible - because the moment you wait for a commit and begin again, you really have no guarantee that another commit has not occurred in the meantime), I think it's clear that two-phase commit guarantees that the commit has taken place, but does *not* guarantee anything about visibility.

It might be a good bet - but a guarantee? There is no such guarantee.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Hi all,

I just wanted to point out a detail that I have not seen mentioned in this thread (but I might have skipped some messages and I apologize in advance if this is a duplicate).

What the application is going to see is a failure when the postmaster it is connected to is going down. If this happens at commit time, I think that there is no guarantee for the application to know what happened:

1. failure occurred before the request reached the postmaster: no instance committed
2. failure occurred during commit: might be committed on either node
3. failure occurred while sending back the ack of the commit to the client: both instances have committed

But for the client, it will all look the same: an error on commit. This is just to point out that despite all your efforts, the client might think that some transactions have failed (error on commit) when they are actually committed.

If you don't put some state in the driver that is able to check at failover time whether the commit operation succeeded or not, it does not really matter what happens to in-flight transactions (or in-commit transactions) at failure time. In all cases, a manual inspection of the database logs will be required. Actually, if there were a way to query the database about the status of a particular transaction by providing a cluster-wide unique id, that would help a lot.

I wrote a paper on the issues with database replication at SIGMOD earlier this year (http://infoscience.epfl.ch/record/129042). Even though it was targeted at middleware replication, I think that some of it is still relevant for the problem at hand.

Regarding the wording, if experts can't agree, you can be sure that users won't either. Most of them don't have a clue about the different flavors of replication. So as long as you state clearly how it behaves and define all the terms you use, that should be fine.

manu

-- 
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet
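Manu's "state in the driver" idea can be sketched as follows. This is a client-side illustration only, assuming psycopg2 and a hypothetical txn_log marker table; nothing like this exists in the patch, and the cluster-wide status query he wishes for is emulated here by reading back the marker after reconnecting.

import uuid
import psycopg2

def commit_with_check(dsn, work_sql):
    txn_id = str(uuid.uuid4())
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(work_sql)
        # Written inside the same transaction, so the marker is visible
        # if and only if the transaction actually committed.
        cur.execute("INSERT INTO txn_log (id) VALUES (%s)", (txn_id,))
    try:
        conn.commit()
        return True
    except psycopg2.OperationalError:
        # The connection died during COMMIT: the outcome is unknown.
        # Reconnect (possibly to a promoted standby) and look for the marker.
        conn2 = psycopg2.connect(dsn)
        with conn2.cursor() as cur:
            cur.execute("SELECT 1 FROM txn_log WHERE id = %s", (txn_id,))
            return cur.fetchone() is not None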
Robert Haas wrote:
> The term of art for making sure that transactions committed on the
> primary are visible on the secondary seems to be "one-copy
> serializability" (see, for example, a Google Books search on that
> term).

Not exactly. 1-copy serializability, which is the standard for multi-master solutions, guarantees that transactions are executed in the same serializable order at each replica (which means that transactions can be executed in a different order and committed at different times on different replicas, as long as a consistent serializable view is presented to the client). There are a number of optimizations in that area, but in a multi-master case, replicas rarely commit at the same time. There are interesting papers on the subject (like Tashkent & Tashkent+, based on Postgres) for those who want to understand these problems more thoroughly.

Hope this helps,
manu

-- 
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet
On Sun, 2008-12-14 at 13:31 +0900, Tatsuo Ishii wrote:
> > The point here is that synchronous replication, at least to some
> > people, is going to imply that the user-visible states of the two
> > copies are consistent. To other people, it is going to imply that
> > committed transactions will never be lost even in the event of a
> > catastrophic loss of the primary 1 picosecond after the commit is
> > acknowledged. We need to choose some word that implies that we are
> > guaranteeing the latter of these two things but not the former.
> > Otherwise, we will have confused users, and terminological confusion
> > when and if we ever implement the former as well.
>
> Right. Before watching this thread, I had thought that the log
> shipping sync replication behaves like the former (and I had told so
> to people in Japan who are interested in 8.4 development. Of course
> this is my fault, though).
>
> Now I understand the log shipping sync replication does not behave
> the same as other "sync replications" such as pgpool and PGCluster
> (there may be more, but I don't know)

GENERAL COMMENTS, not to anybody in particular:

'Tis but thy name that is my enemy. ...
What's in a name? That which we call a rose
By any other name would smell as sweet. ...
Juliet, from "Romeo and Juliet"

I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do* (this being a software discussion list...). AFAICS we can make the software behave like *any* of the definitions discussed so far.

It is certainly far too early to say what the final exact behaviour will be, and there is no reason at all to pre-suppose that it need only be a single behaviour. I'm in favour of options, generally, but I would say that the distinction between some of these options is mostly very fine and strongly doubt whether people would use them if they existed. *But* I think we can add them at a later stage of development if requirements genuinely exist once all the benefits *and* costs are understood.

I would also point out that the distinction made between various meanings of synchronous is *only* important if Hot Standby is included as well. And that is closely linked to the replication feature, which we really need to complete first. We have much to do yet. So let's please end the name debate there and think about software. ...

We can make the reply to a commit message when any of the following events have occurred:

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Now you might think from what people have said that having synchronised contents on both primary and standby is the only way to achieve exactly the same results to queries on both nodes. Another way is to utilise a snapshot taken on the primary and simply wait until the standby catches up with that snapshot's LSN. So there is more than one way of achieving a particular result, and it is not dependent upon the exact synchronisation we employ at commit time.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
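As a toy illustration of Simon's list (not walreceiver code), here is where a standby could send its acknowledgment for each reply point; the function arguments are stand-ins. Point 1 happens on the primary, which would reply as soon as its send() returns.

import os
import zlib

def receive_commit(sock_recv, wal_file, ack, apply_record, expected_crc, reply_at):
    record = sock_recv()                        # 2. received on standby
    if reply_at == 2:
        ack()
    wal_file.write(record)                      # 3. written to the WAL file
    if reply_at == 3:
        ack()
    wal_file.flush()
    os.fsync(wal_file.fileno())                 # 4. fsync'd
    if reply_at == 4:
        ack()
    assert zlib.crc32(record) == expected_crc   # 5. CRC checked
    if reply_at == 5:
        ack()
    apply_record(record)                        # 6. applied
    if reply_at == 6:
        ack()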
Simon Riggs wrote: > I am truly lost to understand why the *name* "synchronous replication" > causes so much discussion, yet nobody has discussed what they would > actually like the software to *do* (this being a software discussion > list...). AFAICS we can make the software behave like *any* of the > definitions discussed so far. > I think people have talked about 'like' in the context of user expectations. That is, there seems to exist a set of people (probably those who've never worked with a multi-replica solution before) who expect that once commit completes on one server, they can query any other master or slave and be guaranteed visibility of the transaction they just committed. These people may theoretically change their decision to not use Postgres-R, or at least change their approach to how they work with Postgres-R, if the name was in some way more intuitive to them in terms of what is actually being provided. "Synchronous replication" itself says only details about replication, it does not say anything about visibility, so to some degree, people are focusing on the wrong term as the problem. Even if it says "asynchronous replication" - not sure that I care either way - this doesn't improve the understanding for the casual user of what is happening behind the scenes. Neither synchronous nor asynchronous guarantees that the change will be immediately visible from other nodes after I type 'commit;'. Asynchronous might err on the side of not immediately visible, where synchronous might (incorrectly) imply immediate visibility, but it's not an accurate guarantee to provide. Synchronous does not guarantee visibility immediately after. Some indefinite but usually short time must normally pass from when my 'commit;' completes until when the shared memory visible to my process "sees" the transaction. Multiple replicas with network latency or reliability issues increases the theoretical minimum size of this window to something that would be normally encountered as opposed to something that is normally not encountered. The only way to guarantee visibility is to ensure that the new transaction is guaranteed to be visible from a shared memory perspective on every machine in the pool, and every active backend process. If my 'commit;' is going to wait for this to occur, first, I think this forces every commit to have numerous network round trips to each machine in the pool, it forces each machine in the pool to be network accessible and responsive, it forces all commits to be serialized in the sense of "the slowest machine in the pool determines the time for my commit to complete", and I think it implies some sort of inter-process signalling, or at the very least CPU level signalling about shared memory (in the case of multiple CPUs). People such as myself think that a visibility guarantee is unreasonable and certain to cause scalability or reliability problems. So, my 'like' is an efficient multi-master solution where if I put 10 machines in the pool, I expect my normal query/commit loads to approach 10X as fast. My like prefers scalability over guarantees that may be difficult to provide, and probably are not provided today even in a single server scenario. > It is certainly far too early to say what the final exact behaviour will > be and there is no reason at all to pre-suppose that it need only be a > single behaviour. 
> I'm in favour of options, generally, but I would say that the
> distinction between some of these options is mostly very fine and
> strongly doubt whether people would use them if they existed. *But* I
> think we can add them at a later stage of development if requirements
> genuinely exist once all the benefits *and* costs are understood.

The above 'commit;' behaviour difference - whether it completes when the commit is permanent (it definitely will be applied for certain to all replicas - it just may take time to apply to all replicas), or when the commit has actually taken effect (two-phase commit on all replicas - and both phases have completed on all replicas - what happens if the second phase of the commit fails on one or more servers?), or when the commit is guaranteed to be visible from all existing and new sessions (two-phase commit plus additional signalling required?) might be such an option. I'm doubtful, though - as the difference in implementation between the first and second is pretty significant.

I'm curious about your suggestion to direct queries that need the latest snapshot to the 'primary'. I might have misunderstood it - but it seems that the expectation from some is that *all* sessions see the latest snapshot, so would this not imply that all sessions would be redirected to the 'primary'? I don't think it is reasonable myself, but I might be misunderstanding something...

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Also:

0. The same time we would have done so if replication had not been configured at all.

I think the basic problem here is that we can talk about "asynchronous replication" and "synchronous replication", but there are n>2 possible/useful behaviors (I would guess principally 0, 2, 4, and 6, but YMMV). So we're going to need some way to clarify what we mean.

BTW, in case my previous emails on this topic might have given someone the contrary impression, I'm not really that worked up about this either. Interesting? Yes. Have opinions? Yes. Lie awake nights worrying about it? Nope. :-)

...Robert
Mark Mielke wrote:
> Forget replication - even for the exact same server - I don't expect
> that if I commit from one session, I will be able to see the change
> immediately from my other session or a new session that I just opened.
> Perhaps this is often stable to rely on this, and it is useful for the
> database server to minimize the window during which the commit becomes
> visible to others, but I think it's a false expectation from the start
> that it absolutely will be immediately visible to another session. I'm
> thinking of situations where some part of the table is in cache. The
> only way the commit can communicate that the new transaction is
> available is by during communication between the processes or threads,
> or between the multiple CPUs on the machine. Do I want every commit to
> force each session to become fully in alignment before my commit
> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
> if you look far enough into the guts. It might be very fast - I don't
> think it is infinitely fast.

FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)? Or is the machine and algorithms just fast enough that by the time it executes the query (up to 1 ms later) the commit is always visible in practice?

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> Mark Mielke wrote:
>> Forget replication - even for the exact same server - I don't expect
>> that if I commit from one session, I will be able to see the change
>> immediately from my other session or a new session that I just opened.
>> Perhaps this is often stable to rely on this, and it is useful for the
>> database server to minimize the window during which the commit becomes
>> visible to others, but I think it's a false expectation from the start
>> that it absolutely will be immediately visible to another session. I'm
>> thinking of situations where some part of the table is in cache. The
>> only way the commit can communicate that the new transaction is
>> available is by during communication between the processes or threads,
>> or between the multiple CPUs on the machine. Do I want every commit to
>> force each session to become fully in alignment before my commit
>> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
>> if you look far enough into the guts. It might be very fast - I don't
>> think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on my
> dual-core CPU seem to be able to see the latest commits before they
> begin executing. Am I wrong about this? Does PostgreSQL provide an
> intentional guarantee that a commit from one session that completes
> immediately followed by a query from another session will always find
> the commit effect visible (provided the transaction isolation level
> doesn't get in the way)?

Yes. PostgreSQL does guarantee that, and I would expect any other DBMS to do the same.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hi,

On 14 Dec 2008, at 16:48, Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do* (this being a software discussion
> list...). AFAICS we can make the software behave like *any* of the
> definitions discussed so far.

It seems that the easy parts are the ones more people will participate in. Maybe it's that simple.

> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Ok, so let's talk about this easy part: my understanding of "synchronous replication" is that it gives its users the strong guarantee that at commit time the transaction is secured on the slave(s). That means you get the D of ACID on more than one server. Why synchronous? Because you know the durability is ensured exactly when you receive the COMMIT ack.

So I'm with Simon on this: the term Synchronous Replication does describe accurately what's being implemented here, and on the other hand, as so many of us are saying, it's true that it tells very little about it. Those 6 options are all in the scope of the infamous naming, just different guarantee levels, from almost strong to very strong, with some "almost, but not quite, entirely unlike the strong I want". Pick your naming here too.

At least, that's how I'm understanding this; the bottom line of why I care to send this email is that maybe it'll help some people to recover from sleep deprivation ;)

My 2¢,
-- 
dim
Robert Haas wrote:
>> We can make the reply to a commit message when any of the following
>> events have occurred
>>
>> 1. We sent the message to standby
>> 2. We received the message on standby
>> 3. We wrote the WAL to the WAL file
>> 4. We fsync'd the WAL file
>> 5. We CRC checked the WAL commit record
>> 6. We applied the WAL commit record

Perhaps it'd be useful if the failure modes these are trying to protect against were described too. If I understand right:

1. Protects all the transactions from the failure of the master; so long as neither the network nor the slave machine die soon?
2. Protects all the transactions from the failure of the master and the network between the slave and master, so long as the slave doesn't die soon?
3. Same as #2?
4. Protects against the failure of the master, the network, and parts of the slave; so long as the slave's disk survives the failure?
5. Protects against all of the above, and bit-errors in the memories of the slave machine (except the slave's disk controller?)? Or are we reading back the CRC from the slave's disk and comparing to the CRC computed on the master, where it might protect from even more?
6. Same as #4?

If this is right, #2, #3, #4, and #6 feel similar except that they're protecting against failures of different (but still all incomplete) subsets of the hardware on the slave, right?
Heikki Linnakangas wrote:
> Mark Mielke wrote:
>> FYI: I haven't been able to prove this. Multiple sessions running on
>> my dual-core CPU seem to be able to see the latest commits before
>> they begin executing. Am I wrong about this? Does PostgreSQL provide
>> an intentional guarantee that a commit from one session that completes
>> immediately followed by a query from another session will always find
>> the commit effect visible (provided the transaction isolation level
>> doesn't get in the way)?
> Yes. PostgreSQL does guarantee that, and I would expect any other DBMS
> to do the same.

Where does the expectation come from? I don't recall ever reading it in the documentation, and unless the session processes are contending over the integers (using some sort of synchronization primitive) in memory that represent the "latest visible commit" on every single select, I'm wondering how it is accomplished? If they are contending over these integers, doesn't that represent a scaling limitation, in the sense that on a 32-core machine, they're going to be fighting with each other to get the latest version of these shared integers into the CPU for processing? Maybe it's such a small penalty that we don't care? :-)

I was never instilled with the logic that 'commit in one session guarantees visibility of the effects in another session'. But, as I say above, I wasn't able to make PostgreSQL "fail" in this regard. So maybe I have no clue what I am talking about? :-)

If you happen to know where the code or documentation makes this promise, feel free to point it out. I'd like to review the code. If you don't know - don't worry about it, I'll find it later...

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
When the database says the data is committed it has to mean the data is really committed. Imagine if you looked at a bank account balance after withdrawing all the money and saw a balance which didn't reflect the withdrawal and allowed you to withdraw more money again...

-- 
Greg

On 14 Dec 2008, at 14:44, Mark Mielke <mark@mark.mielke.cc> wrote:
> Mark Mielke wrote:
>> Forget replication - even for the exact same server - I don't expect
>> that if I commit from one session, I will be able to see the change
>> immediately from my other session or a new session that I just
>> opened. Perhaps this is often stable to rely on this, and it is
>> useful for the database server to minimize the window during which
>> the commit becomes visible to others, but I think it's a false
>> expectation from the start that it absolutely will be immediately
>> visible to another session. I'm thinking of situations where some
>> part of the table is in cache. The only way the commit can
>> communicate that the new transaction is available is by during
>> communication between the processes or threads, or between the
>> multiple CPUs on the machine. Do I want every commit to force each
>> session to become fully in alignment before my commit completes?
>> Does PostgreSQL make this guarantee today? I bet it doesn't if you
>> look far enough into the guts. It might be very fast - I don't
>> think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on
> my dual-core CPU seem to be able to see the latest commits before
> they begin executing. Am I wrong about this? Does PostgreSQL provide
> an intentional guarantee that a commit from one session that
> completes immediately followed by a query from another session will
> always find the commit effect visible (provided the transaction
> isolation level doesn't get in the way)? Or is the machine and
> algorithms just fast enough that by the time it executes the query
> (up to 1 ms later) the commit is always visible in practice?
>
> Cheers,
> mark
> If this is right, #2, #3, #4, and #6 feel similar except
> that they're protecting against failures of different (but
> still all incomplete) subsets of the hardware on the slave, right?

Right. Actually, the biggest difference with #6 has nothing to do with protecting against failures. It has rather to do with the ease of writing applications in the context of hot standby. You can close your connection, open a connection to a different server, and know that your transactions will be reflected there. On the other hand, I'd be surprised if it didn't come with a substantial performance penalty, so it may not be too practical in real life even if it sounds good on paper.

#1, #3, and #5 don't feel that useful to me. In the case of #1, sending your WAL over the network and then not checking that it got there is sort of silly: the likelihood of packet loss on the network has got to be several orders of magnitude more likely than a failure on the master. #3 and #5 just don't seem to provide any real benefits over their immediate predecessors.

Honestly, I think the most useful thing is probably going to be asynchronous replication: in other words, when a commit is requested on the master, we write WAL and return success. In the background, we stream the WAL to a secondary, which writes it and applies it. This will give us a secondary which is mostly up to date (and can run queries, with hot standby) without killing performance. The other options are going to be for environments where losing a transaction is really, really bad, or (in the case of #6) read-mostly environments where it's useful to spread the query load out across several servers, but the overhead associated with waiting for the rare write transactions to apply everywhere is tolerable.

...Robert
Greg Stark wrote:
> When the database says the data is committed it has to mean the data
> is really committed. Imagine if you looked at a bank account balance
> after withdrawing all the money and saw a balance which didn't reflect
> the withdrawal and allowed you to withdraw more money again...

Within the same session - sure. From different sessions? PostgreSQL MVCC lets you see an older snapshot, although it does prefer to have the latest snapshot with each command.

For allowing to withdraw more money again, I would expect some sort of locking "SELECT ... FOR UPDATE;" to be used. This lock then forces the two transactions to become serialized and the second will either wait for the first to complete or fail. Any banking program that assumed that it could SELECT to confirm a balance and then UPDATE to withdraw the money as separate instructions would be a bad banking program. To exploit it, I would just have to start both operations at the same time - they both SELECT, they both see I have money, they both give me the money and UPDATE, and I get double the money (although my balance would show a big negative value - but I'm already gone...). Database 101.

When I asked for "does PostgreSQL guarantee this?" I didn't mean hand waving examples or hand waving expectations. I meant a pointer into the code that has some comment that says "we want to guarantee that a commit in one session will be immediately visible to other sessions, and that a later select issued in the other sessions will ALWAYS see the commit whether 1 nanosecond later or 200 seconds later" Robert's expectation and yours seem like taking this "guarantee" for granted rather than being justified with design intent and proof thus far. :-) Given my experiment to try and force it to fail, I can see why this would be taken for granted. Is this a real promise, though? Or just an unlikely scenario that never seems to be hit?

To me, the question is relevant in terms of the expectations of a multi-replica solution. We know people have the expectation. We know it can be convenient. Is the expectation valid in the first place?

I've probably drawn this question out too long and should do my own research and report back... Sorry... :-)

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> When I asked for "does PostgreSQL guarantee this?" I didn't mean hand
> waving examples or hand waving expectations. I meant a pointer into the
> code that has some comment that says "we want to guarantee that a commit
> in one session will be immediately visible to other sessions, and that a
> later select issued in the other sessions will ALWAYS see the commit
> whether 1 nanosecond later or 200 seconds later" Robert's expectation
> and yours seem like taking this "guarantee" for granted rather than
> being justified with design intent and proof thus far. :-) Given my
> experiment to try and force it to fail, I can see why this would be
> taken for granted. Is this a real promise, though?

Yes. In a nutshell, commit works like this:

1. Write and flush WAL record about the commit
2. Mark the transaction as committed in clog
3. Remove the xid from the shared memory ProcArray
4. Release locks and other resources
5. Reply to client that the transaction has been committed

After step 3, any backend taking a snapshot will see the transaction as committed. Since we only reply to the client at step 5, it is guaranteed that a transaction beginning after step 5, as well as an already open transaction taking a new snapshot (ie. running a new command in read committed mode) after that, will see the transaction as committed. The relevant code is in CommitTransaction() in xact.c.

> To me, the question is relevant in terms of the expectations of a
> multi-replica solution. We know people have the expectation.

Yeah, I think Robert is right. We should reserve the term "synchronous replication" for the mode where that guarantee holds for the slave as well.

In fact, waiting for reply from standby server before acknowledging a commit to the client is a bit pointless otherwise. It puts you in a strange situation, where you're waiting for the commits in normal operation, but if there's a network glitch or the standby goes down, you're willing to go ahead without it. You get a high guarantee that your data is up-to-date in the standby, except when it isn't. Which isn't much of a guarantee.

But with hot standby, it makes a lot of sense. The guarantee is that if the standby is accepting queries, it's up-to-date with the primary.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
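Heikki's ordering can be mirrored in a few lines of Python. This is only the shape of the argument, with stand-in structures; the real logic lives in CommitTransaction() in xact.c.

import threading

clog = {}                          # xid -> "committed" / "aborted"
proc_array = set()                 # xids currently in progress
proc_array_lock = threading.Lock()

def commit(xid, flush_wal, release_resources, reply_to_client):
    flush_wal(xid)                 # 1. commit record durable on disk
    clog[xid] = "committed"        # 2. marked committed in clog
    with proc_array_lock:
        proc_array.discard(xid)    # 3. no longer "running" to new snapshots
    release_resources(xid)         # 4. locks etc.
    reply_to_client(xid)           # 5. client learns of the commit

# Any snapshot taken after step 3 omits xid from its running set, so a
# query issued after the client has seen step 5 must see xid as committed.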
Mark Mielke wrote:
> Where does the expectation come from? I don't recall ever reading it in
> the documentation, and unless the session processes are contending over
> the integers (using some sort of synchronization primitive) in memory
> that represent the "latest visible commit" on every single select, I'm
> wondering how it is accomplished?

The "integers" you're imagining are the ProcArray. Every backend has an entry there, and among other things it contains the current XID the backend is running. When a backend takes a new snapshot (on every single select in read committed mode), it locks the ProcArray, scans all the entries and collects all the XIDs listed there in the snapshot. Those are the set of transactions that were running when the snapshot was taken, and that set is used in the visibility checks.

> If they are contending over these integers, doesn't that represent a
> scaling limitation, in the sense that on a 32-core machine, they're
> going to be fighting with each other to get the latest version of these
> shared integers into the CPU for processing? Maybe it's such a small
> penalty that we don't care? :-)

The ProcArrayLock is indeed quite busy on systems with a lot of CPUs. It's held for such short times that it's not a problem usually, but it can become a bottleneck with a machine like that with all backends running small transactions.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
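The snapshot side of that dance looks roughly like the sketch below; again invented names for illustration, not the ProcArray code itself.

import threading

proc_array = {}                    # backend id -> xid it is running, or None
proc_array_lock = threading.Lock()

def get_snapshot():
    with proc_array_lock:          # the lock Heikki describes as contended
        return {xid for xid in proc_array.values() if xid is not None}

def is_visible(xid, snapshot, clog):
    # A tuple written by xid stays invisible to any snapshot whose running
    # set contains xid, even after that transaction commits.
    return xid not in snapshot and clog.get(xid) == "committed"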
On Sun, 2008-12-14 at 12:57 -0500, Mark Mielke wrote:
> I'm curious about your suggestion to direct queries that need the
> latest snapshot to the 'primary'. I might have misunderstood it - but
> it seems that the expectation from some is that *all* sessions see the
> latest snapshot, so would this not imply that all sessions would be
> redirected to the 'primary'? I don't think it is reasonable myself,
> but I might be misunderstanding something...

I said "a snapshot taken on the primary", but the query would run on the standby.

Synchronising primary and standby so that they are identical from the perspective of a query requires some synchronisation delay. I'm pointing out that the synchronisation delay can occur

* at the time we apply WAL - which will slow down commits (i.e. #6 on my previous list of options)
* at the time we run a query that needs to see primary and standby synchronised

So the same effect can be achieved in various ways. The first way would require *all* transactions to be applied on standby, i.e. option #6 for all transactions. That is a performance disaster and I would not force that onto everybody. The second way can be done by taking a snapshot on the primary, with an associated LSN, then using that snapshot on the standby. That is somewhat complex, but possible.

I see the requirement for getting the same answer on multiple nodes as a further extension of "transaction isolation mode" and think that not all people will want this, so we should allow that as an option. I'm not going to worry about this at the moment. Hot standby will be useful without this and so I regard this as a secondary objective. Rome wasn't built in a single release, or something like that.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
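The second way Simon describes might be sketched like this, with every method an assumption made for illustration rather than a proposed API: export a snapshot and its LSN on the primary, then delay the standby query until replay has passed that LSN.

import time

def run_with_primary_snapshot(primary, standby, query):
    snapshot, lsn = primary.export_snapshot_with_lsn()
    while standby.last_replayed_lsn() < lsn:
        time.sleep(0.01)   # the synchronisation delay, paid by the reader
    return standby.run(query, snapshot=snapshot)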
On Sun, 2008-12-14 at 21:41 -0500, Robert Haas wrote:
> > If this is right, #2, #3, #4, and #6 feel similar except
> > that they're protecting against failures of different (but
> > still all incomplete) subsets of the hardware on the slave, right?
>
> Right. Actually, the biggest difference with #6 has nothing to do
> with protecting against failures. It has rather to do with the ease
> of writing applications in the context of hot standby. You can close
> your connection, open a connection to a different server, and know
> that your transactions will be reflected there. On the other hand,
> I'd be surprised if it didn't come with a substantial performance
> penalty, so it may not be too practical in real life even if it
> sounds good on paper.
>
> #1, #3, and #5 don't feel that useful to me.

Yes, looks that way for me also. Good analysis, Ron. I agree with Robert that #6 is there for other reasons.

#2 corresponds to DRBD algorithm B
#4 corresponds to DRBD algorithm C

Fujii-san, please can we incorporate those two options, rather than just one choice "synchronous_replication = on". They look like two commonly requested options.

#6 is an additional synchronization step in Hot Standby. I would say that people won't want that when they see how it performs (they probably won't want #4 either for that same reason, but that is for robustness).

Also, I would point out that the class of synch_rep is selected by the user on the primary and can vary from transaction to transaction. That is a very good thing, as far as I am concerned. We would need to enforce #6 for all transactions (if we implemented synchronisation in this way).

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do*

It's the color of the bikeshed ...

> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

In DRBD tradition, I suggest you implement all of them, or at least factor the code so that each of them can be a one-line change. (We can probably later drop one or two options.)
> In fact, waiting for reply from standby server before acknowledging a commit > to the client is a bit pointless otherwise. It puts you in a strange > situation, where you're waiting for the commits in normal operation, but if > there's a network glitch or the standby goes down, you're willing to go > ahead without it. You get a high guarantee that your data is up-to-date in > the standby, except when it isn't. Which isn't much of a guarantee. It protects you against a catastrophic loss of the primary, which is a non-trivial consideration. At the risk of being ghoulish, imagine that you are a large financial company headquartered in the world trade center. ...Robert
It's a real promise. The reason you're getting hand-wavy answers is because it's such a basic requirement that I'm trying to point out just how fundamental a requirement it is.

If you want to see the actual code which guarantees this, take a look around the code for procarray - in particular the code for taking a snapshot. There are comments there about what locks are needed when committing and when taking a snapshot and why. But it's quite technical.

-- 
Greg

On 15 Dec 2008, at 02:03, Mark Mielke <mark@mark.mielke.cc> wrote:
> Greg Stark wrote:
>> When the database says the data is committed it has to mean the
>> data is really committed. Imagine if you looked at a bank account
>> balance after withdrawing all the money and saw a balance which
>> didn't reflect the withdrawal and allowed you to withdraw more
>> money again...
>
> Within the same session - sure. From different sessions? PostgreSQL
> MVCC lets you see an older snapshot, although it does prefer to
> have the latest snapshot with each command.
>
> For allowing to withdraw more money again, I would expect some sort
> of locking "SELECT ... FOR UPDATE;" to be used. This lock then
> forces the two transactions to become serialized and the second will
> either wait for the first to complete or fail. Any banking program
> that assumed that it could SELECT to confirm a balance and then
> UPDATE to withdraw the money as separate instructions would be a bad
> banking program. To exploit it, I would just have to start both
> operations at the same time - they both SELECT, they both see I have
> money, they both give me the money and UPDATE, and I get double the
> money (although my balance would show a big negative value - but I'm
> already gone...). Database 101.
>
> When I asked for "does PostgreSQL guarantee this?" I didn't mean
> hand waving examples or hand waving expectations. I meant a pointer
> into the code that has some comment that says "we want to guarantee
> that a commit in one session will be immediately visible to other
> sessions, and that a later select issued in the other sessions will
> ALWAYS see the commit whether 1 nanosecond later or 200 seconds
> later" Robert's expectation and yours seem like taking this
> "guarantee" for granted rather than being justified with design
> intent and proof thus far. :-) Given my experiment to try and force
> it to fail, I can see why this would be taken for granted. Is this a
> real promise, though? Or just an unlikely scenario that never seems
> to be hit?
>
> To me, the question is relevant in terms of the expectations of a
> multi-replica solution. We know people have the expectation. We know
> it can be convenient. Is the expectation valid in the first place?
>
> I've probably drawn this question out too long and should do my own
> research and report back... Sorry... :-)
>
> Cheers,
> mark
Robert Haas wrote:
>> In fact, waiting for reply from standby server before acknowledging a
>> commit to the client is a bit pointless otherwise. It puts you in a
>> strange situation, where you're waiting for the commits in normal
>> operation, but if there's a network glitch or the standby goes down,
>> you're willing to go ahead without it. You get a high guarantee that
>> your data is up-to-date in the standby, except when it isn't. Which
>> isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

So you'd want all commits to wait until the transaction is safely replicated in the standby. But if there's a network glitch, or the standby is restarted, you're happy to reply to the client that it's committed if it's only safely committed in the primary. Essentially, you wait for the reply as long as the standby responds within X seconds, but if it takes more than Y seconds, you don't wait. I know that people do that, but it seems counterintuitive to me. In that case, when the primary acks the transaction as committed, you only know that it's safely committed in the primary; it doesn't give any hard guarantee about the state in the standby.

But when you consider the possibility to use the standby for queries, the synchronous mode makes sense too. I'm not opposed to providing all the options, but the synchronous mode where we can guarantee that if you query the standby, you will see the effects of all transactions committed in the primary, makes the synchronous mode much more interesting. If you don't need that property, you're most likely more happy with asynchronous mode anyway.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
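The "wait, but not forever" behaviour being debated reduces to something like the sketch below; the names are invented, and a real implementation would live in the backend's commit path rather than in Python.

import threading

def wait_for_standby_ack(ack_event, timeout_s, warn):
    # Block the committing backend until the standby acknowledges,
    # or fall back to effectively-asynchronous operation on timeout.
    if ack_event.wait(timeout=timeout_s):
        return "synchronous"   # ack arrived in time: the normal case
    warn("standby did not acknowledge within timeout; committing locally only")
    return "degraded"          # reply to the client anyway

Here ack_event would be a threading.Event set by whatever thread reads the standby's acknowledgments.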
* Robert Haas <robertmhaas@gmail.com> [081215 07:32]:
> > In fact, waiting for reply from standby server before acknowledging a
> > commit to the client is a bit pointless otherwise. It puts you in a
> > strange situation, where you're waiting for the commits in normal
> > operation, but if there's a network glitch or the standby goes down,
> > you're willing to go ahead without it. You get a high guarantee that
> > your data is up-to-date in the standby, except when it isn't. Which
> > isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

This was exactly my original point - I want the transaction durably on the slave before the commit is acknowledged (to build as much local redundancy as I can), but I certainly *don't* want to lose the ability to use WAL archiving, because I ship my WAL off-site too...

The ability to have an extra local copy is good. But I'm certainly not going to want to give up my off-site backup/WAL for it...

a.

-- 
Aidan Van Dyk             Create like a god,
aidan@highrise.ca         command like a king,
http://www.highrise.ca/   work like a slave.
> So you'd want all commits to wait until the transaction is safely
> replicated in the standby. But if there's a network glitch, or the
> standby is restarted, you're happy to reply to the client that it's
> committed if it's only safely committed in the primary. Essentially,
> you wait for the reply as long as the standby responds within X
> seconds, but if it takes more than Y seconds, you don't wait. I know
> that people do that, but it seems counterintuitive to me. In that
> case, when the primary acks the transaction as committed, you only
> know that it's safely committed in the primary; it doesn't give any
> hard guarantee about the state in the standby.

I understand your point, but I think there's still a use case. The idea is that declaring the secondary dead is a rare event, and there's some mechanism by which you're enabled to page your network staff, and they hightail it into the office to fix the problem. It might not be the way that you want to run your system, but I don't think it's unreasonable for someone else to want it.

> But when you consider the possibility to use the standby for queries,
> the synchronous mode makes sense too. I'm not opposed to providing
> all the options, but the synchronous mode where we can guarantee that
> if you query the standby, you will see the effects of all transactions
> committed in the primary, makes the synchronous mode much more
> interesting. If you don't need that property, you're most likely more
> happy with asynchronous mode anyway.

I agree that asynchronous mode will be the right solution for a very large subset of our users.

...Robert
Fujii-san,

Just repeating this in case you lost this comment:

On Mon, 2008-12-15 at 09:40 +0000, Simon Riggs wrote:
> Fujii-san, please can we incorporate those two options, rather than
> just one choice "synchronous_replication = on". They look like two
> commonly requested options.

I see the comment at line 230+ of walreceiver.c, so understand that you have implemented option #3 from the following list. So from my previous list:

1. We sent the message to standby (A)
2. We received the message on standby
3. We wrote the WAL to the WAL file (B)
4. We fsync'd the WAL file (C)
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Please could you also add an option #4, i.e. add the *option* to fsync the WAL to disk at commit time also. That requires us to add a third option to the synchronous_replication parameter. That then means we will have robustness options that map directly to DRBD algorithms A, B and C (shown in brackets in the above list). I believe these also map to the Data Guard options Maximum Performance and Maximum Availability.

AFAICS if we implement the additional items I've requested over the last few days, then the architecture is now at a good point for 8.4 and we can begin to look at low-level implementation details. Or put another way, I'm not expecting to come up with more architecture changes.

> #6 is an additional synchronization step in Hot Standby. I would say
> that people won't want that when they see how it performs (they
> probably won't want #4 either for that same reason, but that is for
> robustness).

We can jointly add option #6 once we have both sync rep and hot standby committed, or at a late stage of hot standby development. There's not much point looking at it before then.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 2008-12-15 at 09:19 -0500, Robert Haas wrote:
> I understand your point, but I think there's still a use case. The
> idea is that declaring the secondary dead is a rare event, and there's
> some mechanism by which you're enabled to page your network staff, and
> they hightail it into the office to fix the problem. It might not be
> the way that you want to run your system, but I don't think it's
> unreasonable for someone else to want it.

Agreed: there's an analogy to RAID here. When a disk goes out, it still allows writes, but moves to a degraded state. Hopefully your monitoring system notifies you, and you fix it.

Also, let's say that the standby suffers catastrophic storage failure. Now you only have your data on one server anyway (the primary). Rejecting new transactions from committing doesn't save all the old transactions in the event of a subsequent storage failure on the primary.

I'm not advocating this option in particular, other than saying that it seems like a reasonable option to me.

Regards,
Jeff Davis
Peter Eisentraut wrote:
> Simon Riggs wrote:
>> I am truly lost to understand why the *name* "synchronous replication"
>> causes so much discussion, yet nobody has discussed what they would
>> actually like the software to *do*
>
> It's the color of the bikeshed ...

Hmmm. I thought this was pretty clear. There's three levels of synch which are useful features:

1) "synchronous" standby which is really asynchronous, but only has a gap of < 100ms.

2) Synchronous standby which guarantees that all committed transactions are on the failover node and that no data will be lost for failover, but the failover node is still in standby mode.

3) Synchronous replication where the standby node has identical transactions to the master node, and is queryable read-only.

Any of these levels would be useful and allow a certain number of our users to deploy PostgreSQL in an environment where it wasn't used before. So if we can only do (2) for 8.4, that's still very useful for telecoms and banks.

--Josh
Josh Berkus wrote:
> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a
> gap of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed
> transactions are on the failover node and that no data will be lost
> for failover, but the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful....

Isn't the "queryable read-only" feature totally orthogonal with how synchronous the replication is?

For one reporting system I have, where new data is continually being added every second, I'd love to have a read-only slave even if that system has the "100ms" gap you mentioned in #1. Heck, I don't care if the queries it runs even have a 100 *minute* gap; but I sure would like it to be synchronous in the sense that all the transactions survive a failure of the primary.
> Isn't the "queryable read-only" feature totally orthogonal with > how synchronous the replication is? Yes. However, it introduces specific difficult issues which an unreadable synchronous slave does not have. --Josh
On Mon, 2008-12-15 at 13:43 -0800, Josh Berkus wrote:
> > Isn't the "queryable read-only" feature totally orthogonal with
> > how synchronous the replication is?
>
> Yes. However, it introduces specific difficult issues which an
> unreadable synchronous slave does not have.

Don't think it's hugely difficult, but there are multiple ways of doing this. But it is irrelevant until we have the basic ability to run queries.

I've explained this twice now on different parts of this thread. Could I politely direct your attention to those posts?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon, > I've explained this twice now on different parts of this thread. Could I > politely direct your attention to those posts? Chill. I was just explaining that the *goal* of sync standby was not complicated or really something to be argued about. It's pretty clear. --Josh
On Mon, 2008-12-15 at 13:06 -0800, Josh Berkus wrote:
> Peter Eisentraut wrote:
> > Simon Riggs wrote:
> >> I am truly lost to understand why the *name* "synchronous
> >> replication" causes so much discussion, yet nobody has discussed
> >> what they would actually like the software to *do*
> >
> > It's the color of the bikeshed ...
>
> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a
> gap of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed
> transactions are on the failover node and that no data will be lost
> for failover, but the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful and allow a certain number of our
> users to deploy PostgreSQL in an environment where it wasn't used
> before. So if we can only do (2) for 8.4, that's still very useful
> for telecoms and banks.

The (2) mentioned here could be any of sync points #2-5 referred to upthread. Different people have requested different levels of robustness. Looking at DRBD and Oracle, they both subdivide (2) into at least two further levels of option. So (2) is too broad a brush to paint with.

I don't believe that (2) as stated is sufficient for banks, though it is reasonable for many telco applications. But #4 or #5 would be suitable for banks, i.e. we must fsync to disk for very high value transactions. The extra code to do this is minor, which is why I've asked Fujii-san to include it now within the patch.

All of this is controllable by the parameter synchronous_replication, which it is important to note can be set for each individual transaction rather than just fixed for the whole server. This is identical to the way we can mix synchronous commit and asynchronous commit transactions.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

Sorry for this late reply. And, thanks for the hot discussion ;)

On Tue, Dec 16, 2008 at 1:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Fujii-san,
>
> Just repeating this in case you lost this comment:
>
> On Mon, 2008-12-15 at 09:40 +0000, Simon Riggs wrote:
>
>> Fujii-san, please can we incorporate those two options, rather than
>> just one choice "synchronous_replication = on". They look like two
>> commonly requested options.
>
> I see the comment at line 230+ of walreceiver.c, so understand that you
> have implemented option #3 from the following list.
>
> So from my previous list
>
> 1. We sent the message to standby (A)
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file (B)
> 4. We fsync'd the WAL file (C)
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record
>
> Please could you also add an option #4, i.e. add the *option* to fsync
> the WAL to disk at commit time also. That requires us to add a third
> option to the synchronous_replication parameter.

The above option should be configured on the primary? or standby? The primary is suitable to vary it from transaction to transaction. On the other hand, it should be configured on the standby in order to choose it for every standby (in the future).

I prefer the latter, and thought that it should be added into recovery.conf. I mean, synchronous_replication identifies only whether commit waits for replication (if the name is confusing, I would rename it). The above options (#1-#6) are chosen in recovery.conf. What is your opinion?

>> #6 is an additional synchronization step in Hot Standby. I would say
>> that people won't want that when they see how it performs (they
>> probably won't want #4 either for that same reason, but that is for
>> robustness).
>
> We can jointly add option #6 once we have both sync rep and hot standby
> committed, or at a late stage of hot standby development. There's not
> much point looking at it before then.

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:

> > So from my previous list
> >
> > 1. We sent the message to standby (A)
> > 2. We received the message on standby
> > 3. We wrote the WAL to the WAL file (B)
> > 4. We fsync'd the WAL file (C)
> > 5. We CRC checked the WAL commit record
> > 6. We applied the WAL commit record
> >
> > Please could you also add an option #4, i.e. add the *option* to fsync
> > the WAL to disk at commit time also. That requires us to add a third
> > option to the synchronous_replication parameter.
>
> The above option should be configured on the primary? or standby?
> The primary is suitable to vary it from transaction to transaction. On
> the other hand, it should be configured on the standby in order to
> choose it for every standby (in the future).
>
> I prefer the latter, and thought that it should be added into
> recovery.conf. I mean, synchronous_replication identifies only whether
> commit waits for replication (if the name is confusing, I would rename
> it). The above options (#1-#6) are chosen in recovery.conf. What is
> your opinion?

No, we've been through that loop already a few months back: Transaction-controlled robustness.

It should be up to the client on the primary to decide how much waiting they would like to perform in order to provide a guarantee. A change of setting on the standby should not be allowed to alter the performance or durability on the primary.

My perspective is that synchronous_replication specifies how long to wait. Current settings are "off" (don't wait) or "on" (meaning wait until point #3). So I think we should change this to a list of options to allow people to more carefully select how much waiting is required.

This feature is then analogous to the way synchronous_commit works. It also provides a level of application control not seen in any other RDBMS in the industry, which makes it very suitable for large and important applications that need a fine mix of robustness and performance.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Tue, Dec 16, 2008 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:
>
>> > So from my previous list
>> >
>> > 1. We sent the message to standby (A)
>> > 2. We received the message on standby
>> > 3. We wrote the WAL to the WAL file (B)
>> > 4. We fsync'd the WAL file (C)
>> > 5. We CRC checked the WAL commit record
>> > 6. We applied the WAL commit record
>> >
>> > Please could you also add an option #4, i.e. add the *option* to fsync
>> > the WAL to disk at commit time also. That requires us to add a third
>> > option to the synchronous_replication parameter.
>>
>> The above option should be configured on the primary? or standby?
>> The primary is suitable to vary it from transaction to transaction. On
>> the other hand, it should be configured on the standby in order to
>> choose it for every standby (in the future).
>>
>> I prefer the latter, and thought that it should be added into
>> recovery.conf. I mean, synchronous_replication identifies only whether
>> commit waits for replication (if the name is confusing, I would rename
>> it). The above options (#1-#6) are chosen in recovery.conf. What is
>> your opinion?
>
> No, we've been through that loop already a few months back:
> Transaction-controlled robustness.
>
> It should be up to the client on the primary to decide how much waiting
> they would like to perform in order to provide a guarantee. A change of
> setting on the standby should not be allowed to alter the performance
> or durability on the primary.

OK. I will extend synchronous_replication, make walsender send XLOG with synchronization mode flag and make walreceiver perform according to the flag.

> My perspective is that synchronous_replication specifies how long to
> wait. Current settings are "off" (don't wait) or "on" (meaning wait
> until point #3). So I think we should change this to a list of options
> to allow people to more carefully select how much waiting is required.

In the latest patch, "off" keeps us waiting for replication in some cases, e.g. forceSyncCommit = true. This is analogous to the way synchronous_commit works. When "off" keeps us waiting for replication, which option (#1-#6) should we choose? Should it be user-configurable (though the parameter values are doubled)? hardcode #3? "off" always should not keep us waiting for replication?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
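What Fujii-san describes could look roughly like this on the wire: the walsender tags each shipped commit with the level the committing transaction asked for, and the walreceiver acks once it reaches that point. The one-byte message layout is invented for the sketch.

import struct

LEVELS = {"recv": 2, "write": 3, "fsync": 4}   # matching points #2-#4 above

def pack_commit(wal_bytes, sync_level):
    # one byte of sync mode, then the WAL payload
    return struct.pack("!B", LEVELS[sync_level]) + wal_bytes

def unpack_commit(msg):
    (level,) = struct.unpack_from("!B", msg)
    return level, msg[1:]   # receiver acks once it reaches this point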
On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > OK. I will extend synchronous_replication, make walsender send XLOG > with synchronization mode flag and make walreceiver perform according > to the flag. Sounds good. > > My perspective is that synchronous_replication specifies how long to > > wait. Current settings are "off" (don't wait) or "on" (meaning wait > > until point #3). So I think we should change this to a list of options > > to allow people to more carefully select how much waiting is required. > > In the latest patch, "off" keeps us waiting for replication in some > cases, e.g. forceSyncCommit = true. This is analogous to the way > synchronous_commit works. When "off" keeps us waiting for > replication, which option (#1-#6) should we choose? Should it be > user-configurable (though the parameter values are doubled)? > hardcode #3? "off" always should not keep us waiting for > replication? I would hard code #4, i.e. make it fsync, so that DDL changes are regarded as "high value transactions". A parameter sounds like overkill. We'd need to explain what forceSyncCommit does to users then, which is easier to avoid. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, Thanks for the helpful comments! On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > >> OK. I will extend synchronous_replication, make walsender send XLOG >> with synchronization mode flag and make walreceiver perform according >> to the flag. > > Sounds good. > >> > My perspective is that synchronous_replication specifies how long to >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> > until point #3). So I think we should change this to a list of options >> > to allow people to more carefully select how much waiting is required. >> >> In the latest patch, "off" keeps us waiting for replication in some >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> synchronous_commit works. When "off" keeps us waiting for >> replication, which option (#1-#6) should we choose? Should it be >> user-configurable (though the parameter values are doubled)? >> hardcode #3? "off" always should not keep us waiting for >> replication? > > I would hard code #4, i.e. make it fsync, so that DDL changes are > regarded as "high value transactions". > > A parameter sounds like overkill. We'd need to explain what > forceSyncCommit does to users then, which is easier to avoid. Agreed, I also think that hard code is better. But I'm nervous that "off" keeps us waiting for replication in cases other than DDL, e.g. flush buffer, truncate clog, checkpoint.. etc. synchronous_replication = off is quite similar to synchronous_commit = off. If we would hard code #4, the performance might degrade although it's asynchronous replication. So, I'd like to hard code #3. What is your opinion? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: > Hi, > > Thanks for the helpful comments! > > On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > > > >> OK. I will extend synchronous_replication, make walsender send XLOG > >> with synchronization mode flag and make walreceiver perform according > >> to the flag. > > > > Sounds good. > > > >> > My perspective is that synchronous_replication specifies how long to > >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait > >> > until point #3). So I think we should change this to a list of options > >> > to allow people to more carefully select how much waiting is required. > >> > >> In the latest patch, "off" keeps us waiting for replication in some > >> cases, e.g. forceSyncCommit = true. This is analogous to the way > >> synchronous_commit works. When "off" keeps us waiting for > >> replication, which option (#1-#6) should we choose? Should it be > >> user-configurable (though the parameter values are doubled)? > >> hardcode #3? "off" always should not keep us waiting for > >> replication? > > > > I would hard code #4, i.e. make it fsync, so that DDL changes are > > regarded as "high value transactions". > > > > A parameter sounds like overkill. We'd need to explain what > > forceSyncCommit does to users then, which is easier to avoid. > > Agreed, I also think that hard code is better. But I'm nervous that "off" > keeps us waiting for replication in cases other than DDL, e.g. flush > buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > is quite similar to synchronous_commit = off. If we would hard code #4, > the performance might degrade although it's asynchronous replication. > So, I'd like to hard code #3. What is your opinion? We don't do that when we flush buffer, truncate clog or checkpoint, not sure why you mention those. We ForceSyncCommit when we * VACUUM FULL * CREATE/DROP DATABASE or USER * Create/Drop Tablespace I don't see a problem in forcing an fsync for those. I will sleep safer knowing those guys are on disk even in async mode. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
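For reference, the commit-time decision under discussion looks roughly like the fragment below, loosely modelled on RecordTransactionCommit() in xact.c; the replication wait at the end is a hypothetical addition showing where hard-coding #4 for forceSyncCommit would land:

    if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
    {
        /* Synchronous commit: flush the commit record's WAL locally. */
        XLogFlush(XactLastRecEnd);

        /*
         * Hypothetical: even with synchronous_replication = off, treat
         * forceSyncCommit transactions as "high value" and wait for the
         * standby to fsync (#4), per the suggestion above.
         */
        if (forceSyncCommit)
            WaitForReplication(XactLastRecEnd, REPLICATION_WAIT_FSYNC);
    }
    else
    {
        /* Asynchronous commit: the WAL writer will flush it later. */
        XLogSetAsyncCommitLSN(XactLastRecEnd);
    }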
Hi, On Thu, Dec 18, 2008 at 11:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: >> Hi, >> >> Thanks for the helpful comments! >> >> On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > >> > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: >> > >> >> OK. I will extend synchronous_replication, make walsender send XLOG >> >> with synchronization mode flag and make walreceiver perform according >> >> to the flag. >> > >> > Sounds good. >> > >> >> > My perspective is that synchronous_replication specifies how long to >> >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> >> > until point #3). So I think we should change this to a list of options >> >> > to allow people to more carefully select how much waiting is required. >> >> >> >> In the latest patch, "off" keeps us waiting for replication in some >> >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> >> synchronous_commit works. When "off" keeps us waiting for >> >> replication, which option (#1-#6) should we choose? Should it be >> >> user-configurable (though the parameter values are doubled)? >> >> hardcode #3? "off" always should not keep us waiting for >> >> replication? >> > >> > I would hard code #4, i.e. make it fsync, so that DDL changes are >> > regarded as "high value transactions". >> > >> > A parameter sounds like overkill. We'd need to explain what >> > forceSyncCommit does to users then, which is easier to avoid. >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> keeps us waiting for replication in cases other than DDL, e.g. flush >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> is quite similar to synchronous_commit = off. If we would hard code #4, >> the performance might degrade although it's asynchronous replication. >> So, I'd like to hard code #3. What is your opinion? > > We don't do that when we flush buffer, truncate clog or checkpoint, not > sure why you mention those. > > We ForceSyncCommit when we > * VACUUM FULL > * CREATE/DROP DATABASE or USER > * Create/Drop Tablespace > > I don't see a problem in forcing an fsync for those. I will sleep safer > knowing those guys are on disk even in async mode. If my understanding is correct, XLOG flush is forced up to buffer's LSN when flushing buffer even if asynchronous commit case. Am I missing something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> Agreed, I also think that hard code is better. But I'm nervous that "off" > >> keeps us waiting for replication in cases other than DDL, e.g. flush > >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > >> is quite similar to synchronous_commit = off. If we would hard code #4, > >> the performance might degrade although it's asynchronous replication. > >> So, I'd like to hard code #3. What is your opinion? > > > > We don't do that when we flush buffer, truncate clog or checkpoint, not > > sure why you mention those. > > > > We ForceSyncCommit when we > > * VACUUM FULL > > * CREATE/DROP DATABASE or USER > > * Create/Drop Tablespace > > > > I don't see a problem in forcing an fsync for those. I will sleep safer > > knowing those guys are on disk even in async mode. > > If my understanding is correct, XLOG flush is forced up to buffer's LSN > when flushing buffer even if asynchronous commit case. Am I missing > something? Yes, please check the call points for ForceSyncCommit. Do I think every xlog flush should be synchronous, no, I don't. That's why we have a user settable parameter for it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Thu, Dec 18, 2008 at 6:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> >> keeps us waiting for replication in cases other than DDL, e.g. flush >> >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> >> is quite similar to synchronous_commit = off. If we would hard code #4, >> >> the performance might degrade although it's asynchronous replication. >> >> So, I'd like to hard code #3. What is your opinion? >> > >> > We don't do that when we flush buffer, truncate clog or checkpoint, not >> > sure why you mention those. >> > >> > We ForceSyncCommit when we >> > * VACUUM FULL >> > * CREATE/DROP DATABASE or USER >> > * Create/Drop Tablespace >> > >> > I don't see a problem in forcing an fsync for those. I will sleep safer >> > knowing those guys are on disk even in async mode. >> >> If my understanding is correct, XLOG flush is forced up to buffer's LSN >> when flushing buffer even if asynchronous commit case. Am I missing >> something? > > Yes, please check the call points for ForceSyncCommit. > > Do I think every xlog flush should be synchronous, no, I don't. That's > why we have a user settable parameter for it. Umm.. I focus attention on XLogFlush() called except RecordTransactionCommit(). For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These XLogFlush() might flush XLOG synchronously even if asynchronous commit case. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote: > > Yes, please check the call points for ForceSyncCommit. > > > > Do I think every xlog flush should be synchronous, no, I don't. > That's why we have a user settable parameter for it. > > Umm.. I focus attention on XLogFlush() called except > RecordTransactionCommit(). > For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These > XLogFlush() might > flush XLOG synchronously even if asynchronous commit case. XLogFlush() flushes because of an interlock between a dirty buffer write and an outstanding WAL write. Dirty buffer writes are not replicated, so there is no need to have a similar interlock on WAL streaming. So making those call points synchronous is possible, but neither necessary or IMHO desirable. On a related but different point: We don't need an interlock between dirty buffers and WAL during recovery because the WAL has already been written. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
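Much simplified, the interlock in question sits in FlushBuffer() in bufmgr.c. The local XLogFlush() enforces the "WAL before data" rule; the final comment marks the spot where a cross-server wait could go, and why it isn't needed:

    /*
     * Force the WAL describing this page to disk before writing the page
     * itself, so crash recovery can never find a data page whose WAL is
     * missing ("WAL before data").
     */
    XLogRecPtr  recptr = BufferGetLSN(buf);  /* newest WAL record touching
                                              * this page */
    XLogFlush(recptr);                       /* local flush: required */
    /* ... smgrwrite() then writes out the dirty page ... */

    /*
     * No equivalent wait on the WAL stream is needed here: the standby
     * rebuilds its pages from WAL alone, never from shipped data pages.
     */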
Simon Riggs wrote: > On a related but different point: We don't need an interlock between > dirty buffers and WAL during recovery because the WAL has already been > written. Assuming the WAL has also been fsync'd. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 11:04 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On a related but different point: We don't need an interlock between > > dirty buffers and WAL during recovery because the WAL has already been > > written. > > Assuming the WAL has also been fsync'd. True, so this will need to change for 8.4 also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Mark Mielke wrote: > Where does the expectation come from? I find the seat reservation, bank account or stock trading examples pretty obvious WRT user expectations. Nonetheless, I've compiled some hints from the documentation and sources: "Since in Read Committed mode each new command starts with a new snapshot that includes all transactions committed up to that instant" [1]. "This [SERIALIZABLE ISOLATION] level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently." [1]. (IMO this implies, that a transaction "sees" changes from all preceding transactions). "All changes made by the transaction become visible to others and are guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not overly clear here, when exactly the changes become visible. OTOH, there's no warning, that another session doesn't immediately see committed transactions. Not sure where you got that from). > I don't recall ever reading it in > the documentation, and unless the session processes are contending over > the integers (using some sort of synchronization primitive) in memory > that represent the "latest visible commit" on every single select, I'm > wondering how it is accomplished? See the transaction system's README [3]. It documents the process of snapshot taking and transaction isolation pretty well. Around line 226 it says: "What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking". (So the outcome of your experiment is no surprise at all). And a bit later: "This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.". While this implies, that an optimization is theoretically possible, I very much doubt it would be worth it (for a single node system). In a distributed system, things are a bit different. Network latency is an order of magnitude higher than memory latency (for IPC). So a similar optimization is very well worth it. However, the application (or the load balancer or both) need to know about this potential lag between nodes. And as you've outlined elsewhere, a limit for how much a single node may lag behind needs to be established. (As a side note: for a multi-master system like Postgres-R, it's beneficial to keep the lag time as low as possible, because the larger the lag, the higher the probability for a conflict between two transactions on different nodes.) Regards Markus Wanner [1]: Pg 8.3 Docu: Concurrency Control: http://www.postgresql.org/docs/8.3/static/transaction-iso.html [2]: Pg 8.3 Docu: COMMIT command: http://www.postgresql.org/docs/8.3/static/sql-commit.html [3]: README of transam (src/backend/access/transam/README): https://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/backend/access/transam/README#L224
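In current sources the strict serialization the README describes comes down to lock ordering on ProcArrayLock; the fragment below paraphrases ProcArrayEndTransaction() and GetSnapshotData(), heavily simplified:

    /* Committing backend, ProcArrayEndTransaction() (simplified): */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    /* clear our XID from the shared proc array: the transaction becomes
     * visible to all future snapshots in one atomic step */
    LWLockRelease(ProcArrayLock);

    /* Snapshot-taking backend, GetSnapshotData() (simplified): */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    /* collect the still-running XIDs; because committers hold the lock
     * exclusively, a commit is observed either entirely or not at all */
    LWLockRelease(ProcArrayLock);

This is why the experiment mentioned above could not observe a half-visible commit on a single node, and also why the same guarantee does not extend for free across a network of nodes.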
Good answers, Markus. Thanks.

I've bought the thinking of several here that the user should have some control over what they expect (and what optimizations they are willing to accept as a good choice), but that commit should still be able to have a capped time limit.

I can think of many of my own applications where I would choose one mode vs another mode, even within the same application, depending on the operation itself. The most important requirement is that transactions are durable. It becomes convenient, though, to provide additional guarantees for some operation sequences.

I still see the requirement for seat reservation, bank account, or stock trading, as synchronizing using read-write locks before starting the select, rather than enforcing latest on every select.

For my own bank, when I do an online transaction, operations don't always immediately appear in my list of transactions. They appear to sometimes be batched, sometimes in near real time, and sometimes as part of some sort of day-end processing.

For seat reservation, the time the seat layout is shown on the screen is not usually locked during a transaction. Between the time the travel agent brings up the seats on the plane, and the time they select the seat, the seat could be taken. What's important is that the reservation is durable, and that conflicts are not introduced. The commit must fail if another person has chosen the seat already. The commit does not need to wait until the reservation is pushed out to all systems before completing. The same is true of stock trading.

However, it can be very convenient for commits to be immediately visible after the commit completes. This allows for lazier models, such as a web site that reloads the view on the reservations or recent trades and expects to see recent commits no matter which server it accesses, rather than taking into account that the commit succeeded when presenting the next view.

If I look at sites like Google - they take the opposite extreme. I can post a message, and it remembers that I posted the message and makes it immediately visible, however, I might not see other new messages in a thread until a minute or more later.

So it looks like there is value to both ends of the spectrum, and while I feel the most value would be in providing a very fast system that scales near linear to the number of nodes in the system, even at the expense of immediately visible transactions from all servers, I can accept that sometimes the expectations are stricter and would appreciate seeing an option to let me choose based upon my requirements.

Cheers, mark

Markus Wanner wrote: > [full quote of the preceding message trimmed]

-- Mark Mielke <mark@mielke.cc>
Hi, Mark Mielke wrote: > Robert Haas wrote: >> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> We won't call it anything, because we never will or can implement that. >>> See the theory of relativity: the notion of exactly simultaneous events >> >> OK, fine. I'll be more precise. I think we need to reserve the term >> "synchronous replication" for a system where transactions that begin >> on the standby after the transaction has committed on the master see >> the effects of the committed transaction. I agree with Robert here. As far as I know this is the common understanding of "synchronous replication". Everything less - including Postgres-R - is considered to be asynchronous. > I'd like to see proof of some sort that PostgreSQL guarantees that the > instant a 'commit' returns, any transactions already open with the > appropriate transaction isolation level, or any new sessions *will* see > the results of the commit. It's given within this thread, here [1]. > Two phase commit doesn't imply that the transaction is guaranteed to be > immediately visible. Just for the record: that's plain wrong. As with any other transaction, a COMMIT of a prepared transaction guarantees visibility from all subsequent snapshots (at least for Postgres and other serious RDBMSen). Systems based on 2PC are the typical synchronous replication solution: works, resistant to failures, consistent across nodes (WRT visibility), but unusably slow. This is what people have in mind and expect when they hear "synchronous replication" for databases. (And which is why I'm thinking it's better for an optimized solution not to call itself "synchronous"). > Unless transactions are > locked from starting until they are able to prove that they have the > latest commit See the cited README. It already happens for (single node) Postgres systems, because the action of snapshot taking and committing are serialized. > (a feat which I'm going to theorize as impossible - > because the moment you wait for a commit, and you begin again, you > really have no guarantee that another commit has not occurred in the > mean time) This problem is solved by locking. Regards Markus Wanner [1]: Hints to docs and source, that COMMIT actually ensures subsequent snapshots "include" changes of the committed transaction: http://archives.postgresql.org/message-id/494CFFFF.2060200@bluegap.ch
Hi, Josh Berkus wrote: > Peter Eisentraut wrote: >> It's the color of the bikeshed ... Agreed. It's why I've decided to support various modes for Postgres-R. I'm glad to see that the current "Sync Rep" approach does the same. > Hmmm. I thought this was pretty clear. There's three levels of synch > which are useful features: > > 1) "synchronous" standby which is really asynchronous, but only has a gap > of < 100ms. A synchronous standby which is really asynchronous? That's exactly the naming challenge I've been pointing to. Commonly used terms are: "virtually synchronous", "approximately synchronous", "near-real-time replication" or "eager replication", but for most users, this is not "synchronous" (enough). (BTW: there's no such "< 100 ms" guarantee. It may be typically below 100 ms, or even below 10 ms on average. But replication is not about the typical or average case. It's much more about failures and uncommon cases. The guarantee you can get in such a system (by declaring a node as dead) is much more likely to be within the range of several seconds and more, be it network, disk or whatever other failure-timeout that applies here.) > 2) Synchronous standby which guarantees that all committed transactions > are on the failover node and that no data will be lost for failover, but > the failover node is still in standby mode. What's the difference to 1) here? I'm not following. > 3) Synchronous replication where the standby node has identical > transactions to the master node, and is queryable read-only. So, a synchronous standby is different from synchronous replication in that it's asynchronous? Sorry for bugging with naming, but I think it is important for an understanding during development. > Any of these levels would be useful and allow a certain number of our > users to deploy PostgreSQL in an environment where it wasn't used > before. I absolutely agree to that statement. However, please do not confuse future users (and today's hackers), but instead use existing terms consistently and clearly. Something that lags behind, potentially by several seconds (in case of failure) is commonly considered asynchronous, no matter how close to "immediate" it is on average. Regards Markus Wanner
Hi, Mark Mielke wrote: > Good answers, Markus. Thanks. You are welcome. > So it looks like there is value to both ends of the spectrum, and while > I feel the most value would be in providing a very fast system that > scales near linear to the number of nodes in the system, even at the > expense of immediately visible transactions from all servers, I can > accept that sometimes the expectations are stricter and would appreciate > seeing an option to let me choose based upon my requirements. I absolutely agree to that. The original Postgres-R algorithm covers the eager (or virtually synchronous) part. I'm planning to extend it with a (fully) synchronous mode and let the user choose per transaction. Regards Markus Wanner
Hi, On Fri, Dec 19, 2008 at 5:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote: > >> > Yes, please check the call points for ForceSyncCommit. >> > >> > Do I think every xlog flush should be synchronous, no, I don't. >> That's why we have a user settable parameter for it. >> >> Umm.. I focus attention on XLogFlush() called except >> RecordTransactionCommit(). >> For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These >> XLogFlush() might >> flush XLOG synchronously even if asynchronous commit case. > > XLogFlush() flushes because of an interlock between a dirty buffer write > and an outstanding WAL write. Dirty buffer writes are not replicated, so > there is no need to have a similar interlock on WAL streaming. > > So making those call points synchronous is possible, but neither > necessary or IMHO desirable. Yes in upcoming 8.4, but probably no in the future. What if the primary fails after writing the dirty data buffer before sending the corresponding logs? This would make data on the primary and logs on the standby inconsistent. In 8.4, such inconsistency might not matter because we don't use the data on the failed primary for recovery (when restarting the failed server, we always need a fresh backup). But, since this restriction is not good for some people, in the future, the failed server should restart without a fresh backup, and the inconsistency would be problem. So, I think that the inconsistency should be removed even if asynchronous replication case, and we should enforce "WAL rule" over some servers. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
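A sketch of what "enforcing the WAL rule over some servers" could mean at the dirty-page write described above. The local flush is what happens today; the replication wait is hypothetical, and is exactly the extra synchronization point the thread goes on to debate:

    XLogRecPtr  recptr = BufferGetLSN(buf);
    XLogFlush(recptr);      /* today's rule: local WAL before local data */

    /*
     * Hypothetical cross-server rule: don't let the page hit disk on the
     * primary until the standby has at least received WAL up to the
     * page's LSN, so a surviving standby's WAL always covers the
     * primary's on-disk data.
     */
    WaitForReplication(recptr, REPLICATION_WAIT_RECV);
    /* ... then write out the dirty page ... */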
Hi, Simon Riggs wrote: > The second way can be done by taking a snapshot on the primary, with an > associated LSN, then using that snapshot on the standby. That is > somewhat complex, but possible. I see the requirement for getting the > same answer on multiple nodes as a further extension of "transaction > isolation mode" and think that not all people will want this, so we > should allow that as an option. I've been thinking a bit about this pretty interesting idea. It's certainly of interest for Postgres-R as well. AFAIK a function could simply wait until the node which is being queried reaches a given point in time of application of transactions (an LSN, in the Sync-Rep world). Calling such a waiting function just after BEGIN would ensure that it sees (at least) the given snapshot. If that snapshot has already been reached or passed, the function does nothing. What I like is that it's optimistic, in that the wait is only enforced when needed by the reader. However, unlike enforcing the wait before COMMIT, it requires changing the application to cope with this behavior of the distributed database system. And knowing when to require which snapshot sounds rather difficult from the point of view of the application developer. Also note that it might be the issuer of the transaction who wants to ensure "his" transaction got propagated to the remote nodes. > I'm not going to worry about this at the moment. Hot standby will be > useful without this and so I regard this as a secondary objective. Rome > wasn't built in a single release, or something like that. Sounds like a decent plan. Good luck. Regards Markus Wanner
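A minimal sketch of such a waiting function on the standby side, assuming a hypothetical GetLastReplayedLSN() accessor; nothing like this exists in the patch:

    void
    WaitForReplayLSN(XLogRecPtr target)
    {
        for (;;)
        {
            /* hypothetical: LSN up to which recovery has applied WAL */
            XLogRecPtr  replayed = GetLastReplayedLSN();

            if (XLByteLE(target, replayed))
                break;          /* a snapshot taken now covers 'target' */

            pg_usleep(10000L);  /* poll; a real implementation would
                                 * rather block on a latch */
        }
    }

Called just after BEGIN with an LSN obtained from the primary, this gives the "at least that snapshot" guarantee described above, while readers that don't care pay nothing.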
Hi, Emmanuel Cecchet wrote: > What the application is going to see is a failure when the postmaster it > is connected to is going down. If this happen at commit time, I think > that there is no guarantee for the application to know what happened: > 1. failure occurred before the request reached postmaster: no instance > committed > 2. failure occurred during commit: might be committed on either nodes > 3. failure occurred while sending back ack of commit to client: both > instances have committed > But for the client, it will all look the same: an error on commit. This is very much the same for a single node database system, so I think current application developers are used to that behavior. A distributed database system just needs to make sure, that committed transactions can and will eventually get committed on all nodes. So in case the client doesn't receive a COMMIT acknowledgment due to atny kind of failure, it can only be sure the distributed database is in a consistent state. The client cannot tell, if it applied the transaction or not. Much like for single node systems. I agree, that a distributed database system could theoretically do better, but that would require changes on the client side connection library as well, letting the client connect to two or even more nodes of the distributed database system. > This is just to point out that despite all your efforts, the client > might think that some transactions have failed (error on commit) but > they are actually committed. As pointed out above, that would currently be an erroneous conclusion. > If you don't put some state in the driver > that is able to check at failover time if the commit operation succeeded > or not, it does not really matter what happens for in-flight > transactions (or in-commit transactions) at failure time. Sure it does. The database system still needs to guarantee consistency. Hm.. well, you're right if there's only one single standby left (as is obviously the case for the proposed Sync Rep). Ensuring consistency is pretty simple in such a case. But imagine having two or more standby servers. Those would need to agree on a set of in-flight transactions from the master they both need to apply. > Actually, if there was a way to query the database about the status of a > particular transaction by providing a cluster-wide unique id, that would > help a lot. You're certainly aware, that Postgres-R features such a global transaction id... And I guess Sync Rep could easily add such an identifier as well. > I wrote a paper on the issues with database replication at > Sigmod earlier this year (http://infoscience.epfl.ch/record/129042). > Even though it was targeted at middleware replication, I think that some > of it is still relevant for the problem at hand. Interesting read, thanks for the pointer. It's pretty obvious that I don't consider Postgres-R to be obsolete... I've linked your paper from www.postgres-r.org [1]. > Regarding the wording, if experts can't agree, you can be sure that > users won't either. Most of them don't have a clue about the different > flavors of replication. So as long as you state clearly how it behaves > and define all the terms you use that should be fine. I mostly agree to that, without repeating my concerns again, here. Regards Markus Wanner [1]: Referenced Papers from Postgres-R website: http://www.postgres-r.org/documentation/references
Hi Markus, I am happy to see that Postgres-R is alive again. The paper was written in 07 (and published in 08, the review process is longer than a CommitFest ;-)) and at the time of the writing there was no version of Postgres-R available, hence the 'obsolete' mention referring to past versions. I think that it is legitimate for users to expect more guarantees from a replicated database than from a single database. Not knowing what happen when a failure happens at commit time when some nodes are still active in a cluster is not intuitive for users. I did not look at the source, but if Postgres -R continue to elaborate on Bettina's ideas with writeset extraction and a certification protocol, I think that it will be a bad idea to try to mix it with Sync Rep (mentioned in another thread). If you delay commits, you will increase the window for transactions to conflict and therefore induce a higher abort rate (thus less scalability). Certification-based approaches have already multiple reliability issues to improve write performance compared to statement-based replication, but this is very dependent on the capacity of the system to limit the conflicting window for concurrent transactions. The writeset extraction mechanisms have had too many limitations so far to allow the use of certification-based replication in production (AFAIK). Good luck with Postgres-R. Emmanuel > Emmanuel Cecchet wrote: > >> What the application is going to see is a failure when the postmaster it >> is connected to is going down. If this happen at commit time, I think >> that there is no guarantee for the application to know what happened: >> 1. failure occurred before the request reached postmaster: no instance >> committed >> 2. failure occurred during commit: might be committed on either nodes >> 3. failure occurred while sending back ack of commit to client: both >> instances have committed >> But for the client, it will all look the same: an error on commit. >> > > This is very much the same for a single node database system, so I think > current application developers are used to that behavior. > > A distributed database system just needs to make sure, that committed > transactions can and will eventually get committed on all nodes. So in > case the client doesn't receive a COMMIT acknowledgment due to atny kind > of failure, it can only be sure the distributed database is in a > consistent state. The client cannot tell, if it applied the transaction > or not. Much like for single node systems. > > I agree, that a distributed database system could theoretically do > better, but that would require changes on the client side connection > library as well, letting the client connect to two or even more nodes of > the distributed database system. > > >> This is just to point out that despite all your efforts, the client >> might think that some transactions have failed (error on commit) but >> they are actually committed. >> > > As pointed out above, that would currently be an erroneous conclusion. > > >> If you don't put some state in the driver >> that is able to check at failover time if the commit operation succeeded >> or not, it does not really matter what happens for in-flight >> transactions (or in-commit transactions) at failure time. >> > > Sure it does. The database system still needs to guarantee consistency. > > Hm.. well, you're right if there's only one single standby left (as is > obviously the case for the proposed Sync Rep). Ensuring consistency is > pretty simple in such a case. 
But imagine having two or more standby > servers. Those would need to agree on a set of in-flight transactions > from the master they both need to apply. > > >> Actually, if there was a way to query the database about the status of a >> particular transaction by providing a cluster-wide unique id, that would >> help a lot. >> > > You're certainly aware, that Postgres-R features such a global > transaction id... And I guess Sync Rep could easily add such an > identifier as well. > > >> I wrote a paper on the issues with database replication at >> Sigmod earlier this year (http://infoscience.epfl.ch/record/129042). >> Even though it was targeted at middleware replication, I think that some >> of it is still relevant for the problem at hand. >> > > Interesting read, thanks for the pointer. It's pretty obvious that I > don't consider Postgres-R to be obsolete... > > I've linked your paper from www.postgres-r.org [1]. > > >> Regarding the wording, if experts can't agree, you can be sure that >> users won't either. Most of them don't have a clue about the different >> flavors of replication. So as long as you state clearly how it behaves >> and define all the terms you use that should be fine. >> > > I mostly agree to that, without repeating my concerns again, here. > > Regards > > Markus Wanner > > > [1]: Referenced Papers from Postgres-R website: > http://www.postgres-r.org/documentation/references > > -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: manu@frogthinker.org Skype: emmanuel_cecchet
Hello Emmanuel, Emmanuel Cecchet wrote: > I am happy to see that Postgres-R is alive again. The paper was written > in 07 (and published in 08, the review process is longer than a > CommitFest ;-)) and at the time of the writing there was no version of > Postgres-R available, hence the 'obsolete' mention referring to past > versions. Understood. > I think that it is legitimate for users to expect more guarantees from a > replicated database than from a single database. Not knowing what happen > when a failure happens at commit time when some nodes are still active > in a cluster is not intuitive for users. I absolutely agree to that. However, it's lower priority for me. > I did not look at the source, but if Postgres -R continue to elaborate > on Bettina's ideas with writeset extraction and a certification > protocol, I think that it will be a bad idea to try to mix it with Sync > Rep (mentioned in another thread). I'm not quite sure what you mean by "certification protocol", there's no such thing in Postgres-R (as proposed by Kemme). Although, I remember having heard that term in the context of F. Pedone's work. Can you point me to some paper explaining this certification protocol? > If you delay commits, you will > increase the window for transactions to conflict and therefore induce a > higher abort rate (thus less scalability). This assumes that *all* types of transactions are unlikely to conflict. But there sometimes just are transactions with a very high probability for conflicts with other transactions. Applying optimistic locking (as the original Postgres-R algorithm does) cannot be efficient in such a case, because of lots of useless retries. (It could even lead to starvation of long running transactions, which always get aborted be shorter conflicting ones). > Certification-based > approaches have already multiple reliability issues to improve write > performance compared to statement-based replication, but this is very > dependent on the capacity of the system to limit the conflicting window > for concurrent transactions. What do you mean by "reliability issues"? Keeping the "conflicting window" as narrow as possibly certainly benefits performance, yes. But keeping the retry rate low also helps a lot (and influences the conflict window in turn). > The writeset extraction mechanisms have had > too many limitations so far to allow the use of certification-based > replication in production (AFAIK). What limitations are you speaking of here? > Good luck with Postgres-R. Thank you. Regards Markus Wanner
Hi, On Wed, Dec 17, 2008 at 12:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> No, we've been through that loop already a few months back: >> Transaction-controlled robustness. >> >> It should be up to the client on the primary to decide how much waiting >> they would like to perform in order to provide a guarantee. A change of >> setting on the standby should not be allowed to alter the performance or >> durability on the primary. > > OK. I will extend synchronous_replication, make walsender send XLOG > with synchronization mode flag and make walreceiver perform according > to the flag. Not so simple. At least the primary has to additionally maintain the byte position the standby has already fsynced. The main difference from the current patch is whether the standby fsyncs the logfile when it fills even if you don't choose #4 (fsync). To prevent having to go back and re-open prior logfiles when an fsync request comes along later, we would need to ignore the sync mode and make the standby fsync the logfile when it fills. This would degrade the performance periodically. Is this acceptable? I think there are four choices. Which do you prefer?

1) Accept the above change.
2) Go back and re-open prior logfiles when an fsync request comes along.
3) Stop the sync control by the primary and leave it to the standby.
4) Add a new option to specify whether to permit optimistic fsync; this option makes the standby fsync only the current logfile when an fsync request comes along (don't go back and re-open prior logfiles).

2) would cause another performance degradation. 4) would furthermore confuse users about setting a sync mode. So, I prefer 3), though I'm sorry for digging up the discussion about transaction control. Please feel free to comment! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
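In walreceiver terms, choice 1) might look like the fragment below (the helpers and the waitLevel flag are hypothetical); the unconditional fsync at a segment switch is what buys the freedom to never re-open prior logfiles:

    /* After writing a chunk of WAL on the standby (hypothetical sketch): */
    if (segment_is_full)
    {
        issue_xlog_fsync();         /* always fsync a completed segment,
                                     * whatever the sync mode flag says */
        advance_to_next_segment();  /* hypothetical helper */
    }
    else if (msg->waitLevel >= REPLICATION_WAIT_FSYNC)
        issue_xlog_fsync();         /* mid-segment fsync only on request */

The cost is one forced fsync per 16MB segment even for purely asynchronous standbys, which is the periodic degradation being asked about.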
Hi Markus, > I'm not quite sure what you mean by "certification protocol", there's no > such thing in Postgres-R (as proposed by Kemme). Although, I remember > having heard that term in the context of F. Pedone's work. Can you point > me to some paper explaining this certification protocol? > What Bettina calls the Lock Phase in http://www.cs.mcgill.ca/~kemme/papers/vldb00.pdf is actually a certification. You can find more references to certification protocols in http://gorda.di.uminho.pt/download/reports/gapi.pdf I would also recommend the work of Sameh on Tashkent and Tashkent+ that was based on Postgres: http://labos.epfl.ch/webdav/site/labos/users/157494/public/papers/tashkent.eurosys2006.pdf and http://infoscience.epfl.ch/record/97654/files/tashkentPlus.eurosys2007.final.pdf >> Certification-based >> approaches have already multiple reliability issues to improve write >> performance compared to statement-based replication, but this is very >> dependent on the capacity of the system to limit the conflicting window >> for concurrent transactions. >> > > What do you mean by "reliability issues"? > These approaches usually require an atomic broadcast primitive that is often fragile (limited scalability, hard to tune failure timeouts, etc.). Most prototype implementations have the load balancer and/or the certifier as a SPOF (single point of failure). Building reliability for these components will come with a significant performance penalty. >> The writeset extraction mechanisms have had >> too many limitations so far to allow the use of certification-based >> replication in production (AFAIK). >> > What limitations are you speaking of here? > Oftentimes DDL support is very limited. Non-transactional objects like sequences are not captured. Session or environment variables are not necessarily propagated. Support of temp tables varies between databases, which makes it hard to support them properly in a generic way. Well, I guess everyone has a story on some limitations they have found with some database replication technology, especially when a user expects a cluster to behave like a single database instance. Happy holidays, Emmanuel -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: manu@frogthinker.org Skype: emmanuel_cecchet
On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > > XLogFlush() flushes because of an interlock between a dirty buffer write > > and an outstanding WAL write. Dirty buffer writes are not replicated, so > > there is no need to have a similar interlock on WAL streaming. > > > > So making those call points synchronous is possible, but neither > > necessary or IMHO desirable. > > Yes in upcoming 8.4, but probably no in the future. > > What if the primary fails after writing the dirty data buffer before sending > the corresponding logs? This would make data on the primary and logs > on the standby inconsistent. In 8.4, such inconsistency might not matter > because we don't use the data on the failed primary for recovery (when > restarting the failed server, we always need a fresh backup). But, since > this restriction is not good for some people, in the future, the failed server > should restart without a fresh backup, and the inconsistency would be > problem. So, I think that the inconsistency should be removed even if > asynchronous replication case, and we should enforce "WAL rule" over > some servers. I don't get this argument. Why would we care what happens on the failed server? The additional synchronizations you suggest are neither necessary, nor IMHO desirable. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Tue, Dec 23, 2008 at 5:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > >> > XLogFlush() flushes because of an interlock between a dirty buffer write >> > and an outstanding WAL write. Dirty buffer writes are not replicated, so >> > there is no need to have a similar interlock on WAL streaming. >> > >> > So making those call points synchronous is possible, but neither >> > necessary or IMHO desirable. >> >> Yes in upcoming 8.4, but probably no in the future. >> >> What if the primary fails after writing the dirty data buffer before sending >> the corresponding logs? This would make data on the primary and logs >> on the standby inconsistent. In 8.4, such inconsistency might not matter >> because we don't use the data on the failed primary for recovery (when >> restarting the failed server, we always need a fresh backup). But, since >> this restriction is not good for some people, in the future, the failed server >> should restart without a fresh backup, and the inconsistency would be >> problem. So, I think that the inconsistency should be removed even if >> asynchronous replication case, and we should enforce "WAL rule" over >> some servers. > > I don't get this argument. Why would we care what happens on the failed server? It's because, in the future, I'd like to use the data on the failed server when making it catch up with new primary. This desire might be violated by the inconsistency which I described. > > The additional synchronizations you suggest are neither necessary, nor > IMHO desirable. Not additional. It's quite analogous to synchronous_commit. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote: > > I don't get this argument. Why would we care what happens on the > failed server? > > It's because, in the future, I'd like to use the data on the failed > server when making it catch up with new primary. This desire might be > violated by the inconsistency which I described. I don't really understand why you would put something in there that has no use at all. Why make every server in the world do extra synchronisation? Whatever you build in the future can include this, if that is still a required point at the time you add the new feature. Are you thinking about switchover rather than failover? I'm sure a graceful switchover doesn't need this. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Tue, Dec 23, 2008 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote: >> > I don't get this argument. Why would we care what happens on the >> failed server? >> >> It's because, in the future, I'd like to use the data on the failed >> server when making it catch up with new primary. This desire might be >> violated by the inconsistency which I described. > > I don't really understand why you would put something in there that has > no use at all. Why make every server in the world do extra > synchronisation? > > Whatever you build in the future can include this, if that is still a > required point at the time you add the new feature. Right. But since it's difficult to change a specification once it's fixed, I'm thinking about it now with the future in mind. But, since I cannot obtain consensus from hackers including you, I would change my course, and forbid XLogFlush (called from other than RecordTransactionCommit) to replicate xlog synchronously if asynchronous replication case. BTW, here are the callers other than RecordTransactionCommit.

- CreateCheckPoint()
- EndPrepare()
- FlushBuffer()
- RecordTransactionAbortPrepared()
- RecordTransactionCommitPrepared()
- RelationTruncate()
- SlruPhysicalWritePage()
- WriteTruncateXlogRec()
- XLogAsyncCommitFlush()

> > Are you thinking about switchover rather than failover? I'm sure a > graceful switchover doesn't need this. Yes, switchover is one of case example I care. Typically, I care about restarting the failed server (original primary) after failover: ------------- 1. a dirty buffer page is chosen as victim of buffer replacement 2. flush xlog up to the buffer's LSN on only primary 3. write out the dirty buffer page 4. primary fails (replication up to buffer's LSN is not performed) The above case produces inconsistency between data on the original primary (failed server) and xlogs on the original standby (new primary after failover). Isn't this right? 5. restart the failed server and make it catch up with new primary We cannot recycle the existing data on the failed server because of that inconsistency. I think this restriction should be removed. ------------- Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > But, since I cannot obtain consensus from hackers including you, > I would change my course, and forbid XLogFlush (called from other > than RecordTransactionCommit) to replicate xlog synchronously > if asynchronous replication case. > Since synchronous/asynchronous behavior of replication is tied to a transaction (even if there is global default) , I don't understand why we should not ship the xlogs to the standby when xlogs are written on primary outside of a transaction context. This is quite same as we do with asynchronous_commit where we flush the xlog to disk at certain points irrespective of the synchronization set. > Yes, switchover is one of case example I care. Typically, I care > about restarting the failed server (original primary) after failover: > I think this is a very important requirement because it's quite unrealistic to expect that every time there is a failover, fresh backup is required for the old primary to join back the replication. > ------------- > 1. a dirty buffer page is chosen as victim of buffer replacement > 2. flush xlog up to the buffer's LSN on only primary > 3. write out the dirty buffer page > 4. primary fails > (replication up to buffer's LSN is not performed) > > The above case produces inconsistency between data on the > original primary (failed server) and xlogs on the original standby > (new primary after failover). Isn't this right? > Yes, it would create inconsistency which I don't think can be corrected without a fresh backup. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-23 at 16:54 +0530, Pavan Deolasee wrote: > On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > > But, since I cannot obtain consensus from hackers including you, > > I would change my course, and forbid XLogFlush (called from other > > than RecordTransactionCommit) to replicate xlog synchronously > > if asynchronous replication case. > > Since synchronous/asynchronous behavior of replication is tied to a > transaction (even if there is global default) , I don't understand why > we should not ship the xlogs to the standby when xlogs are written on > primary outside of a transaction context. This is quite same as we do > with asynchronous_commit where we flush the xlog to disk at certain > points irrespective of the synchronization set. We stream constantly from primary to standby. That point is not being debated. The issue is whether we should add additional synchronisation points (i.e. additional times we need to wait) into the WAL stream. Currently, I have said no because this has no purpose in the current design: definitely not performance, not robustness, not code clarity. Specifically, we're talking about slowing down WAL flushes required because of dirty page replacement, amongst others. That's not something I want to see slowed down on a server that has specifically opted for asynchronous replication, presumably because of a slow link. The other call points are also potential contention points. > > Yes, switchover is one of case example I care. Typically, I care > > about restarting the failed server (original primary) after failover: > > > > I think this is a very important requirement because it's quite > unrealistic to expect that every time there is a failover, fresh > backup is required for the old primary to join back the replication. I personally don't expect that, because we have rsync. If that is a very important requirement then the current software needs to include all the aspects of a feature, not just some of them. Either we include a whole feature or we leave it out. A release will need to stand for 5+ years, so supporting extraneous features is troublesome and wasteful. Currently, Fujii-san has stated he is not planning to allow fast resynchronization in 8.4, so why would we need this? If we were to add fast resynchronisation as a feature in 8.4, then I will be happy to have *all* required changes included. People mention it enough that I would be happy to see the whole feature added in this release -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, Dec 23, 2008 at 5:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > We stream constantly from primary to standby. That point is not being > debated. The issue is whether we should add additional synchronisation > points (i.e. additional times we need to wait) into the WAL stream. > Currently, I have said no because this has no purpose in the current > design: definitely not performance, not robustness, not code clarity. > > Specifically, we're talking about slowing down WAL flushes required > because of dirty page replacement, amongst others. That's not something > I want to see slowed down on a server that has specifically opted for > asynchronous replication, presumably because of a slow link. The other > call points are also potential contention points. So we would still be sending WAL to standby at XLogWrite time (and I think that's necessary). The question is whether we should wait for standby ack at XLogFlush time, right ? Hmm. I think the argument for that would be what Fujii-san described for maintaining consistency between data and WAL. I agree with you that we should add additional synchronization points only if they give us any real value in administrating replication setup. Personally, I would like to have a simple setup where I can initially setup primary and standby and they continue to work in a single-failure mode without any additional administrative overhead (such as rsync). But that's just me and I don't know what the preferred option in the field. BTW, I won't be too much worried about dirty buffer case because the WAL synchronization at that point usually occurs much later than the WAL is actually sent to the standby. I would imagine that most of the time WAL would have made to standby by that time. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-23 at 18:36 +0530, Pavan Deolasee wrote:

> Personally, I would like a simple setup where I can initially set up
> primary and standby and they continue to work in single-failure mode
> without any additional administrative overhead (such as rsync). But
> that's just me, and I don't know what the preferred option is in the
> field.

If you want a tripod, you need to turn up with all 3 legs. :-)

PostgreSQL is a working product, not a framework or a function library.
We're not going to add code that has no function at all other than as
part of a larger feature, unless we add the whole feature.

I'm happy if that whole feature is added. If we do add it, it will be a
utility like "pg_resync". So in admin terms it will be almost identical
to using rsync, just a specific version that minimizes effort even more
than rsync does currently. The only difference as I see it would be
some gain in performance, but we don't need to send the whole database
down the wire again in either case.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm happy if that whole feature is added. If we do add it, it will be a
> utility like "pg_resync". So in admin terms it will be almost identical
> to using rsync, just a specific version that minimizes effort even more
> than rsync does currently. The only difference as I see it would be
> some gain in performance, but we don't need to send the whole database
> down the wire again in either case.

I think your type of user is different from mine. If a server fails by
simple termination of the process, I don't want to spend an extra
minute on restarting it beyond the catch-up itself. For me, taking a
fresh backup (not only copying the backup data but also the checkpoint
forced by pg_start_backup) is an expensive operation.

Of course, since I'm not planning to tackle that problem in 8.4, I
would not add an "additional" synchronization point.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Tue, Dec 23, 2008 at 11:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Of course, since I'm not planning to tackle that problem in 8.4, I
> would not add an "additional" synchronization point.

On second thought: for the normal shutdown case, we probably should
force synchronous replication in CreateCheckPoint at least.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
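A sketch of that second thought, under the same assumed names as the sketch
above (SyncRepWaitForAck, plus a hypothetical InsertCheckpointRecord); this is
not actual patch code. The shutdown checkpoint is the last WAL the primary
writes before going down, so it is replicated synchronously even when the
configured mode is asynchronous.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    #define CHECKPOINT_IS_SHUTDOWN 0x0001

    extern void SyncRepWaitForAck(XLogRecPtr upto);
    extern XLogRecPtr InsertCheckpointRecord(int flags);  /* hypothetical */

    static void
    CreateCheckPointSketch(int flags)
    {
        XLogRecPtr recptr = InsertCheckpointRecord(flags);

        /* force a synchronous round-trip for the shutdown checkpoint so
         * the standby is known to hold the complete WAL stream before
         * the primary goes away */
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            SyncRepWaitForAck(recptr);
    }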
On Tue, 2008-12-23 at 23:31 +0900, Fujii Masao wrote:

> I think your type of user is different from mine.

Perhaps, but why do you say that? I've not blocked you from adding
anything useful to Postgres.

> If a server fails by simple termination of the process, I don't want
> to spend an extra minute on restarting it beyond the catch-up itself.
> For me, taking a fresh backup (not only copying the backup data but
> also the checkpoint forced by pg_start_backup) is an expensive
> operation.

As I said: "I'm happy if that whole feature is added."

You scare me that you see failover as sufficiently frequent that being
without one of the servers for an extra 60 seconds during a failover is
a problem, and then say you're not going to add the feature after all.
I really don't understand. If it's important, add the feature, the
whole feature that is. If not, don't.

My expectation is that most failovers are serious ones: the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?

> Of course, since I'm not planning to tackle that problem in 8.4,

If you change your mind, having it in 8.4 would be good.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hello Emmanuel,

Emmanuel Cecchet wrote:
> What Bettina calls the Lock Phase in
> http://www.cs.mcgill.ca/~kemme/papers/vldb00.pdf is actually a
> certification.

Aha. Hm.. that has gone since Postgres-R (SI) and doesn't exist anymore
in my current version either (so far called Postgres-R (8)). Most of
what the certifier does (ordering of write sets) is handled by the GCS;
everything else (i.e. what Tashkent refers to as write-write conflicts)
happens within the database system itself using MVCC.

> You can find more references to certification protocols in
> http://gorda.di.uminho.pt/download/reports/gapi.pdf

Thank you for that pointer. It seems the term "certify" irritated me
because in my mind it is much more tied to public key encryption and
the like.

> I would also recommend the work of Sameh on Tashkent and Tashkent+
> that was based on Postgres:

Thanks again. I've read the first one, which confirmed that I'm on the
right track with what I'm doing with Postgres-R (8). I'm preparing to
relieve the single replicas of (most of) the WAL logging and instead
apply separate change-set or write-set logging. That seems to be the
main achievement of Tashkent. Its savings are pretty obvious, IMO,
because it heavily reduces the overall amount of I/O operations.

> > What do you mean by "reliability issues"?
>
> These approaches usually require an atomic broadcast primitive that is
> usually fragile (limited scalability, hard to tune failure
> timeouts, ...).

I didn't have many reliability issues with ensemble, appia or spread so
far, although I admit I didn't ever run any of these in production.
Performance is certainly an issue, yes.

> Most prototype implementations have the load balancer and/or the
> certifier as a SPOF (single point of failure). Building reliability
> for these components will come with a significant performance penalty.

That's a point, yeah. There's always a compromise between performance
and reliability. And more often than not, the third aspect complicating
the matter even further is cost.

> > What limitations are you speaking of here?
>
> Oftentimes DDL support is very limited.

Agreed. My Postgres-R version doesn't support any of those yet. BTW,
that's one of the cases where (fully) synchronous replication is more
efficient: because DDL commands very often conflict with other
transactions, it's better to use pessimistic locking.

> Non-transactional objects like sequences are not captured.

Postgres-R (8) partly covers sequences already. It uses atomic
broadcasts (independent from change set collection or multi-casting).
An optional per-node caching of sequence numbers helps reduce network
latency for sequence increments.

> Session or environment variables are not necessarily propagated.
> Support of temp tables varies between databases, which makes it hard
> to support them properly in a generic way.
> Well, I guess everyone has a story on some limitations they have found
> with some database replication technology, especially when a user
> expects a cluster to behave like a single database instance.

Certainly, yes.

> Happy holidays,

Thanks, same to you!

Regards

Markus Wanner
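For readers unfamiliar with the per-node sequence caching mentioned above,
here is a rough C sketch of the general idea; it is not Postgres-R source
code, and gcs_reserve_sequence_block is an assumed GCS primitive. A node
reserves a block of values through one totally-ordered broadcast, then serves
increments locally until the block is exhausted.

    #include <stdint.h>

    #define CACHE_SIZE 32

    /* assumed GCS primitive: totally ordered delivery to all nodes,
     * returning the first value of the block granted to this node */
    extern int64_t gcs_reserve_sequence_block(const char *seqname, int n);

    typedef struct SeqCache
    {
        const char *seqname;
        int64_t     next;      /* next value to hand out locally */
        int64_t     limit;     /* first value beyond our reserved block */
    } SeqCache;

    static int64_t
    cached_nextval(SeqCache *cache)
    {
        if (cache->next >= cache->limit)
        {
            /* block exhausted: one network round per CACHE_SIZE values */
            cache->next = gcs_reserve_sequence_block(cache->seqname,
                                                     CACHE_SIZE);
            cache->limit = cache->next + CACHE_SIZE;
        }
        return cache->next++;
    }

The trade-off is the familiar one from CACHE on ordinary sequences: values
handed out by different nodes are interleaved and may leave gaps.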
Simon Riggs wrote:
> You scare me that you see failover as sufficiently frequent that being
> without one of the servers for an extra 60 seconds during a failover
> is a problem, and then say you're not going to add the feature after
> all. I really don't understand. If it's important, add the feature,
> the whole feature that is. If not, don't.
>
> My expectation is that most failovers are serious ones: the primary
> system is down and not coming back very fast. Your worries seem to
> come from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

Hi Simon:

I agree with most of your criticism of the "fail over only" approach,
but I don't agree that failover frequency should really impact
expectations for the failed system to return to service.

I see "soft" fails (*not* serious ones) as potentially common:
somewhere on the network something went down or some packet was lost,
and the system took a few too many seconds to respond. My expectation
is that the node can quickly be detected as out of service and removed
from the pool, and that when the situation is resolved (often
automatically, outside of my control) it automatically "catches up" and
is put back into the pool.

Having to run some other process such as rsync seems unreliable, as we
already have a mechanism for streaming the data. All that is missing is
streaming from an earlier point in time to catch up efficiently and
reliably.

I think I'm talking more about the complete solution, though, which is
in line with what you are saying? :-)

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Hi,

On Wed, Dec 24, 2008 at 12:38 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Perhaps, but why do you say that?

Since you often pointed out that taking a backup is not a problem
because of incremental backup (e.g. rsync), I just thought so.

> I've not blocked you from adding anything useful to Postgres.

Yes, I see.

> You scare me that you see failover as sufficiently frequent that being
> without one of the servers for an extra 60 seconds during a failover
> is a problem, and then say you're not going to add the feature after
> all. I really don't understand. If it's important, add the feature,
> the whole feature that is. If not, don't.

Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
should rethink the question: "Why does the failed server always need a
fresh backup?" We discussed this previously and concluded that it
should be done next time:
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

> My expectation is that most failovers are serious ones: the primary
> system is down and not coming back very fast. Your worries seem to
> come from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

As you say, not *all* failovers are serious ones. I think that a user
would choose the most convenient restart method according to his or her
situation (come back immediately? need careful diagnosis?).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
> should rethink the question: "Why does the failed server always need a
> fresh backup?" We discussed this previously and concluded that it
> should be done next time:
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

We might ask why pg_start_backup() needs to perform a checkpoint,
though, since you have remarked that is a problem also.

The answer is that it doesn't really need to; we just need to be
certain that archiving has been running since whenever we choose as the
start time. So we could easily just use the last normal checkpoint
time, as long as we had some way of tracking the archiving.

ISTM we can solve the checkpoint problem more easily, and it would
potentially save much more time than "tuning rsync for Postgres", which
is what the other idea amounted to. So I do see a solution that is both
better and more quickly achievable for 8.4.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Wed, Dec 24, 2008 at 2:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:
>
>> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
>> should rethink the question: "Why does the failed server always need
>> a fresh backup?" We discussed this previously and concluded that it
>> should be done next time:
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php
>
> We might ask why pg_start_backup() needs to perform a checkpoint,
> though, since you have remarked that is a problem also.
>
> The answer is that it doesn't really need to; we just need to be
> certain that archiving has been running since whenever we choose as
> the start time. So we could easily just use the last normal checkpoint
> time, as long as we had some way of tracking the archiving.
>
> ISTM we can solve the checkpoint problem more easily, and it would
> potentially save much more time than "tuning rsync for Postgres",
> which is what the other idea amounted to. So I do see a solution that
> is both better and more quickly achievable for 8.4.

Sounds good. I agree that pg_start_backup basically doesn't need a
checkpoint. But for full_page_writes == off, we probably cannot get rid
of it. Even if full_page_writes == on, since we cannot tell whether all
the indispensable full pages were written after the last checkpoint,
pg_start_backup must do a checkpoint with forcePageWrites = on.

The problem is that an online backup itself is unsafe: even if there is
no disk failure (i.e. the normal case), we can easily produce a partial
write in an online backup. So we always need full pages when recovering
from an online backup, and therefore pg_start_backup always needs a
checkpoint with forcePageWrites = on. I think that we probably have to
track the history of full_page_writes in order to get rid of the
checkpoint in pg_start_backup.

On the other hand, the data after a crash other than a media crash is
"safe". Currently, we can recover it without full-page writes, as in
the simple crash recovery case. I think that we can use that for
archive recovery as well, because there isn't really any distinction
between the two. I've not found the corner case yet. Have you?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Mon, Dec 22, 2008 at 1:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Not so simple.
>
> At least the primary has to additionally maintain the byte position
> the standby has already fsynced. The main difference from the current
> patch is whether the standby fsyncs the logfile when it fills, even if
> you don't choose #4 (fsync). In order to avoid having to go back and
> re-open prior logfiles when an fsync request comes along later, we
> would need to ignore the sync mode and make the standby fsync the
> logfile when it fills. This would degrade performance periodically. Is
> this acceptable?
>
> I think there are four choices. Which do you prefer?
>
> 1) Accept the above change.
> 2) Go back and re-open prior logfiles when an fsync request comes
>    along.
> 3) Stop the sync control by the primary and leave it to the standby.
> 4) Add a new option to specify whether to permit optimistic fsync;
>    this option makes the standby fsync only the current logfile when
>    an fsync request comes along (don't go back and re-open prior
>    logfiles).
>
> 2) would cause another performance degradation. 4) would furthermore
> confuse users about setting a sync mode. So, I prefer 3), though I'm
> sorry for digging up the discussion about transaction control. Please
> feel free to comment!

5) Only allow optimistic fsync.

I'm going to adopt 5) for the next patch, at least for a while.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
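A rough C sketch of what option 5 ("optimistic fsync") means on the standby
side; this is not the actual patch, and names like walrcv_file are assumptions
for illustration. When the primary requests an fsync, the standby syncs only
the WAL segment file it currently has open; prior segments were already synced
when they filled, so they are never re-opened.

    #include <unistd.h>
    #include <stdint.h>

    static int      walrcv_file = -1;     /* fd of the open segment */
    static uint64_t walrcv_synced_upto;   /* position durable on disk */

    /* called when the current segment fills up */
    static void
    segment_full(uint64_t segment_end)
    {
        /* always make a filled segment durable before moving on, so a
         * later fsync request never has to look at older files */
        fsync(walrcv_file);
        close(walrcv_file);
        walrcv_synced_upto = segment_end;
        /* ... open the next segment into walrcv_file ... */
    }

    /* called when the primary's message stream asks for durability */
    static void
    handle_fsync_request(uint64_t requested_upto)
    {
        if (requested_upto <= walrcv_synced_upto)
            return;             /* already durable, nothing to do */

        fsync(walrcv_file);     /* only the current file, optimistically */
        /* ... update walrcv_synced_upto from the current write position
         * and acknowledge back to the primary ... */
    }

The cost Fujii-san mentions is visible in segment_full(): every segment switch
pays an fsync even when the configured sync mode would not otherwise require
one.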
On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:

> > We might ask why pg_start_backup() needs to perform a checkpoint,
> > though, since you have remarked that is a problem also.
> >
> > The answer is that it doesn't really need to; we just need to be
> > certain that archiving has been running since whenever we choose as
> > the start time. So we could easily just use the last normal
> > checkpoint time, as long as we had some way of tracking the
> > archiving.
> >
> > ISTM we can solve the checkpoint problem more easily, and it would
> > potentially save much more time than "tuning rsync for Postgres",
> > which is what the other idea amounted to. So I do see a solution
> > that is both better and more quickly achievable for 8.4.
>
> Sounds good. I agree that pg_start_backup basically doesn't need a
> checkpoint. But for full_page_writes == off, we probably cannot get
> rid of it. Even if full_page_writes == on, since we cannot tell
> whether all the indispensable full pages were written after the last
> checkpoint, pg_start_backup must do a checkpoint with
> forcePageWrites = on.

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since the last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
set to false anytime someone resets full_page_writes or
archive_command. If the flag is set && full_page_writes = on, then we
skip the checkpoint entirely and use the values from the last
checkpoint. (See the sketch following this message.)

* My "infra" patch also had a modified version of pg_start_backup()
that allowed you to specify IMMEDIATE checkpoint or not. Reworking that
seems a waste of time, and I want to listen to everybody else now and
change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave
it there.

Can you work on those also?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
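A minimal C sketch of the first change above, not the committed patch; the
struct and function names are assumptions. A flag starts true at every
checkpoint and is cleared whenever full_page_writes or archive_command
changes; pg_start_backup() may then reuse the previous checkpoint's REDO
location instead of forcing a new checkpoint.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BackupCtlData
    {
        bool     fpw_stable_since_ckpt; /* no GUC reset since checkpoint */
        uint64_t last_ckpt_redo;        /* REDO ptr of last checkpoint */
    } BackupCtlData;

    static BackupCtlData BackupCtl;

    extern bool     fullPageWrites;            /* current GUC setting */
    extern uint64_t request_backup_checkpoint(void);  /* hypothetical */

    /* at every checkpoint */
    static void
    checkpoint_hook(uint64_t redo_ptr)
    {
        BackupCtl.last_ckpt_redo = redo_ptr;
        BackupCtl.fpw_stable_since_ckpt = true;
    }

    /* whenever full_page_writes or archive_command is changed */
    static void
    guc_reset_hook(void)
    {
        BackupCtl.fpw_stable_since_ckpt = false;
    }

    /* pg_start_backup(): returns the checkpoint REDO location to use */
    static uint64_t
    start_backup_checkpoint(void)
    {
        if (BackupCtl.fpw_stable_since_ckpt && fullPageWrites)
            return BackupCtl.last_ckpt_redo;   /* skip the checkpoint */

        /* otherwise request a checkpoint run with forcePageWrites = on
         * and use its REDO location */
        return request_backup_checkpoint();
    }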
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Yes, OK. So I think it would only work when full_page_writes = on, and
> has been on since the last checkpoint. So two changes:
>
> * We just need a boolean that starts at true every checkpoint and gets
> set to false anytime someone resets full_page_writes or
> archive_command. If the flag is set && full_page_writes = on, then we
> skip the checkpoint entirely and use the values from the last
> checkpoint.

Sounds good. pg_start_backup on the standby (which you are probably
planning?) also needs this logic? If so, resetting full_page_writes or
archive_command should generate its own xlog record.

I have another thought: should we forbid resetting archive_command
during an online backup? Currently we can do that. If we don't need to
allow it, we also don't need to track the reset of archiving for a fast
pg_start_backup.

> * My "infra" patch also had a modified version of pg_start_backup()
> that allowed you to specify IMMEDIATE checkpoint or not. Reworking
> that seems a waste of time, and I want to listen to everybody else now
> and change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and
> leave it there.
>
> Can you work on those also?

Umm.. I'm busy. Of course, I will try it if no one else raises his or
her hand. But I'd like to put coding the core of synch rep ahead of
this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Yes, OK. So I think it would only work when full_page_writes = on,
>> and has been on since the last checkpoint. So two changes:
>>
>> * We just need a boolean that starts at true every checkpoint and
>> gets set to false anytime someone resets full_page_writes or
>> archive_command. If the flag is set && full_page_writes = on, then we
>> skip the checkpoint entirely and use the values from the last
>> checkpoint.
>
> Sounds good.

I attached a self-contained patch to skip the checkpoint at
pg_start_backup.

> pg_start_backup on the standby (which you are probably planning?) also
> needs this logic? If so, resetting full_page_writes or archive_command
> should generate its own xlog record.

The patch doesn't handle this yet.

> I have another thought: should we forbid resetting archive_command
> during an online backup? Currently we can do that. If we don't need to
> allow it, we also don't need to track the reset of archiving for a
> fast pg_start_backup.

Not handled yet either.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:

> On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> Yes, OK. So I think it would only work when full_page_writes = on,
> >> and has been on since the last checkpoint. So two changes:
> >>
> >> * We just need a boolean that starts at true every checkpoint and
> >> gets set to false anytime someone resets full_page_writes or
> >> archive_command. If the flag is set && full_page_writes = on, then
> >> we skip the checkpoint entirely and use the values from the last
> >> checkpoint.
> >
> > Sounds good.
>
> I attached a self-contained patch to skip the checkpoint at
> pg_start_backup.

Good.

Can we change to IMMEDIATE when we do need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if-test, AFAICS.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi Markus,

> I didn't have many reliability issues with ensemble, appia or spread
> so far, although I admit I didn't ever run any of these in production.
> Performance is certainly an issue, yes.

I may suggest another reading; even though it is a bit dated, most of
the results still apply:
http://jmob.objectweb.org/jgroups/JGroups-middleware-2004.pdf

The baseline is that if you use UDP multicast, you need a dedicated
switch, and the tuning is a nightmare. I discussed these issues with
the developers of Spread and they have no real magic. TCP seems a more
reliable alternative (especially for predictable performance), but the
TCP timeouts are also tricky to tune depending on the platform. We
worked quite a bit with Nuno around Appia in the context of Sequoia,
and performance can be outstanding when properly tuned, or absolutely
awful if some default values are wrong.

The chaotic behavior of a GCS under stress quickly compromises the
reliability of the replication system, and admission control on UDP
multicast has no good solution so far. It's just a heads-up on what is
awaiting you in production when the system is stressed. There is no
good solution so far besides good admission control on top of the GCS
(in the application).

I am now off for the holidays.

Cheers,
Emmanuel

-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com
Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Can we change to IMMEDIATE when we do need the checkpoint?

Perhaps yes, though the current patch doesn't handle it. I'm not sure
we really need the feature. Yes, as you say, I'd like to also listen to
everybody else.

> What is bkpCount for?

So far, the name of a backup history file consists only of the
checkpoint redo location. But with this patch, since several backups
can use the same checkpoint, a backup history file could unfortunately
be overwritten. So, I introduced bkpCount as an ID for the backups that
use the same checkpoint.

> I think we should discuss whatever that is for separately. It isn't
> used in any if-test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
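To illustrate the bkpCount idea, here is a hypothetical sketch of the file
naming, not the actual patch; the exact layout of history file names is an
assumption for illustration. Because several pg_start_backup() calls may now
share one checkpoint, a per-checkpoint counter is appended so a later backup
cannot overwrite an earlier backup's history file.

    #include <stdio.h>
    #include <stdint.h>

    static unsigned bkpCount = 0;   /* reset to 0 at each new checkpoint */

    static void
    BackupHistoryFileName(char *buf, size_t len,
                          unsigned tli, unsigned logid, unsigned seg,
                          unsigned startoff)
    {
        /* e.g. 00000001000000120000003A.0051C8.1.backup */
        snprintf(buf, len, "%08X%08X%08X.%06X.%u.backup",
                 tli, logid, seg, startoff, ++bkpCount);
    }

Two backups started under the same checkpoint would then produce ".1.backup"
and ".2.backup" files instead of colliding on one name.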