Thread: Synchronous Log Shipping Replication
Hi,

At PGCon 2008, I proposed synchronous log shipping replication. Sorry for the late posting, but I'd like to start the discussion about its implementation now. http://www.pgcon.org/2008/schedule/track/Horizontal%20Scaling/76.en.html

First of all, I'm not planning to put the prototype which I demoed at PGCon into core directly, because of:
- portability issues (it uses message queues and multiple threads)
- too much dependency on Heartbeat

Since the prototype is a useful reference for the implementation, I plan to open it ASAP. But, I'm sorry - it will still take a month to open it.

Pavan re-designed the sync replication based on the prototype and I posted that design doc on the wiki. Please check it if you are interested in it. http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

This design is too huge to implement in one piece. In order to enhance the extensibility of postgres, I'd like to divide the sync replication into minimum hooks and some plugins, and to develop them separately. Plugins for the sync replication are planned to be available at the time of the 8.4 release.

In my design, WAL sending is achieved as follows by WALSender, a new process which I introduce:

1) On COMMIT, the backend requests WALSender to send WAL.
2) WALSender reads WAL from the WAL buffers and sends it to the slave.
3) WALSender waits for the response from the slave and replies to the backend.

I propose two hooks for WAL sending.

WAL-writing hook
----------------
This hook is for the backend to communicate with WALSender. The WAL-writing hook intercepts the write system call in XLogWrite; that is, the backend requests WAL sending whenever write is called.

The WAL-writing hook is also available for other uses, e.g. software RAID (writing WAL into two files for durability).

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following three ideas for how to introduce WALSender. The required hook differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs a WALWriter hook which intercepts WALWriter literally. WALWriter stops the local WAL write and focuses on WAL sending. This idea is very simple, but I don't see a use for the WALWriter hook other than WAL sending.

b) Use a new background process as WALSender

This idea needs a background-process hook which enables users to define new background processes. I think the design of this hook resembles that of the rmgr hook proposed by Simon. I define a table like RmgrTable for registering some functions (e.g. a main function, an exit function, ...) for operating a background process. The postmaster calls the functions from the table as appropriate, and manages the start and end of the background process. ISTM that there are many uses for this hook, e.g. a performance monitoring process like statspack.

c) Use one backend as WALSender

In this idea, the slave calls a user-defined function which takes charge of WAL sending via SQL, e.g. "SELECT pg_walsender()". Compared with the other ideas, it's easy to implement WALSender because the postmaster handles the establishment and authentication of the connection. But this SQL causes a long transaction which prevents vacuum, so this idea needs an idle-state hook which executes the plugin before a transaction starts. I don't see a use for this hook other than WAL sending either.

Which idea should we adopt? Comments welcome.

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
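For illustration, here is a rough sketch of what the proposed WAL-writing hook could look like around the write call in XLogWrite. The hook type and all the names below are assumptions made up for this sketch, not an existing PostgreSQL API, and the fragment assumes it lives inside the backend (xlog.c):

/* Hypothetical hook type; all names here are illustrative only. */
typedef void (*WALWriteHook_type) (const char *data, Size nbytes,
                                   XLogRecPtr startptr);

WALWriteHook_type WALWriteHook = NULL;   /* a plugin sets this in _PG_init() */

/* Sketch of the place in XLogWrite() where a run of WAL buffer pages
 * is written out to the current WAL segment file. */
static void
write_wal_range(int fd, const char *from, Size nbytes, XLogRecPtr startptr)
{
    if (write(fd, from, nbytes) != (ssize_t) nbytes)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not write to WAL file")));

    /*
     * Let the plugin see the same byte range, e.g. to hand it to a
     * WALSender process, or to write a second copy of the WAL for the
     * software-RAID-like use mentioned above.
     */
    if (WALWriteHook != NULL)
        WALWriteHook(from, nbytes, startptr);
}

A plugin could either ship the bytes synchronously from the hook or merely record the new write position and wake a separate sender process.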
Hi, Fujii Masao wrote: > Pavan re-designed the sync replication based on the prototype > and I posted that design doc on wiki. Please check it if you > are interested in it. > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

I've read that wiki page and allow myself to comment from a Postgres-R developer's perspective ;-)

R1: "without ... any negative performance overhead"? For fully synchronous replication, that's clearly not possible. I guess that applies only to async WAL shipping.

NR3: who is supposed to do failure detection and manage automatic failover? How does integration with such an additional tool work?

I got distracted by the SBY and ACT abbreviations. Why abbreviate standby or active at all? It's not like we don't already have enough three-letter acronyms, but those stand for rather more complex terms than single words.

Standby Bootstrap: "stopping the archiving at the ACT" doesn't prevent overwriting WAL files in pg_xlog. It just stops archiving a WAL file before it gets overwritten - which clearly doesn't solve the problem here.

How is communication done? "Serialization of WAL shipping" had better not mean serialization on the network, i.e. the WAL Sender Process should be able to await acknowledgment of multiple WAL packets in parallel, otherwise the interconnect latency might turn into a bottleneck.

How is communication done? What happens if the link between the active and standby goes down? Or if it's temporarily unavailable for some time?

The IPC mechanism reminds me a lot of what I did for Postgres-R, which also has a central "replication manager" process which receives changesets from multiple backends. I've implemented an internal messaging mechanism based on shared memory and signals, using only Postgres methods. It allows arbitrary processes to send messages to each other by process id.

Moving the WAL Sender and WAL Receiver processes under the control of the postmaster certainly sounds like a good thing. After all, those are fiddling with Postgres internals.

> This design is too huge. In order to enhance the extensibility > of postgres, I'd like to divide the sync replication into > minimum hooks and some plugins and to develop it, respectively. > Plugins for the sync replication plan to be available at the > time of 8.4 release.

Hooks again? I bet you all know by now that my excitement for hooks has always been pretty narrow. ;-)

> In my design, WAL sending is achieved as follow by WALSender. > WALSender is a new process which I introduce. > > 1) On COMMIT, backend requests WALSender to send WAL. > 2) WALSender reads WAL from walbuffers and send it to slave. > 3) WALSender waits for the response from slave and replies > backend. > > I propose two hooks for WAL sending. > > WAL-writing hook > ---------------- > This hook is for backend to communicate with WALSender. > WAL-writing hook intercepts write system call in XLogWrite. > That is, backend requests WAL sending whenever write is called. > > WAL-writing hook is available also for other uses e.g. > Software RAID (writes WAL into two files for durability). > > Hook for WALSender > ------------------ > This hook is for introducing WALSender. There are the following > three ideas of how to introduce WALSender. A required hook > differs by which idea is adopted. > > a) Use WALWriter as WALSender > > This idea needs WALWriter hook which intercepts WALWriter > literally. WALWriter stops the local WAL write and focuses on > WAL sending. This idea is very simple, but I don't think of > the use of WALWriter hook other than WAL sending. > > b) Use new background process as WALSender > > This idea needs background-process hook which enables users > to define new background processes. I think the design of this > hook resembles that of rmgr hook proposed by Simon. I define > the table like RmgrTable. It's for registering some functions > (e.g. main function and exit...) for operating a background > process. Postmaster calls the function from the table suitably, > and manages a start and end of background process. ISTM that > there are many uses in this hook, e.g. performance monitoring > process like statspack. > > c) Use one backend as WALSender > > In this idea, slave calls the user-defined function which > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > Compared with other ideas, it's easy to implement WALSender > because the postmaster handles the establishment and authentication > of connection. But, this SQL causes a long transaction which > prevents vacuum. So, this idea needs idle-state hook which > executes plugin before transaction starts. I don't think of > the use of this hook other than WAL sending either.

The above cited wiki page sounds like you've already decided for b).

I'm unclear on what you want hooks for. If additional processes get integrated into Postgres, those certainly need to get integrated very much like we integrated other auxiliary processes. I wouldn't call that 'hooking', but YMMV.

Regards Markus Wanner
On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote: > Pavan re-designed the sync replication based on the prototype > and I posted that design doc on wiki. Please check it if you > are interested in it. > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

It's good to see the detailed design, many thanks. I will begin looking at technical details next week.

> This design is too huge. In order to enhance the extensibility > of postgres, I'd like to divide the sync replication into > minimum hooks and some plugins and to develop it, respectively. > Plugins for the sync replication plan to be available at the > time of 8.4 release.

What is Core's commentary on this plan?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Markus Wanner wrote: > > Hook for WALSender > > ------------------ > > This hook is for introducing WALSender. There are the following > > three ideas of how to introduce WALSender. A required hook > > differs by which idea is adopted. > > > > a) Use WALWriter as WALSender > > > > This idea needs WALWriter hook which intercepts WALWriter > > literally. WALWriter stops the local WAL write and focuses on > > WAL sending. This idea is very simple, but I don't think of > > the use of WALWriter hook other than WAL sending.

The problem with this approach is that you are not writing WAL to the disk _while_ you are, in parallel, sending WAL to the slave; I think this is useful for performance reasons in synchronous replication.

> > b) Use new background process as WALSender > > > > This idea needs background-process hook which enables users > > to define new background processes. I think the design of this > > hook resembles that of rmgr hook proposed by Simon. I define > > the table like RmgrTable. It's for registering some functions > > (e.g. main function and exit...) for operating a background > > process. Postmaster calls the function from the table suitably, > > and manages a start and end of background process. ISTM that > > there are many uses in this hook, e.g. performance monitoring > > process like statspack.

I think starting/stopping a process for each WAL send is too much overhead.

> > c) Use one backend as WALSender > > > > In this idea, slave calls the user-defined function which > > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > > Compared with other ideas, it's easy to implement WALSender > > because the postmaster handles the establishment and authentication > > of connection. But, this SQL causes a long transaction which > > prevents vacuum. So, this idea needs idle-state hook which > > executes plugin before transaction starts. I don't think of > > the use of this hook other than WAL sending either. > > The above cited wiki page sounds like you've already decided for b).

I assumed that there would be a background process like bgwriter that would be notified during a commit and send the appropriate WAL files to the slave.

> I'm unclear on what you want hooks for. If additional processes get > integrated into Postgres, those certainly need to get integrated very > much like we integrated other auxiliary processes. I wouldn't call that > 'hooking', but YMMV.

Yea, I am unclear how this is going to work using simple hooks.

It sounds like Fujii-san is basically saying they can only get the hooks done for 8.4, not the actual solution. But, as I said above, I am unclear how a hook solution would even work long-term; I am afraid it would be thrown away once an integrated solution was developed.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote: > Markus Wanner wrote: > > > Hook for WALSender > > > ------------------ > > > This hook is for introducing WALSender. There are the following > > > three ideas of how to introduce WALSender. A required hook > > > differs by which idea is adopted. > > > > > > a) Use WALWriter as WALSender > > > > > > This idea needs WALWriter hook which intercepts WALWriter > > > literally. WALWriter stops the local WAL write and focuses on > > > WAL sending. This idea is very simple, but I don't think of > > > the use of WALWriter hook other than WAL sending. > > The problem with this approach is that you are not writing WAL to the > disk _while_ you are, in parallel, sending WAL to the slave; I think > this is useful for performance reasons in synchronous replication.

Agreed

> > > b) Use new background process as WALSender > > > > > > This idea needs background-process hook which enables users > > > to define new background processes. I think the design of this > > > hook resembles that of rmgr hook proposed by Simon. I define > > > the table like RmgrTable. It's for registering some functions > > > (e.g. main function and exit...) for operating a background > > > process. Postmaster calls the function from the table suitably, > > > and manages a start and end of background process. ISTM that > > > there are many uses in this hook, e.g. performance monitoring > > > process like statspack. > > I think starting/stopping a process for each WAL send is too much > overhead.

I would agree with that, but I don't think that was being suggested, was it? See later.

> > > c) Use one backend as WALSender > > > > > > In this idea, slave calls the user-defined function which > > > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > > > Compared with other ideas, it's easy to implement WALSender > > > because the postmaster handles the establishment and authentication > > > of connection. But, this SQL causes a long transaction which > > > prevents vacuum. So, this idea needs idle-state hook which > > > executes plugin before transaction starts. I don't think of > > > the use of this hook other than WAL sending either. > > > > The above cited wiki page sounds like you've already decided for b). > > I assumed that there would be a background process like bgwriter that > would be notified during a commit and send the appropriate WAL files to > the slave.

ISTM that this last paragraph is actually what was meant by option b). I think it would work the other way around though: the WALSender would send continuously, and backends may choose to wait for it to reach a certain LSN, or not. WALWriter really should work this way too.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote: > b) Use new background process as WALSender > > This idea needs background-process hook which enables users > to define new background processes. I think the design of this > hook resembles that of rmgr hook proposed by Simon. I define > the table like RmgrTable. It's for registering some functions > (e.g. main function and exit...) for operating a background > process. Postmaster calls the function from the table suitably, > and manages a start and end of background process. ISTM that > there are many uses in this hook, e.g. performance monitoring > process like statspack.

Sorry, but the comparison with the rmgr hook is mistaken. The rmgr hook exists only within the Startup process and I go to some lengths to ensure it is never called in normal backends. So it has got absolutely nothing to do with generating WAL messages (existing/new/modified) or sending them since it doesn't even exist during normal processing.

The intention of the rmgr hook is to allow WAL messages to be manipulated in new ways in recovery mode. It isn't a sufficient change to implement replication, and the functionality is orthogonal to streaming WAL replication.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote: > > I'm unclear on what you want hooks for. If additional processes get > > integrated into Postgres, those certainly need to get integrated very > > much like we integrated other auxiliary processes. I wouldn't call that > > 'hooking', but YMMV. > > Yea, I am unclear how this is going to work using simple hooks. > > It sounds like Fujii-san is basically saying they can only get the hooks > done for 8.4, not the actual solution. But, as I said above, I am > unclear how a hook solution would even work long-term; I am afraid it > would be thrown away once an integrated solution was developed.

It will be interesting to have various hooks in streaming WAL code to implement various additional features for enterprise integration. But that doesn't mean I support hooks in every/all places.

For me, the proposed hook amounts to "we've only got time to implement 2/3 of the required features, so we'd like to circumvent the release cycle by putting in a hook and providing the code later". For me, hooks are for adding additional features, not for making up for the lack of completed code. It's kinda hard to say "we now have WAL streaming" without the streaming bit.

We need either a fully working WAL streaming feature, or we wait until next release. We probably need to ask if there is anybody willing to complete the middle part of this feature so we can get it into 8.4. It would be sensible to share the code we have now, so we can see what remains to be implemented. I just committed to delivering Hot Standby for 8.4, so I can't now get involved to deliver this code.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Bruce Momjian <bruce@momjian.us> wrote: > > > b) Use new background process as WALSender > > > > > > This idea needs background-process hook which enables users > > > to define new background processes > I think starting/stopping a process for each WAL send is too much > overhead.

Yes, of course slow. But I guess it is the only way to share one socket in all backends. Postgres is not a multi-threaded architecture, so each backend should use dedicated connections to send WAL buffers. 300 backends require 300 connections for each slave... it's not good at all.

> It sounds like Fujii-san is basically saying they can only get the hooks > done for 8.4, not the actual solution.

No! He has an actual solution in his prototype ;-) It is very similar to b) and the overhead was not so bad. It's not clean enough to be part of postgres, though.

Are there any better ideas to share one socket connection between backends (and bgwriter)? The connections could be established after fork() from postmaster, and the number of them could be two or more. This is one of the most complicated parts of synchronous log shipping. A process-switching approach like b) is just one idea for it.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hi, ITAGAKI Takahiro wrote: > Are there any better ideas to share one socket connection between > backends (and bgwriter)? The connections could be established after > fork() from postmaster, and the number of them could be two or more. > This is one of the most complicated parts of synchronous log shipping. > A process-switching approach like b) is just one idea for it.

I fear I'm repeating myself, but I've had the same problem for Postgres-R and solved it with an internal message passing infrastructure which I've simply called imessages. It requires only standard Postgres shared memory, signals and locking and should thus be pretty portable.

In simple benchmarks, it's not quite as efficient as unix pipes, but it doesn't require as many file descriptors, is independent of the parent-child relations of processes, maintains message borders, and it is more portable (I hope). It could certainly be improved WRT efficiency and could theoretically even beat Unix pipes, because it involves less copying of data and fewer syscalls.

It has not been reviewed or commented on much. I'd still appreciate that.

Regards Markus Wanner
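For readers who haven't seen the imessages patch, its interface is roughly of the following shape; the exact names and signatures in the actual patch may differ, so treat this purely as an illustrative sketch of the described design (shared memory messages addressed by process id, delivery via signal):

/* A message lives in shared memory; the payload follows the header. */
typedef struct IMessage
{
    pid_t       sender;         /* backend that created the message */
    int         type;           /* application-defined message type */
    Size        size;           /* payload size in bytes */
} IMessage;

/* Allocate a message (plus payload space) in shared memory. */
extern IMessage *IMessageCreate(int type, Size payload_size);

/* Queue the message for the recipient process and signal it. */
extern void IMessageActivate(IMessage *msg, pid_t recipient);

/* Called by the recipient, typically after being signalled, to fetch
 * the next message addressed to it; returns NULL if there is none. */
extern IMessage *IMessageCheck(void);

With something like this, a backend that has just written its commit record could create a "ship up to this LSN" request, activate it towards the WAL sender's pid, and sleep until the sender answers with a confirmation message.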
Markus Wanner <markus@bluegap.ch> wrote: > ITAGAKI Takahiro wrote: > > Are there any better ideas to share one socket connection between > > backends (and bgwriter)? > > I fear I'm repeating myself, but I've had the same problem for > Postgres-R and solved it with an internal message passing infrastructure > which I've simply called imessages. It requires only standard Postgres > shared memory, signals and locking and should thus be pretty portable.

Imessages serve as a useful reference, but they are only one of the details of the issue. I can break down the issue into three parts:

1. Is the process-switching approach the best way to share one socket? Both Postgres-R and the log-shipping prototype use the approach now. Can I assume there is no objection here?

2. If 1 is reasonable, how should we add a new WAL sender process? Just add a new process using a core patch? Merge it into the WAL writer? Or consider a framework for adding any user-defined auxiliary process?

3. If 1 is reasonable, what should we use for the process-switching primitive? Postgres-R uses signals and locking, and the log-shipping prototype uses multi-threads and POSIX message queues now.

Signals and locking are a possible choice for 3, but I want to use a better approach if there is one. Faster is always better. I guess we could invent a new semaphore-like primitive at the same layer as LWLocks using spinlock and PGPROC directly...

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hi, ITAGAKI Takahiro wrote: > 1. Is the process-switching approach the best way to share one socket? > Both Postgres-R and the log-shipping prototype use the approach now. > Can I assume there is no objection here?

I don't see any appealing alternative. The postmaster certainly shouldn't need to worry about any such socket for replication. Threading falls pretty flat for Postgres. So the socket must be held by one of the child processes of the Postmaster.

> 2. If 1 is reasonable, how should we add a new WAL sender process? > Just add a new process using a core patch?

Seems feasible to me, yes.

> Merge it into the WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk and WAL shipping via network. That does not sound appealing to me.

> Or consider a framework for adding any user-defined auxiliary process?

What for? What do you miss in the existing framework?

> 3. If 1 is reasonable, what should we use for the process-switching > primitive? > Postgres-R uses signals and locking, and the log-shipping prototype > uses multi-threads and POSIX message queues now.

AFAIK message queues are problematic WRT portability. At least Postgres doesn't currently use them, and introducing dependencies on those might lead to problems, but I'm not sure. Others certainly know more about the issues involved. A multi-threaded approach is certainly out of bounds, at least within the Postgres core code.

> Signals and locking are a possible choice for 3, but I want to use a better > approach if there is one. Faster is always better.

I think the approach can reach better throughput than POSIX message queues or unix pipes, because of the mentioned savings in copying around between system and application memory. However, that hasn't been proven yet.

> I guess we could invent a new semaphore-like primitive at the same layer > as LWLocks using spinlock and PGPROC directly...

Sure, but in what way would that differ from what I do with imessages?

Regards Markus Wanner
On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Merge into WAL writer? > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > WAL shipping via network. That does not sound appealing to me.

That depends on the order of WAL writing and WAL shipping. How about the following order?

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

>> I guess we could invent a new semaphore-like primitive at the same layer >> as LWLocks using spinlock and PGPROC directly... > > Sure, but in what way would that differ from what I do with imessages?

Performance ;) The timing of a process receiving a signal depends on the kernel's scheduler. The scheduler does not always handle a signal immediately.

Regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
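Spelled out as code, the ordering above would make the backend-side commit path look roughly like the sketch below. The signalling functions are hypothetical placeholders invented for this illustration, not existing routines:

/* Sketch of a backend committing under the write -> ship -> fsync order. */
static void
CommitAndReplicate(XLogRecPtr commitRecEnd)
{
    /* 1. Write (but do not yet fsync) WAL up to the commit record. */
    XLogWriteUpTo(commitRecEnd);        /* placeholder for the write step */

    /* 2. Wake the WAL sender and sleep until it has shipped the WAL. */
    WALSenderWakeup(commitRecEnd);      /* hypothetical */
    WaitForWALSenderAck(commitRecEnd);  /* hypothetical: step 3 happens
                                         * inside the WAL sender meanwhile */

    /* 4. Only now make the local write durable. */
    issue_xlog_fsync();                 /* existing routine inside xlog.c */
}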
Hi, Fujii Masao wrote: > 1. A backend writes WAL to disk. > 2. The backend wakes up WAL sender process and sleeps. > 3. WAL sender process does WAL shipping and wakes up the backend. > 4. The backend issues sync command. Right, that would work. But still, the WAL writer process would block during writing WAL blocks. Are there compelling reasons for using the existing WAL writer process, as opposed to introducing a new process? > The timing of the process's receiving a signal is dependent on the scheduler > of kernel. Sure, so are pipes or shmem queues. > The scheduler does not always handle a signal immediately. What exactly are you proposing to use instead of signals? Semaphores are pretty inconvenient when trying to wake up arbitrary processes or in conjunction with listening on sockets via select(), for example. See src/backend/replication/manager.c from Postgres-R for a working implementation of such a process using select() and signaling. Regards Markus Wanner
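As an aside, the usual way to let one process wait on both a network socket and signal wakeups from other backends is select() plus the classic self-pipe trick. The sketch below is a generic, self-contained illustration of that pattern; it is not claimed to be what Postgres-R's manager.c actually does:

/* Wait on a socket and on signal wakeups at the same time. */
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static int selfpipe[2];

static void
wakeup_handler(int signo)
{
    char        c = 0;

    (void) write(selfpipe[1], &c, 1);   /* async-signal-safe */
}

void
manager_loop(int sock)
{
    pipe(selfpipe);
    signal(SIGUSR1, wakeup_handler);

    for (;;)
    {
        fd_set      readfds;
        int         maxfd;
        char        buf[64];

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);
        FD_SET(selfpipe[0], &readfds);
        maxfd = (sock > selfpipe[0] ? sock : selfpipe[0]) + 1;

        if (select(maxfd, &readfds, NULL, NULL, NULL) < 0)
        {
            if (errno == EINTR)
                continue;               /* a signal interrupted us; retry */
            break;
        }
        if (FD_ISSET(selfpipe[0], &readfds))
        {
            (void) read(selfpipe[0], buf, sizeof(buf));
            /* a backend poked us: check shared memory / message queue */
        }
        if (FD_ISSET(sock, &readfds))
        {
            /* data (e.g. an acknowledgment from the peer) is available */
        }
    }
}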
On Mon, 2008-09-08 at 19:19 +0900, ITAGAKI Takahiro wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > > > b) Use new background process as WALSender > > > > > > > > This idea needs background-process hook which enables users > > > > to define new background processes > > > I think starting/stopping a process for each WAL send is too much > > overhead. > > Yes, of course slow. But I guess it is the only way to share one socket > in all backends. Postgres is not a multi-threaded architecture, > so each backend should use dedicated connections to send WAL buffers. > 300 backends require 300 connections for each slave... it's not good at all.

So... don't have individual backends do the sending. Have them wait while somebody else does it for them.

> > It sounds like Fujii-san is basically saying they can only get the hooks > > done for 8.4, not the actual solution. > > No! He has an actual solution in his prototype ;-)

The usual thing if you have a WIP patch you're not sure of is to post the patch for feedback. If you guys aren't going to post any code to the project then I'm not clear why it's being discussed here. Is this a community project or a private project?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Fujii Masao wrote: > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: > >> Merge into WAL writer? > > > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > > WAL shipping via network. That does not sound appealing to me. > > That depends on the order of WAL writing and WAL shipping. > How about the following order? > > 1. A backend writes WAL to disk. > 2. The backend wakes up WAL sender process and sleeps. > 3. WAL sender process does WAL shipping and wakes up the backend. > 4. The backend issues sync command.

I am confused why this is considered so complicated. Having individual backends doing the wal transfer to the slave is never going to work well.

I figured we would have a single WAL streamer that continues advancing forward in the WAL file, streaming to the standby. Backends would update a shared memory variable specifying how far they want the wal streamer to advance and send a signal to the wal streamer if necessary. Backends would monitor another shared memory variable that specifies how far the wal streamer has advanced.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
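As a sketch, the shared state Bruce describes could be as small as the following. The struct, field and function names are invented for illustration; XLogRecPtr, the spinlock primitives and pg_usleep are the existing PostgreSQL ones, and a real patch would sleep on a semaphore rather than poll:

/* Shared between all backends and the single WAL streamer process. */
typedef struct WalStreamerShmem
{
    slock_t     mutex;          /* protects the fields below */
    XLogRecPtr  streamRqst;     /* furthest LSN any backend wants streamed */
    XLogRecPtr  streamResult;   /* furthest LSN the streamer has sent */
    pid_t       streamerPid;    /* whom the backends should signal */
} WalStreamerShmem;

/* Backend side: ask for WAL up to 'upto' to be streamed, then wait. */
static void
WaitForWALStream(volatile WalStreamerShmem *ws, XLogRecPtr upto)
{
    bool        done = false;

    SpinLockAcquire(&ws->mutex);
    if (XLByteLT(ws->streamRqst, upto))
        ws->streamRqst = upto;          /* advance the request pointer */
    SpinLockRelease(&ws->mutex);

    kill(ws->streamerPid, SIGUSR1);     /* nudge the streamer if sleeping */

    while (!done)
    {
        SpinLockAcquire(&ws->mutex);
        done = !XLByteLT(ws->streamResult, upto);
        SpinLockRelease(&ws->mutex);
        if (!done)
            pg_usleep(1000L);           /* naive poll, for illustration only */
    }
}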
Hi, Bruce Momjian wrote: > Backends would > update a shared memory variable specifying how far they want the wal > streamer to advance and send a signal to the wal streamer if necessary. > Backends would monitor another shared memory variable that specifies how > far the wal streamer has advanced. That sounds like WAL needs to be written to disk, before it can be sent to the standby. Except maybe with some sort of mmap'ing the WAL. Regards Markus Wanner
Markus Wanner wrote: > Hi, > > Bruce Momjian wrote: > > Backends would > > update a shared memory variable specifying how far they want the wal > > streamer to advance and send a signal to the wal streamer if necessary. > > Backends would monitor another shared memory variable that specifies how > > far the wal streamer has advanced. > > That sounds like WAL needs to be written to disk, before it can be sent > to the standby. Except maybe with some sort of mmap'ing the WAL. Well, WAL is either on disk or in the wal_buffers in shared memory --- in either case, a WAL streamer can get to it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Hi,

I looked at the comments on the synchronous replication and understood that the consensus of the community is that the sync replication should be added using not hooks and plugins but core patches. If my understanding is right, I will change my development plan so that the sync replication can be put into core.

But I don't think every feature should be put into core. Of course, the high-availability features (like clustering, automatic failover, ...etc) are outside of postgres. The user who wants a whole HA solution using the sync replication must integrate postgres with clustering software like heartbeat.

WAL sending should be put into core. But I'd like to separate WAL receiving from core and provide it as a new contrib tool, because there are some users who use the sync replication only as WAL streaming. They don't want to start postgres on the slave. Of course, the slave can replay WAL by using pg_standby and the WAL receiver tool which I'd like to provide as a new contrib tool. I think a patch against the recovery code is not necessary.

I arrange the development items below:

1) Patch around XLogWrite. It enables a backend to wake up the WAL sender process at the timing of COMMIT.

2) Patch for the communication between a backend and the WAL sender process. There were some discussions about this topic. Now, I have decided to adopt imessages proposed by Markus.

3) Patch introducing the new background process which I've called WALSender. It takes charge of sending WAL to the slave.

Now, I assume that WALSender also listens for the connection from the slave, i.e. only one sender process manages multiple slaves. The relation between WALSender and a backend is 1:1, so the communication mechanism between them can be simple. As another idea, I could introduce a new listener process and fork a new WALSender for every slave. Which architecture is better? Or, should the postmaster also listen for the connection from the slave?

4) New contrib tool which I've called WALReceiver. It takes charge of receiving WAL from the master and writing it to disk on the slave.

I will submit these patches and the tool by the November commit fest at the latest.

Any comment welcome!

best regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
"Fujii Masao" <masao.fujii@gmail.com> wrote: > 3) Patch of introducing new background process which I've called > WALSender. It takes charge of sending WAL to the slave. > > Now, I assume that WALSender also listens the connection from > the slave, i.e. only one sender process manages multiple slaves. > The relation between WALSender and backend is 1:1. So, > the communication mechanism between them can be simple. I assume that he says only one backend communicates with WAL sender at a time. The communication is done during WALWriteLock is held, so other backends wait for the communicating backend on WALWriteLock. WAL sender only needs to send one signal for each time it sends WAL buffers to slave. We could be split the LWLock to WALWriterLock and WALSenderLock, but the essential point is same. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Mon, 2008-09-08 at 17:40 -0400, Bruce Momjian wrote: > Fujii Masao wrote: > > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: > > >> Merge into WAL writer? > > > > > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > > > WAL shipping via network. That does not sound appealing to me. > > > > That depends on the order of WAL writing and WAL shipping. > > How about the following order? > > > > 1. A backend writes WAL to disk. > > 2. The backend wakes up WAL sender process and sleeps. > > 3. WAL sender process does WAL shipping and wakes up the backend. > > 4. The backend issues sync command. > > I am confused why this is considered so complicated. Having individual > backends doing the wal transfer to the slave is never going to work > well.

Agreed.

> I figured we would have a single WAL streamer that continues advancing > forward in the WAL file, streaming to the standby. Backends would > update a shared memory variable specifying how far they want the wal > streamer to advance and send a signal to the wal streamer if necessary. > Backends would monitor another shared memory variable that specifies how > far the wal streamer has advanced.

Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for the send operation. The Write and Send operations can then continue independently of one another. XLogInsert() cannot advance to a new page while we are waiting to send or write. Notice that the Send process might be the bottleneck - that is the price of synchronous replication.

Backends then wait
* not at all for asynch commit
* just for Write for local synch commit
* for both Write and Send for remote synch commit (various additional options for what happens to confirm Send)

So normal backends neither write nor send. We have two dedicated processes, one for write, one for send. We need to put an extra test into the WALWriter loop so that it will continue immediately (with no wait) if there is an outstanding request for synchronous operation.

This gives us the Group Commit feature also, even if we are not using replication. So we can drop the commit_delay stuff.

XLogBackgroundFlush() processes a data page at a time if it can. That may not be the correct batch size for XLogBackgroundSend(), so we may need a tunable for the MTU.

Under heavy load we need the Write and Send to act in a way that maximises throughput rather than minimises response time, as we do now. If wal_buffers overflows, we continue to hold WALInsertLock while we wait for WALWriter and WALSender to complete. We should increase the default wal_buffers to 64.

After (or during) XLogInsert backends will sleep in a proc queue, similar to LWLocks and protected by a spinlock. When preparing to write/send, the WAL process should read the proc at the *tail* of the queue to see what the next LogwrtRqst should be. Then it performs its action and wakes procs up starting with the head of the queue. We would add an LSN into PGPROC, so WAL processes can check whether the backend should be woken. The LSN field can be accessed without spinlocks since it is only ever set by the backend itself and only read while a backend is sleeping. So we access the spinlock, find the tail, drop the spinlock, then read the LSN of the backend that (was) the tail.

Another thought occurs that we might measure the time a Send takes and specify a limit on how long we are prepared to wait for confirmation. Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. This would give better user behaviour across a highly variable network connection.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
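A rough sketch of the proc-queue idea above, with invented names (PGPROC would gain a waitLSN field and a queue link; a real patch would of course differ in the details):

typedef struct XLogWaitQueue
{
    slock_t     mutex;
    PGPROC     *head;           /* oldest waiter, woken first */
    PGPROC     *tail;           /* newest waiter, carries the furthest LSN */
} XLogWaitQueue;

/* Backend: record the LSN we need, join the queue, and sleep. */
static void
XLogQueueAndWait(XLogWaitQueue *q, XLogRecPtr lsn)
{
    MyProc->waitLSN = lsn;              /* assumed new PGPROC field */

    SpinLockAcquire(&q->mutex);
    /* ... append MyProc at q->tail ... */
    SpinLockRelease(&q->mutex);

    PGSemaphoreLock(&MyProc->sem, true);    /* sleep until woken */
}

/* WAL writer/sender: look only at the tail to learn the next request.
 * Reading waitLSN without the spinlock is safe because only the owning
 * backend sets it (before sleeping), and the WAL process reads it only
 * while that backend sleeps. */
static XLogRecPtr
XLogNextRequest(XLogWaitQueue *q)
{
    PGPROC     *tail;

    SpinLockAcquire(&q->mutex);
    tail = q->tail;
    SpinLockRelease(&q->mutex);

    return tail->waitLSN;       /* caller handles an empty queue */
}

After performing its write or send, the WAL process would then walk the queue from the head, waking every proc whose waitLSN has been satisfied.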
Simon Riggs wrote: > This gives us the Group Commit feature also, even if we are not using > replication. So we can drop the commit_delay stuff. Huh? How does that give us group commit? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 12:24 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > This gives us the Group Commit feature also, even if we are not using > > replication. So we can drop the commit_delay stuff. > > Huh? How does that give us group commit?

Multiple backends waiting while we perform a write. Commits then happen as a group (to WAL at least), hence Group Commit.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Multiple backends waiting while we perform a write. Commits then happen > as a group (to WAL at least), hence Group Commit.

The problem with our current commit protocol is this:

1. Backend A inserts commit record A
2. Backend A starts to flush commit record A
3. Backend B inserts commit record B
4. Backend B waits until 2. finishes
5. Backend B starts to flush commit record B

Note that we already have the logic to flush all pending commit records at once. If there's also a backend C that inserts its commit record after step 2, B and C will be flushed at once:

1. Backend A inserts commit record A
2. Backend A starts to flush commit record A
3. Backend B inserts commit record B
4. Backend B waits until 2. finishes
5. Backend C inserts commit record C
6. Backend C waits until 2. finishes
7. Flush A finishes. Backend B starts to flush commit records B + C

The idea of group commit is to insert a small delay in backend A between steps 1 and 2, so that we can flush both commit records in one fsync:

1. Backend A inserts commit record A
2. Backend A waits
3. Backend B inserts commit record B
4. Backend B starts to flush commit records A + B

The tricky part is, how does A know if it should wait, and for how long? commit_delay sure isn't ideal, but AFAICS the log shipping proposal doesn't provide any solution to that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
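For reference, the commit_delay heuristic being criticised here amounts to roughly the following, paraphrased from the commit path rather than quoted verbatim:

/* Before flushing its own commit record, a backend optionally naps,
 * hoping that other backends insert their commit records in the
 * meantime so that one fsync covers them all. */
if (CommitDelay > 0 && enableFsync &&
    CountActiveBackends() >= CommitSiblings)
    pg_usleep(CommitDelay);

XLogFlush(XactLastRecEnd);

The weakness Heikki points out is exactly the fixed, unconditional nap: the backend cannot tell whether anyone will actually show up with another commit record during the delay.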
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The tricky part is, how does A know if it should wait, and for how long? > commit_delay sure isn't ideal, but AFAICS the log shipping proposal > doesn't provide any solution to that.

They have no direct relation to each other, but they need similar synchronization modules. In log shipping, backends need to wait for the WAL Sender's job, and should wake up as fast as possible after the job is done. This is similar to the requirement of group commit.

Signals and locking, borrowed from Postgres-R, are now being studied for this purpose in the log shipping, but I'm not sure they can also be used for the group commit.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for > the send operation. The Write and Send operations can then continue > independently of one another. XLogInsert() cannot advance to a new page > while we are waiting to send or write.

Agreed. For realizing the various synchronous options, the Write and Send operations should be treated separately. So, I'll introduce an XLogCtlSend structure which holds shared state data for WAL sending. XLogCtlInsert might need a new field LogsndResult which indicates the byte position that we have already sent. As you say, AdvanceXLInsertBuffer() must check both the position that we have already written (fsynced) and the position that we have already sent. I'm doing the detailed design of this now :)

> Notice that the Send process > might be the bottleneck - that is the price of synchronous replication.

Really? In the benchmark results of my prototype, the bottleneck is still disk I/O. The communication latency (between the master and the slave) is smaller than the WAL write (fsync) latency. Of course, I assume that we use a decent network like 1000BASE-T. What makes the sender process the bottleneck?

> Backends then wait > * not at all for asynch commit > * just for Write for local synch commit > * for both Write and Send for remote synch commit > (various additional options for what happens to confirm Send)

I'd like to introduce a new parameter "synchronous_replication" which specifies whether backends wait for the response from the WAL sender process. By combining synchronous_commit and synchronous_replication, users can choose various options.

> After (or during) XLogInsert backends will sleep in a proc queue, > similar to LWLocks and protected by a spinlock. When preparing to > write/send, the WAL process should read the proc at the *tail* of the > queue to see what the next LogwrtRqst should be. Then it performs its > action and wakes procs up starting with the head of the queue. We would > add an LSN into PGPROC, so WAL processes can check whether the backend > should be woken. The LSN field can be accessed without spinlocks since > it is only ever set by the backend itself and only read while a backend > is sleeping. So we access the spinlock, find the tail, drop the spinlock, > then read the LSN of the backend that (was) the tail.

You mean only XLogInsert for commit records, or every XLogInsert? Anyway, ISTM that the response time gets worse :(

> Another thought occurs that we might measure the time a Send takes and > specify a limit on how long we are prepared to wait for confirmation. > Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. > This would give better user behaviour across a highly variable network > connection.

From the viewpoint of detecting a network failure, this feature is necessary. When the network goes down, the WAL sender can be blocked until it detects the network failure, i.e. the WAL sender keeps waiting for a response which never comes. A timeout notification is necessary in order to detect a network failure soon.

regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
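As a sketch, the new shared state could mirror the existing write/flush bookkeeping in XLogCtl. The layout below is an assumption about the forthcoming patch, not its actual contents:

/* Shared state for WAL sending, analogous to the write/flush pointers. */
typedef struct XLogCtlSend
{
    XLogRecPtr  SendRqst;       /* how far backends have asked to send */
    XLogRecPtr  SendResult;     /* how far the WAL sender has sent (and,
                                 * for sync replication, seen acked) */
    slock_t     info_lck;       /* protects the fields above */
} XLogCtlSend;

/* AdvanceXLInsertBuffer() would then treat a buffer page as busy until
 * both the local write and the send have passed its end (sketch only;
 * SendResult is assumed to be a local copy of the shared value): */
static bool
WALPageIsRecyclable(XLogRecPtr pageEnd)
{
    return !XLByteLT(LogwrtResult.Write, pageEnd) &&
           !XLByteLT(SendResult, pageEnd);
}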
Hi, ITAGAKI Takahiro wrote: > Signals and locking, borrowed from Postgres-R, are now being studied > for this purpose in the log shipping, but I'm not sure they can > also be used for the group commit.

Yeah. As Heikki points out, there is a completely orthogonal question WRT group commit: how does transaction A know if or how long it should wait for other transactions to file their WAL?

If we decide to do all of the WAL writing from a separate WAL writer process and let the backends communicate with it, then imessages might help again. But I currently don't think that's feasible.

Apart from possibly having similar IPC requirements, group commit and log shipping don't have much in common and should be considered separate features.

Regards Markus Wanner
Hi, Fujii Masao wrote: > Really? In the benchmark results of my prototype, the bottleneck is > still disk I/O. > The communication latency (between the master and the slave) is smaller than > the WAL write (fsync) latency. Of course, I assume that we use a decent network > like 1000BASE-T.

Sure. If you do WAL sending to the standby and WAL writing to disk in parallel, only the slower one is relevant (in case you want to wait for both). If that happens to be the disk, you won't see any performance degradation compared to standalone operation. If you want the standby to confirm having written (and flushed) the WAL to disk as well, that can't possibly be faster than the active node's local disk (assuming equally fast and busy disk subsystems).

> I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

Various config options have already been proposed. I personally don't think that helps us much. Instead, I'd prefer to see prototype code or at least concepts. We can juggle with the GUC variable names or other config options later on.

> From the viewpoint of detecting a network failure, this feature is necessary. > When the network goes down, the WAL sender can be blocked until it detects > the network failure, i.e. the WAL sender keeps waiting for a response which > never comes. A timeout notification is necessary in order to detect a > network failure soon.

That's one of the areas I'm missing from the overall concept. I'm glad it comes up. You certainly realize that such a timeout must be set high enough so as not to trigger "false negatives" every now and then? Or do you expect some sort of retry loop in case the link to the standby comes up again?

How about multiple standby servers?

Regards Markus Wanner
On Tue, 2008-09-09 at 20:12 +0900, Fujii Masao wrote: > What makes the sender process the bottleneck?

In my experience, the Atlantic. But I guess the Pacific does it too. :-)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Fujii Masao wrote: > What makes the sender process the bottleneck?

The keyword here is "might". There are many possibilities, like:

- Slow network.
- Ridiculously fast disk. Like a RAM disk. If you have a synchronous slave you can fail over to, putting WAL on a RAM disk isn't that crazy.
- Slower WAL disk on the slave.

etc.

>> Backends then wait >> * not at all for asynch commit >> * just for Write for local synch commit >> * for both Write and Send for remote synch commit >> (various additional options for what happens to confirm Send) > > I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

There's one thing I haven't figured out in this discussion. Does the write to the disk happen before or after the write to the slave? Can you guarantee that if a transaction is committed in the master, it's also committed in the slave, or vice versa?

>> Another thought occurs that we might measure the time a Send takes and >> specify a limit on how long we are prepared to wait for confirmation. >> Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. >> This would give better user behaviour across a highly variable network >> connection. > From the viewpoint of detecting a network failure, this feature is necessary. > When the network goes down, the WAL sender can be blocked until it detects > the network failure, i.e. the WAL sender keeps waiting for a response which > never comes. A timeout notification is necessary in order to detect a > network failure soon.

Agreed. But what happens if you hit that timeout? Should we enforce that timeout within the server, or should we leave that to the external heartbeat system?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 12:54 +0300, Heikki Linnakangas wrote: > Note that we already have the logic to flush all pending commit > records at once.

But only when you can grab WALInsertLock when flushing. If you look at the way I suggested, it does not rely upon that lock being available. So it is both responsive in low write rate conditions and yet efficient in high write rate conditions, and does not require us to specify a wait time.

IMHO the idea of a wait time is a confusion that comes from using a simple example (with respect). If we imagine the example slightly differently you'll see a different answer:

High write rate: A stream of commits comes so fast that by the time a write completes there are always > 1 backends waiting to commit again. In that case, there is never any need to wait because the arrival pattern requires us to issue writes as quickly as we can.

Medium write rate: Commits occur relatively frequently, so that the mean commits/flush is in the range 0.5 - 1. In this case, we can get better I/O efficiency by introducing waits. But note that a wait is risky, and at some point we may wait without another commit arriving. In this case, if the disk can keep up with the write rate, why would we want to improve I/O efficiency? There's no a priori way of calculating a useful wait time, so waiting is always a risk. Why would we risk damage to our response times when the disk can keep up with the write rate?

So for me, introducing a wait is something you might want to consider in medium rate conditions. Anything more or less than that and a wait is useless. So optimising for the case where the arrival rate is within a certain fairly tight range seems not worthwhile.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2008-09-09 at 20:12 +0900, Fujii Masao wrote: > I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

We already discussed that on -hackers. See "Transaction Controlled Robustness". But yes, something like that.

Please note the design mentions fsyncing after applying WAL. I'm sure you're aware we don't fsync after *applying* WAL now, and I hope we never do. You might want to fsync data to WAL files on the standby, but that is a slightly different thing.

> > After (or during) XLogInsert backends will sleep in a proc queue, > > similar to LWLocks and protected by a spinlock. When preparing to > > write/send, the WAL process should read the proc at the *tail* of the > > queue to see what the next LogwrtRqst should be. Then it performs its > > action and wakes procs up starting with the head of the queue. We would > > add an LSN into PGPROC, so WAL processes can check whether the backend > > should be woken. The LSN field can be accessed without spinlocks since > > it is only ever set by the backend itself and only read while a backend > > is sleeping. So we access the spinlock, find the tail, drop the spinlock, > > then read the LSN of the backend that (was) the tail. > > You mean only XLogInsert for commit records, or every XLogInsert?

Just the commit records, when synchronous_commit = on.

> Anyway, ISTM that the response time gets worse :(

No, because it would have had to wait in the queue for the WALWriteLock while prior writes occur. If the WALWriter sleeps on a semaphore, it too can be nudged into action at the appropriate time, so no need for a delay between the backend beginning to wait and WALWriter beginning to act. (Well, IPC delay between two processes, so some, but that is balanced against the efficiency of Send).

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2008-09-09 at 13:42 +0200, Markus Wanner wrote: > How about multiple standby servers?

There are various ways for getting things to work with multiple servers. I hope we can make this work with just a single standby before we try to make it work on more. There are various options for synchronous and asynchronous relay that will burden us if we try to consider all of that in the remaining 7 weeks we have.

So yes please, just not yet.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
"Fujii Masao" <masao.fujii@gmail.com> writes: > On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for >> the send operation. The Write and Send operations can then continue >> independently of one another. XLogInsert() cannot advance to a new page >> while we are waiting to send or write. > Agreed. "Agreed"? That last restriction is a deal-breaker. regards, tom lane
On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: > "Fujii Masao" <masao.fujii@gmail.com> writes: > > On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > >> Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for > >> the send operation. The Write and Send operations can then continue > >> independently of one another. XLogInsert() cannot advance to a new page > >> while we are waiting to send or write. > > > Agreed. > > "Agreed"? That last restriction is a deal-breaker.

OK, I should have said *if wal_buffers are full* XLogInsert() cannot advance to a new page while we are waiting to send or write. So I don't think it's a deal-breaker.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: >> "Agreed"? That last restriction is a deal-breaker. > OK, I should have said *if wal_buffers are full* XLogInsert() cannot > advance to a new page while we are waiting to send or write. So I don't > think it's a deal-breaker.

Oh, OK, that's obvious --- there's no place to put more data.

regards, tom lane
Hi, On Tuesday, September 9, 2008, Heikki Linnakangas wrote: > The tricky part is, how does A know if it should wait, and for how long? > commit_delay sure isn't ideal, but AFAICS the log shipping proposal > doesn't provide any solution to that.

It might just be that I'm not understanding what it's all about, but it seems to me that with a WALSender, process A will wait, whatever happens, either until the WAL is sent to the slave or written to disk on the slave. I naively read Simon's proposal as considering Group Commit done with this new feature.

A is already waiting (for some external event to complete), so why can't we use this to include some other transactions' commits into the local deal?

Regards, -- dim
Hi, Dimitri Fontaine wrote: > It might just be that I'm not understanding what it's all about, but it seems to me > that with a WALSender, process A will wait, whatever happens, either until the WAL is > sent to the slave or written to disk on the slave.

..and it will still have to wait until WAL is written to disk on the local node, as we do now. These are two different things to wait for. One is a network socket operation, the other is an fsync(). As these don't work together too well (blocking), you better run that in two different processes.

Regards Markus Wanner
On Tuesday, September 9, 2008, Markus Wanner wrote: > ..and it will still have to wait until WAL is written to disk on the > local node, as we do now. These are two different things to wait for. > One is a network socket operation, the other is an fsync(). As these > don't work together too well (blocking), you better run that in two > different processes.

Exactly the point. The process is now already waiting in all cases, so maybe we could just force waiting for some WALSender signal before issuing the fsync(), and we would then have Group Commit. I'm not sure this is a good idea at all; it's just the way I understand how adding a WALSender process to the mix could give us the Group Commit feature for free.

Regards, -- dim
On Tue, 2008-09-09 at 15:32 +0200, Dimitri Fontaine wrote: > The process is now already waiting in all cases

If the WALWriter|Sender is available, it can begin the task immediately. There is no need for it to wait if you want synchronous behaviour.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, ITAGAKI Takahiro wrote: > Signals and locking, borrowed from Postgres-R, are now being studied > for this purpose in the log shipping,

Cool. Let me know if you have any questions WRT this imessages stuff.

Regards Markus Wanner
Hi, Dimitri Fontaine wrote: > Exactly the point. The process is now already waiting in all cases, so maybe > we could just force waiting some WALSender signal before sending the fsync() > order, so we now have Group Commit. A single process can only wait on either fsync() or on select(), but not on both concurrently, because both syscalls are blocking. So mixing these into a single process is an inherently bad idea due to lack of parallelism. I fail to see how log shipping would ease or have any other impact on a Group Commit feature, which should clearly also work for stand alone servers, i.e. where there is no WAL sender process. Regards Markus Wanner
On Tuesday, September 9, 2008, Simon Riggs wrote: > If the WALWriter|Sender is available, it can begin the task immediately. > There is no need for it to wait if you want synchronous behaviour.

Ok. Now I'm as lost as anyone with respect to how you get Group Commit :)

-- dim
On Tue, 2008-09-09 at 16:05 +0200, Dimitri Fontaine wrote: > On Tuesday, September 9, 2008, Simon Riggs wrote: > > If the WALWriter|Sender is available, it can begin the task immediately. > > There is no need for it to wait if you want synchronous behaviour. > > Ok. Now I'm as lost as anyone with respect to how you get Group Commit :)

OK, sorry. Please read my reply to Heikki on a different subthread of this topic; he had the same question.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: >>> "Agreed"? That last restriction is a deal-breaker. > >> OK, I should have said *if wal_buffers are full* XLogInsert() cannot >> advance to a new page while we are waiting to send or write. So I don't >> think its a deal breaker. > > Oh, OK, that's obvious --- there's no place to put more data. Each WAL sender can keep at most one page locked at a time, right? So, that should never happen if wal_buffers > 1 + n_wal_senders. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 17:17 +0300, Heikki Linnakangas wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > >> On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: > >>> "Agreed"? That last restriction is a deal-breaker. > > > >> OK, I should have said *if wal_buffers are full* XLogInsert() cannot > >> advance to a new page while we are waiting to send or write. So I don't > >> think its a deal breaker. > > > > Oh, OK, that's obvious --- there's no place to put more data. > > Each WAL sender can keep at most one page locked at a time, right? So, > that should never happen if wal_buffers > 1 + n_wal_senders. Don't understand. I am referring to the logic at the top of AdvanceXLInsertBuffer(). We would need to wait for all processes reading the contents of wal_buffers. Currently, there is no page locking on the WAL buffers, though I have suggested some for increasing XLogInsert() performance. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Don't understand. I am referring to the logic at the top of > AdvanceXLInsertBuffer(). We would need to wait for all people reading > the contents of wal_buffers. Oh, I see. If a slave falls behind, how does it catch up? I guess you're saying that it can't fall behind, because the master will block before that happens. Also in asynchronous replication? And what about when the slave is first set up, and needs to catch up with the master? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 18:26 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Don't understand. I am referring to the logic at the top of > > AdvanceXLInsertBuffer(). We would need to wait for all people reading > > the contents of wal_buffers. > > Oh, I see. > > If a slave falls behind, how does it catch up? That is the right question. > I guess you're saying > that it can't fall behind, because the master will block before that > happens. Also in asynchronous replication? Yes, it can fall behind in async mode. The sysadmin must not let it. > And what about when the slave > is first set up, and needs to catch up with the master? We need an initial joining mode while they "match speed". We must allow for the case where the standby has been recycled, or the network has been down for a medium-long period of time. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
> > Don't understand. I am referring to the logic at the top of > > AdvanceXLInsertBuffer(). We would need to wait for all people reading > > the contents of wal_buffers. > > Oh, I see. > > If a slave falls behind, how does it catch up? I guess you're saying > that it can't fall behind, because the master will block before that > happens. Also in asynchronous replication? And what about > when the slave > is first set up, and needs to catch up with the master? I think the WAL Sender needs the ability to read the WAL files directly. In cases where it falls behind, or just started, it needs to be able to catch up. So, it seems we either need to copy the WAL buffer into local memory before sending, or "lock" the WAL buffer until the send has finished. Useful network timeouts are in the >= 5-10 sec range (even for GbE LAN), so I don't think locking WAL buffers is feasible. Thus the WAL sender needs to copy (the needed portion of the current WAL buffer) before sending (or use an async send that immediately returns when the buffer is copied into the network stack). When the WAL sender is ready to continue it either still finds the next WAL buffer (or the rest of the current buffer) or it needs to fall back to Plan B and read the WAL files again. A sync client could still wait for the replica, even if local WAL has already advanced massively. The checkpointer would need the LSN info from WAL senders to not reuse any still-needed WAL files, although in that case it might be time to declare the replica broken. Ideally the WAL sender also knows whether the client waits, so it can decide to send a part of a buffer. The WAL sender should wake and act whenever a "network packet" full of WAL buffer is ready, regardless of commits. Whatever size of send seems appropriate here (might be one WAL page). The WAL Sender should only need to expect a response when it has sent a commit record, ideally only if a client is waiting (and once in a while at least for every log switch). All in all a useful streamer seems like a lot of work. Andreas
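A rough sketch of the copy-before-send idea (names and the lock functions are invented stand-ins, e.g. for an LWLock; this is not actual backend code):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define XLOG_BLCKSZ 8192

    /* Stand-ins for one shared WAL buffer page and its lock. */
    static char shared_wal_page[XLOG_BLCKSZ];
    static void lock_wal_page(void)   { /* e.g. LWLockAcquire(..., LW_SHARED) */ }
    static void unlock_wal_page(void) { /* e.g. LWLockRelease(...) */ }

    /*
     * Copy the used part of the page into sender-local memory while the lock
     * is held only for the memcpy(), then do the (possibly slow, blocking)
     * network send with no lock held at all.
     */
    static ssize_t send_wal_page(int sock_fd, size_t used)
    {
        char local[XLOG_BLCKSZ];

        lock_wal_page();
        memcpy(local, shared_wal_page, used);
        unlock_wal_page();

        return send(sock_fd, local, used, 0);   /* WAL buffers stay unlocked */
    }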
On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > There's one thing I haven't figured out in this discussion. Does the write > to the disk happen before or after the write to the slave? Can you guarantee > that if a transaction is committed in the master, it's also committed in the > slave, or vice versa? We can guarantee that a transaction is committed in both the master and the slave if we wait until one has fsynced the WAL to disk and the other holds it in memory or on disk. Even if one fails, the other can continue service. Even if both fail, the node which wrote the WAL can continue service. No transaction is lost in either case. > Agreed. But what happens if you hit that timeout? The stand-alone master continues service when it hits that timeout. On the other hand, the slave waits for an order from the sysadmin or the clustering software, and then either exits or becomes the master. > Should we enforce that > timeout within the server, or should we leave that to the external heartbeat > system? Within the server. Not all users run such an external system. It's not simple for an external system to leave the master running stand-alone. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Sep 9, 2008 at 8:42 PM, Markus Wanner <markus@bluegap.ch> wrote: >> In the viewpoint of detection of a network failure, this feature is >> necessary. >> When the network goes down, WAL sender can be blocked until it detects >> the network failure, i.e. WAL sender keeps waiting for the response which >> never comes. A timeout notification is necessary in order to detect a >> network failure soon. > > That's one of the areas I'm missing from the overall concept. I'm glad it > comes up. You certainly realize, that such a timeout must be set high enough > so as not to trigger "false negatives" every now and then? Yes. And, as you know, there is a trade-off between false detection of a network failure and how long the WAL sender is blocked. I'll provide not only that timeout but also a keepalive for the network between the master and the slave. I expect that the keepalive eases that trade-off. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
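Part of that can be delegated to the TCP layer. A sketch of turning on keepalive for the replication socket (the values are arbitrary, TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific, and an application-level ack timeout is still needed on top of this):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Detect a dead peer even while the connection is idle; the WAL sender
     * still needs its own timeout for the "ack never arrives" case. */
    static int enable_keepalive(int sock_fd)
    {
        int on = 1, idle = 30, interval = 10, count = 3;

        if (setsockopt(sock_fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
    #endif
        return 0;
    }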
On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > If a slave falls behind, how does it catch up? I guess you're saying that it > can't fall behind, because the master will block before that happens. Also > in asynchronous replication? And what about when the slave is first set up, > and needs to catch up with the master? The mechanism for the slave to catch up with the master should be provided outside of postgres. I think that postgres should provide only WAL streaming, i.e. the master always sends *current* WAL data to the slave. Of course, the master also has to send the current WAL *file* in the initial transfer, just after the slave starts and connects to it. Because, at that time, the current WAL position might be in the middle of a WAL file. Even if the master sends only current WAL data, a slave which doesn't have the corresponding WAL file cannot handle it. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > If a slave falls behind, how does it catch up? I guess you're saying that it > > can't fall behind, because the master will block before that happens. Also > > in asynchronous replication? And what about when the slave is first set up, > > and needs to catch up with the master? > > The mechanism for the slave to catch up with the master should be > provided on the outside of postgres. So you mean that we still need to do initial setup (copy backup files and ship and replay WAL segments generated during copy) by external WAL-shipping tools, like walmgr.py, and then at some point switch over to internal WAL-shipping, when we are sure that we are within the same WAL file on both master and slave? > I think that postgres should provide > only WAL streaming, i.e. the master always sends *current* WAL data > to the slave. > > Of course, the master has to send also the current WAL *file* in the > initial sending just after the slave starts and connects with it. I think that it needs to send all WAL files which the slave does not yet have, or else the slave will have gaps. On a busy system you will generate several new WAL files in the time it takes to make the master copy, transfer it to the slave and apply the WAL files generated during the initial setup. > Because, at the time, current WAL position might be in the middle of > WAL file. Even if the master sends only current WAL data, the slave > which don't have the corresponding WAL file can not handle it. I agree that making the initial copy may be outside the scope of Synchronous Log Shipping Replication, but the slave catching up by requesting all missing WAL files and applying them up to a point where it can switch to Sync mode should be in. Else we gain very little from this patch. --------------- Hannu
On Wed, Sep 10, 2008 at 12:05 PM, Hannu Krosing <hannu@krosing.net> wrote: > > >> Because, at the time, current WAL position might be in the middle of >> WAL file. Even if the master sends only current WAL data, the slave >> which don't have the corresponding WAL file can not handle it. > > I agree, that making initial copy may be outside the scope of > Synchronous Log Shipping Replication, but slave catching up by > requesting all missing WAL files and applying these up to a point when > it can switch to Sync mode should be in. Else we gain very little from > this patch. > I agree. We should leave the initial backup acquisition out of the scope at least for the first phase, but provide a mechanism to do the initial catch-up, as it may get messy to do it completely outside of the core. The slave will need to be able to buffer the *current* WAL until it gets the missing WAL files and then continue. Also we may not want the master to be stuck while the slave is doing the catchup. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote: > On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: > > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas > > <heikki.linnakangas@enterprisedb.com> wrote: > > > If a slave falls behind, how does it catch up? I guess you're saying that it > > > can't fall behind, because the master will block before that happens. Also > > > in asynchronous replication? And what about when the slave is first set up, > > > and needs to catch up with the master? > > > > The mechanism for the slave to catch up with the master should be > > provided on the outside of postgres. > > So you mean that we still need to do initial setup (copy backup files > and ship and replay WAL segments generated during copy) by external > WAL-shipping tools, like walmgr.py, and then at some point switch over > to internal WAL-shipping, when we are sure that we are within same WAL > file on both master and slave ? > > > I think that postgres should provide > > only WAL streaming, i.e. the master always sends *current* WAL data > > to the slave. > > > > Of course, the master has to send also the current WAL *file* in the > > initial sending just after the slave starts and connects with it. > > I think that it needs to send all WAL files which slave does not yet > have, as else the slave will have gaps. On busy system you will generate > several new WAL files in the time it takes to make master copy, transfer > it to slave and apply WAL files generated during initial setup. > > > Because, at the time, current WAL position might be in the middle of > > WAL file. Even if the master sends only current WAL data, the slave > > which don't have the corresponding WAL file can not handle it. > > I agree, that making initial copy may be outside the scope of > Synchronous Log Shipping Replication, but slave catching up by > requesting all missing WAL files and applying these up to a point when > it can switch to Sync mode should be in. Else we gain very little from > this patch. I agree with Hannu. Any working solution needs to work for all required phases. If you did it this way, you'd never catch up at all. When you first make the copy, it will be made at time X. The point of consistency will be sometime later and requires WAL data to make it consistent. So you would need to do a PITR to get it to the point of consistency. While you've been doing that, the primary server has moved on and now there is a gap between primary and standby. You *must* provide a facility to allow the standby to catch up with the primary. Only sending *current* WAL is not a solution, and not acceptable. So there must be mechanisms for sending past *and* current WAL data to the standby, and an exact and careful mechanism for switching between the two modes when the time is right. Replication is only synchronous *after* the change in mode. So the protocol needs to be something like: 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after the first consistent stopping point in WAL after standby has performed its own crash recovery, if any was required). 2. primary initiates data transfer of old data to standby, starting at point X 3. standby tells primary where it has got to periodically 4. at some point primary decides primary and standby are close enough that it can now begin streaming "current WAL" (which is always the WAL up to wal_buffers behind the current WAL insertion point). 
Bear in mind that unless wal_buffers > 16MB the final catchup will *always* be less than one WAL file, so external file-based mechanisms alone could never be enough. So you would need wal_buffers >= 2000 to make an external catch-up facility even work at all. This also probably means that receipt of WAL data on the standby cannot be achieved by placing it in wal_buffers. So we probably need to write it directly to the WAL files, then rely on the filesystem cache on the standby to buffer the data for use by ReadRecord. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
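A rough sketch of the mode switch described in the protocol above, using invented names and a simplified 64-bit WAL position instead of the real XLogRecPtr; the only point is that the primary keeps shipping old WAL until the standby's acknowledged position is at or beyond the oldest position still present in wal_buffers:

    #include <stdint.h>

    typedef uint64_t WalPos;        /* simplified stand-in for XLogRecPtr */

    typedef enum { WALSND_CATCHUP, WALSND_STREAMING } WalSndMode;

    /*
     * standby_acked:   highest WAL position the standby has confirmed
     * oldest_buffered: lowest WAL position still available in wal_buffers
     */
    static WalSndMode choose_mode(WalPos standby_acked, WalPos oldest_buffered)
    {
        if (standby_acked >= oldest_buffered)
            return WALSND_STREAMING;    /* no gap left: live streaming can begin */
        return WALSND_CATCHUP;          /* keep shipping old WAL from files */
    }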
On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: > On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > There's one thing I haven't figured out in this discussion. Does the write > > to the disk happen before or after the write to the slave? Can you guarantee > > that if a transaction is committed in the master, it's also committed in the > > slave, or vice versa? The write happens concurrently and independently on both. Yes, you wait for the write *and* send pointer to be "flushed" before you allow a synch commit with synch replication. (Definition of flushed is changeable by parameters). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-09-10 at 12:24 +0530, Pavan Deolasee wrote: > Also we may not want the master to be stuck while slave is doing the catchup. No, since it may take hours, not seconds. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). Hm.. wouldn't it be simpler to start streaming right away and "cache" that on the standby until it can be applied? I.e. a protocol like: 1. - same as above - 2. primary starts streaming from live or hot data from its current position Y in the WAL stream, which is certainly after (or probably equal to) X. 3. standby receives the hot stream from point Y on. It now knows it misses 'cold' portions of the WAL from X to Y and requests them. 4. primary serves remaining 'cold' WAL chunks from its xlog / archive from between X and Y. 5. standby applies 'cold' WAL, until done. Then proceeds with the cached WAL segments from 'hot' streaming. > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. Agreed. > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. Makes sense, especially in the case of cached WAL as outlined above. Is this a problem in any way? Regards Markus Wanner
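The standby-side ordering in this variant might look roughly like the following (invented helper names; in reality the 'cold' fetch and the buffering of the 'hot' stream would be considerably more involved):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t WalPos;        /* simplified WAL position */

    /* Hypothetical helpers, defined elsewhere in a real implementation. */
    extern bool fetch_and_apply_cold_range(WalPos from, WalPos to);
    extern void apply_buffered_hot_stream(WalPos from);

    /*
     * X = standby's own consistent position, Y = first position seen on the
     * live ('hot') stream.  The hot stream is cached from Y on while the
     * missing 'cold' range [X, Y) is requested and applied first.
     */
    static void standby_catchup(WalPos X, WalPos Y)
    {
        if (X < Y && !fetch_and_apply_cold_range(X, Y))
            return;     /* cold WAL no longer available: a new base backup is needed */

        apply_buffered_hot_stream(Y);   /* then continue with the cached hot WAL */
    }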
Simon Riggs wrote: > On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: >> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> There's one thing I haven't figured out in this discussion. Does the write >>> to the disk happen before or after the write to the slave? Can you guarantee >>> that if a transaction is committed in the master, it's also committed in the >>> slave, or vice versa? > > The write happens concurrently and independently on both. > > Yes, you wait for the write *and* send pointer to be "flushed" before > you allow a synch commit with synch replication. (Definition of flushed > is changeable by parameters). The thing that bothers me is the behavior when the synchronous slave doesn't respond. A timeout has been discussed, after which the master just gives up on sending, and starts acting as if there's no slave. How's that different from asynchronous mode where WAL is sent to the server concurrently when it's flushed to disk, but we don't wait for the send to finish? ISTM that in both cases the only guarantee we can give is that when a transaction is acknowledged as committed, it's committed in the master but not necessarily in the slave. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 10, 2008 at 1:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > The thing that bothers me is the behavior when the synchronous slave doesn't > respond. A timeout has been discussed, after which the master just gives up > on sending, and starts acting as if there's no slave. How's that different > from asynchronous mode where WAL is sent to the server concurrently when > it's flushed to disk, but we don't wait for the send to finish? ISTM that in > both cases the only guarantee we can give is that when a transaction is > acknowledged as committed, it's committed in the master but not necessarily > in the slave. > I think there is one difference. Assuming that the timeouts happen infrequently, most of the time the slave is in sync with the master and that can be reported to the user. Whereas in async mode, the slave will *always* be out of sync. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-09-10 at 11:10 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: > >> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas > >> <heikki.linnakangas@enterprisedb.com> wrote: > >>> There's one thing I haven't figured out in this discussion. Does the write > >>> to the disk happen before or after the write to the slave? Can you guarantee > >>> that if a transaction is committed in the master, it's also committed in the > >>> slave, or vice versa? > > > > The write happens concurrently and independently on both. > > > > Yes, you wait for the write *and* send pointer to be "flushed" before > > you allow a synch commit with synch replication. (Definition of flushed > > is changeable by parameters). > > The thing that bothers me is the behavior when the synchronous slave > doesn't respond. A timeout has been discussed, after which the master > just gives up on sending, and starts acting as if there's no slave. > How's that different from asynchronous mode where WAL is sent to the > server concurrently when it's flushed to disk, but we don't wait for the > send to finish? ISTM that in both cases the only guarantee we can give > is that when a transaction is acknowledged as committed, it's committed > in the master but not necessarily in the slave. We should differentiate between what the WALsender does and what the user does in response to a network timeout. Saying "I want to wait for a synchronous commit and I am willing to wait forever to ensure it" leads to long hangs in some cases. I was suggesting that some users may wish to wait up to time X before responding to the commit. The WALSender may keep retrying long after that point, but that doesn't mean all current users need to do that also. The user would need to say whether the response to the timeout should be an error, or whether to just accept it and get on with things. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Saying "I want to wait for a synchronous commit and I am willing to wait > for ever to ensure it" leads to long hangs in some cases. Sure. That's the fundamental problem with synchronous replication. That's why many people choose asynchronous replication instead. Clearly at some point you'll want to give up and continue without the slave, or kill the master and fail over to the slave. I'm wondering how that's different than the lag between master and server in asynchronous replication from the client's point of view. > I was suggesting that some users may wish to wait up to time X before > responding to the commit. The WALSender may keep retrying long after > that point, but that doesn't mean all current users need to do that > also. The user would need to say whether the response to the timeout was > an error, or just accept and get on with it. I'm not sure I understand that paragraph. Who's the user? Do we need to expose some new information to the client so that it can do something? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 20:59 +0200, Zeugswetter Andreas OSB sIT wrote: > All in all a useful streamer seems like a lot of work. I mentioned some time ago an alternative idea of having the slave connect through a normal SQL connection and call a function which streams the WAL file from the point requested by the slave... wouldn't that be feasible? All the connection handling would already be there; only the streaming function would need to be implemented. It could even use SSL connections if needed. Then you would have one normal backend per slave, and they should access either the files directly or possibly some shared area where the WAL is buffered for this purpose... the streaming function could also take care of signaling the "up-to-dateness" of the slaves in case of synchronous replication. There could also be some system table infrastructure to track the slaves. There could also be some functions to stream the files of the DB through normal backends, so a slave could be bootstrapped all the way from copying the files through a simple postgres backend connection... that would make for the easiest possible setup of a slave: configure a connection to the master, and hit "run"... and last but not least the same interface could be used by a PITR repository client for archiving the WAL stream and occasional file system snapshots. Cheers, Csaba.
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote: > Hi, > > Simon Riggs wrote: > > 1. Standby contacts primary and says it would like to catch up, but is > > currently at point X (which is a point at, or after the first consistent > > stopping point in WAL after standby has performed its own crash > > recovery, if any was required). > > 2. primary initiates data transfer of old data to standby, starting at > > point X > > 3. standby tells primary where it has got to periodically > > 4. at some point primary decides primary and standby are close enough > > that it can now begin streaming "current WAL" (which is always the WAL > > up to wal_buffers behind the the current WAL insertion point). > > Hm.. wouldn't it be simpler, to start streaming right away and "cache" Good idea! This makes everything simpler, as the user has to do only 4 things: 1. start slave in "receive WAL, don't apply" mode 2. start walshipping on master 3. copy files from master to slave. 4. restart slave in "receive WAL" mode all else will happen automatically. --------------- Hannu
On Wed, 2008-09-10 at 08:15 +0100, Simon Riggs wrote: > Any working solution needs to work for all required phases. If you did > it this way, you'd never catch up at all. > > When you first make the copy, it will be made at time X. The point of > consistency will be sometime later and requires WAL data to make it > consistent. So you would need to do a PITR to get it to the point of > consistency. While you've been doing that, the primary server has moved > on and now there is a gap between primary and standby. You *must* > provide a facility to allow the standby to catch up with the primary. > Only sending *current* WAL is not a solution, and not acceptable. > > So there must be mechanisms for sending past *and* current WAL data to > the standby, and an exact and careful mechanism for switching between > the two modes when the time is right. Replication is only synchronous > *after* the change in mode. > > So the protocol needs to be something like: > > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). > > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. So you would need wal_buffers >= 2000 to > make an external catch up facility even work at all. > > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. And this catchup may need to be done repeatedly, in case of network failure. I don't think that the slave automatically becoming a master if it detects network failure (as suggested elsewhere in this thread) is an acceptable solution, as it will more often than not result in two masters. A better solution would be: 1. Slave just keeps waiting for new WAL records and confirming receipt, storage to disk, and application. 2. Master is in one of at least two states: 2.1 - Catchup - Async mode where it is sending old logs and wal records to the slave 2.2 - Sync Replication - Sync mode, where COMMIT does not return before confirmation from WALSender. The initial mode is Catchup, which is promoted to Sync Replication when the delay of WAL application is reasonably small. When the Master detects a network outage (== delay bigger than acceptable) it will either just send a NOTICE to all clients and fall back to Catchup, or raise an ERROR (and still fall back to Catchup). This is the point where external HA / Heartbeat etc. solutions would intervene and decide what to do. ----------------- Hannu
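A sketch of the master-side state handling Hannu outlines (invented names; report_notice/report_error stand in for ereport(NOTICE)/ereport(ERROR), and whether a timeout is a NOTICE or an ERROR would presumably be policy, e.g. a GUC):

    typedef enum { REPL_CATCHUP, REPL_SYNC } ReplMode;

    static ReplMode repl_mode = REPL_CATCHUP;

    /* Stand-ins for ereport(NOTICE, ...) / ereport(ERROR, ...). */
    extern void report_notice(const char *msg);
    extern void report_error(const char *msg);

    /* Called when the standby's applied position is close enough to current. */
    static void promote_to_sync(void)
    {
        repl_mode = REPL_SYNC;
    }

    /* Called from the commit path when waiting for the ack exceeds the timeout:
     * fall back to catch-up (async) mode and let the admin / HA layer decide. */
    static void handle_ack_timeout(int error_on_timeout)
    {
        repl_mode = REPL_CATCHUP;
        if (error_on_timeout)
            report_error("standby did not acknowledge the commit in time");
        else
            report_notice("falling back to catch-up (asynchronous) replication");
    }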
On Wed, Sep 10, 2008 at 4:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote: >> On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: >> > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas >> > <heikki.linnakangas@enterprisedb.com> wrote: >> > > If a slave falls behind, how does it catch up? I guess you're saying that it >> > > can't fall behind, because the master will block before that happens. Also >> > > in asynchronous replication? And what about when the slave is first set up, >> > > and needs to catch up with the master? >> > >> > The mechanism for the slave to catch up with the master should be >> > provided on the outside of postgres. >> >> So you mean that we still need to do initial setup (copy backup files >> and ship and replay WAL segments generated during copy) by external >> WAL-shipping tools, like walmgr.py, and then at some point switch over >> to internal WAL-shipping, when we are sure that we are within same WAL >> file on both master and slave ? >> >> > I think that postgres should provide >> > only WAL streaming, i.e. the master always sends *current* WAL data >> > to the slave. >> > >> > Of course, the master has to send also the current WAL *file* in the >> > initial sending just after the slave starts and connects with it. >> >> I think that it needs to send all WAL files which slave does not yet >> have, as else the slave will have gaps. On busy system you will generate >> several new WAL files in the time it takes to make master copy, transfer >> it to slave and apply WAL files generated during initial setup. >> >> > Because, at the time, current WAL position might be in the middle of >> > WAL file. Even if the master sends only current WAL data, the slave >> > which don't have the corresponding WAL file can not handle it. >> >> I agree, that making initial copy may be outside the scope of >> Synchronous Log Shipping Replication, but slave catching up by >> requesting all missing WAL files and applying these up to a point when >> it can switch to Sync mode should be in. Else we gain very little from >> this patch. > > I agree with Hannu. > > Any working solution needs to work for all required phases. If you did > it this way, you'd never catch up at all. > > When you first make the copy, it will be made at time X. The point of > consistency will be sometime later and requires WAL data to make it > consistent. So you would need to do a PITR to get it to the point of > consistency. While you've been doing that, the primary server has moved > on and now there is a gap between primary and standby. You *must* > provide a facility to allow the standby to catch up with the primary. > Only sending *current* WAL is not a solution, and not acceptable. > > So there must be mechanisms for sending past *and* current WAL data to > the standby, and an exact and careful mechanism for switching between > the two modes when the time is right. Replication is only synchronous > *after* the change in mode. > > So the protocol needs to be something like: > > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. 
at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). > > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. So you would need wal_buffers >= 2000 to > make an external catch up facility even work at all. > > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > Umm.. I disagree with you ;) Here is my initial setup sequence. 1) Start WAL receiver. The current WAL file and subsequent ones will be transmitted by WAL sender and WAL receiver. This transmission will not block the following operation for initial setup, and vice versa. That is, the slave can catch up with the master without blocking the master. I cannot accept that WAL sender is blocked for initial setup. 2) Copy the missing history files from the master to the slave. 3) Prepare recovery.conf on the slave. You have to configure pg_standby and set recovery_target_timeline to 'latest' or the current TLI of the master. 4) Start postgres. The startup process and pg_standby start archive recovery. If there are missing WAL files, pg_standby waits for them and WAL replay is suspended. 5) Copy the missing WAL files from the master to the slave. Of course, we don't need to copy the WAL files which are transmitted by WAL sender and WAL receiver. Then, the recovery is resumed. My sequence covers several cases: * There is no missing WAL file. * There are a lot of missing WAL files. * There are missing history files. Failover always generates a gap in the history file because the TLI is incremented when archive recovery is completed. ... In your design, doesn't the initial setup block the master? Does your design cover the above-mentioned cases? regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote: > Hi, > > Simon Riggs wrote: > > 1. Standby contacts primary and says it would like to catch up, but is > > currently at point X (which is a point at, or after the first consistent > > stopping point in WAL after standby has performed its own crash > > recovery, if any was required). > > 2. primary initiates data transfer of old data to standby, starting at > > point X > > 3. standby tells primary where it has got to periodically > > 4. at some point primary decides primary and standby are close enough > > that it can now begin streaming "current WAL" (which is always the WAL > > up to wal_buffers behind the the current WAL insertion point). > > Hm.. wouldn't it be simpler, to start streaming right away and "cache" The standby server won't come up until you have: * copied the base backup * sent it to the standby server * brought up the standby, had it realise it is a replication partner and begun requesting WAL from the primary (in some way) There will be a gap (probably) between the initial WAL files and the current tail of wal_buffers by the time all of the above has happened. We will then need to copy more WAL across until we get to a point where the most recent WAL record available on standby is ahead of the tail of wal_buffers on primary so that streaming can start. If we start caching WAL right away we would need to have two receivers. One to receive the missing WAL data and one to receive the current WAL data. We can't apply the WAL until we have the earlier missing WAL data, so caching it seems difficult. On a large server this might be GBs of data. Seems easier to not cache current WAL and to have just a single WALReceiver process that performs a mode change once it has caught up. (And I should say "if it catches up", since it is possible that it never actually will catch up, in practical terms, as this depends upon the relative power of the servers involved.). So there's no need to store more WAL on standby than is required to restart recovery from last restartpoint. i.e. we stream WAL at all times, not just in normal running mode. Seems easiest to have: * Startup process only reads next WAL record when the ReceivedLogPtr > ReadRecPtr, so it knows nothing of how WAL is received. Startup process reads directly from WAL files in *all* cases. ReceivedLogPtr is in shared memory and accessed via spinlock. Startup process only ever reads this pointer. (Notice that Startup process is modeless). * WALReceiver reads data from primary and writes it to WAL files, fsyncing (if ever requested to do so). WALReceiver updates ReceivedLogPtr. That is much simpler and more modular. Buffering of the WAL files is handled by filesystem buffering. If standby crashes, all data is safely written to WAL files and we restart from the correct place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
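A minimal sketch of the shared hand-off Simon describes, written as a backend fragment (the struct and field names are invented; a plain uint64 stands in for the real XLogRecPtr of that era):

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical shared-memory area set up at postmaster start. */
    typedef struct ReceivedWalShmem
    {
        slock_t mutex;          /* protects receivedUpTo */
        uint64  receivedUpTo;   /* WAL up to here is safely in the standby's files */
    } ReceivedWalShmem;

    static ReceivedWalShmem *RcvShmem;

    /* WALReceiver: after writing (and, if requested, fsyncing) the WAL files. */
    static void advance_received_ptr(uint64 newPos)
    {
        SpinLockAcquire(&RcvShmem->mutex);
        if (newPos > RcvShmem->receivedUpTo)
            RcvShmem->receivedUpTo = newPos;
        SpinLockRelease(&RcvShmem->mutex);
    }

    /* Startup process: may ReadRecord() proceed past readPos yet? */
    static bool may_read_past(uint64 readPos)
    {
        uint64 received;

        SpinLockAcquire(&RcvShmem->mutex);
        received = RcvShmem->receivedUpTo;
        SpinLockRelease(&RcvShmem->mutex);

        return readPos < received;
    }

The Startup process stays modeless because it only ever reads the shared pointer and the WAL files; only the WALReceiver advances the pointer.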
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > I cannot accept that WAL sender is blocked for initial setup. Yes, very important point. We definitely agree on that. The primary must be able to continue working while all this setup is happening. No tables are locked, all commits are allowed etc. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, On Wednesday, September 10, 2008, Heikki Linnakangas wrote: > Sure. That's the fundamental problem with synchronous replication. > That's why many people choose asynchronous replication instead. Clearly > at some point you'll want to give up and continue without the slave, or > kill the master and fail over to the slave. I'm wondering how that's > different than the lag between master and server in asynchronous > replication from the client's point of view. As a future user of these new facilities, the difference from the client's POV is simple: in normal mode of operation, we want a strong guarantee that any COMMIT has made it to both the master and the slave at commit time. No lag whatsoever. You're considering lag as an option in case of failure, but I don't see this as acceptable when you need sync commit. In case of a network timeout, the cluster is down. So you want to either continue servicing in degraded mode or take the service down while you repair the cluster, but neither of those choices can be transparent to the admins, I'd argue. Of course, the main use case is high availability, which tends to say you do not have the option to stop service, and seems to dictate continuing to service in degraded mode: slave can't keep up (whatever the error domain), master is alone, "advertise" to monitoring solutions and continue servicing. And provide some way for the slave to "rejoin", maybe, too. > I'm not sure I understand that paragraph. Who's the user? Do we need to > expose some new information to the client so that it can do something? Maybe with some GUCs to set the acceptable "timeout" for the WAL sync process, and whether reaching the timeout is a warning or an error. With a userset GUC we could even have replication-error-level transactions concurrent with non-critical ones... Now what to do exactly in case of error remains to be decided... HTH, Regards, -- dim
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > Umm.. I disagree with you ;) That's no problem and I respect your knowledge. If we disagree, it is very likely because we have misunderstood each other. Much has been written, so I will wait for it to all be read and understood by you and others, and for me to read other posts and replies also. I feel sure that after some thought a clear consensus will emerge, and I feel hopeful that the feature can be done in the time available with simple code changes. So I will stop replying for a few hours to give everyone time (incl. me). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > The standby server won't come up until you have: > * copied the base backup > * sent it to standby server > * bring up standby, have it realise it is a replication partner and > begin requesting WAL from primary (in some way) Right. That was your assumption as well. Required before step 1 in both cases. > If we start caching WAL right away we would need to have two receivers. > One to receive the missing WAL data and one to receive the current WAL > data. We can't apply the WAL until we have the earlier missing WAL data, > so cacheing it seems difficult. You could use the same receiver process and just handle different packets differently. I see no need for two separate receiver processes here. > On a large server this might be GBs of > data. ..if served from a log archive, correct. Without archiving, we are limited to xlog anyway. > Seems easier to not cache current WAL and to have just a single > WALReceiver process that performs a mode change once it has caught up. > (And I should say "if it catches up", since it is possible that it never > actually will catch up, in practical terms, since this depends upon the > relative power of the servers involved.). So there's no need to store > more WAL on standby than is required to restart recovery from last > restartpoint. i.e. we stream WAL at all times, not just in normal > running mode. Switching between streaming from files and 'live' streaming on the active node seems difficult to me, because you need to make sure there's no gap. That problem could be circumvented by handling this on the standby. If you think switching on the active is simple enough, that's fine. > Seems easiest to have: > * Startup process only reads next WAL record when the ReceivedLogPtr > > ReadRecPtr, so it knows nothing of how WAL is received. Startup process > reads directly from WAL files in *all* cases. ReceivedLogPtr is in > shared memory and accessed via spinlock. Startup process only ever reads > this pointer. (Notice that Startup process is modeless). Well, that's certainly easier for the standby, but requires mode switching on the active. Regards Markus Wanner
On Wed, 2008-09-10 at 11:07 +0200, Dimitri Fontaine wrote: > Hi, > > On Wednesday, September 10, 2008, Heikki Linnakangas wrote: > > Sure. That's the fundamental problem with synchronous replication. > > That's why many people choose asynchronous replication instead. Clearly > > at some point you'll want to give up and continue without the slave, or > > kill the master and fail over to the slave. I'm wondering how that's > > different than the lag between master and server in asynchronous > > replication from the client's point of view. > > As a future user of this new facilities, the difference from client's POV is > simple : in normal mode of operation, we want a strong guarantee that any > COMMIT has made it to both the master and the slave at commit time. No lag > whatsoever. Agreed. > You're considering lag as an option in case of failure, but I don't see this > as acceptable when you need sync commit. In case of network timeout, cluster > is down. So you want to either continue servicing in degraged mode or get the > service down while you repair the cluster, but neither of those choice can be > transparent to the admins, I'd argue. > > Of course, main use case is high availability, which tends to say you do not > have the option to stop service, We have a number of choices, at the point of failure: * Does the whole primary server stay up (probably)? * Do we continue to allow new transactions in degraded mode? (which increases the risk of transaction loss if we continue at that time). (The answer sounds like it will be "of course, stupid" but this cluster may be part of an even higher level HA mechanism, so the answer isn't always clear). * For each transaction that is trying to commit: do we want to wait forever? If not, how long? If we stop waiting, do we throw ERROR, or do we say, let's get on with another transaction. If the server is up, yet all connections in a session pool are stuck waiting for their last commits to complete, then most sysadmins would agree that the server is actually "down". Since no useful work is happening, or can be initiated - even read only. We don't need to address that issue in the same way for all transactions, is all I'm saying. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [080910 06:18]: > We have a number of choices, at the point of failure: > * Does the whole primary server stay up (probably)? The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > * Do we continue to allow new transactions in degraded mode? (which > increases the risk of transaction loss if we continue at that time). > (The answer sounds like it will be "of course, stupid" but this cluster > may be part of an even higher level HA mechanism, so the answer isn't > always clear). The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > * For each transaction that is trying to commit: do we want to wait > forever? If not, how long? If we stop waiting, do we throw ERROR, or do > we say, lets get on with another transaction. The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > If the server is up, yet all connections in a session pool are stuck > waiting for their last commits to complete then most sysadmins would > agree that the server is actually "down". Since no useful work is > happening, or can be initiated - even read only. We don't need to > address that issue in the same way for all transactions, is all I'm > saying. Sorry to sound like a broken record here, but the whole point is to guarantee data safety. You can only start trading ACID for HA if you have the ACID guarantees in the first place (and for replication, this means across the cluster, including slaves). So in that light, I think it's pretty obvious that if a slave is considered part of an active synchronous replication cluster, in the face of "network lag", or even network failure, the master *must* pretty much halt all new commits in their tracks until that slave acknowledges the commit. Yes, that's going to cause a backup. That's the cost of synchronous replication. But that means the admin has to be able to control whether a slave is part of an active synchronous replication cluster or not. I hope that control eventually is a lot more than a GUC that says "when a slave is X seconds behind, abandon him"). I'd dream of a "replication" interface where I could add new slaves on the fly (and a nice tool that does pg_start_backup()/sync/apply WAL to sync and then subscribe), get slave status (maybe syncing/active/abandoned), and some average latency (i.e. something like svctm of iostat on your WAL disk) and some way to control the slave degradation from active to abandoned (like the above GUC, or maybe a callout/hook/script that runs when latency > X, etc, or both). And for async replication, you just have a "proxy" slave which does nothing but subscribe to your master, always acknowledge WAL right away so the master doesn't wait, and keep a local backlog of WAL it's sending out to many clients. This proxy slave doesn't slow down the master, but can feed clients across slow WAN links (that may not have the burst bandwidth to keep up with bursty master writes, but have aggregate bandwidth to keep pretty close to the master), or networks that drop out for a period, etc. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, 2008-09-10 at 09:36 -0400, Aidan Van Dyk wrote: > * Simon Riggs <simon@2ndQuadrant.com> [080910 06:18]: > > > We have a number of choices, at the point of failure: > > * Does the whole primary server stay up (probably)? > > The only sane choice is the one the admin makes. Any "predetermined" choice > PG makes can (and will) be wrong in some situations. We are in agreement then. Those questions were listed as arguments in favour of a parameter to let the sysadmin choose. More than that, I was saying this can be selected for individual transactions, not just for the server as a whole (as other vendors do). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner <markus@bluegap.ch> wrote: > Hi, > > ITAGAKI Takahiro wrote: >> >> Signals and locking, borrowed from Postgres-R, are now studied >> for the purpose in the log shipping, > > Cool. Let me know if you have any questions WRT this imessages stuff. If you're sure it's all right, I have a trivial question. Which signal should we use for the notification to the backend from the WAL sender? The obvious signals are already in use. Or, since a backend doesn't need to wait on select(), unlike the WAL sender, ISTM that it's not so inconvenient to use a semaphore for that notification. Your thoughts? regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, Fujii Masao wrote: > On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Hi, >> >> ITAGAKI Takahiro wrote: >>> Signals and locking, borrowed from Postgres-R, are now studied >>> for the purpose in the log shipping, >> Cool. Let me know if you have any questions WRT this imessages stuff. > > If you're sure it's all right, I have a trivial question. Well, I know it works for me and I think it could work for you, too. That's all I'm saying. > Which signal should we use for the notification to the backend from > WAL sender? The notable signals are already used. I'm using SIGUSR1, see src/backend/storage/ipc/imsg.c from Postgres-R, line 232. That isn't in use for backends or the postmaster, AFAIK. > Or, since a backend don't need to wait on select() unlike WAL sender, > ISTM that it's not so inconvenient to use a semaphore for that notification. They probably could, but not the WAL sender. What's the benefit of semaphores? It seems pretty ugly to set up a semaphore, lock that on the WAL sender, then claim it on the backend to wait for it, and then release it on the WAL sender to notify the backend. If all you want to do is to signal the backend, why not use signals ;-) But maybe I'm missing something? Regards Markus Wanner
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > My sequence covers several cases : > > * There is no missing WAL file. > * There is a lot of missing WAL file. This is the likely case for any medium+ sized database. > * There are missing history files. Failover always generates the gap > of > history file because TLI is incremented when archive recovery is > completed. Yes, but failover doesn't happen while we are configuring replication, it can only happen after we have configured replication. It would be theoretically possible to take a copy from one server and then try to synchronise with a 3rd copy of the same server, but that seems perverse and bug-prone. So I advise that we only allow replication when the timeline of the standby matches the timeline of the master, having it as an explicit check. > In your design, does not initial setup block the master? > Does your design cover above-mentioned case? The way I described it does not block the master. It does defer the point at which we can start using synchronous replication, so perhaps that is your objection. I think it is acceptable: good food takes time to cook. I have thought about the approach you've outlined, though it seems to me now like a performance optimisation rather than something we must have. IMHO it will be confusing to be transferring both old and new data at the same time from master to slave. We will have two different processes sending and two different processes receiving. You'll need to work through about four times as many failure modes, all of which will need testing. Diagnosing problems in it via the log hurts my head just thinking about it. ISTM that will severely impact the initial robustness of the software for this feature. Perhaps in time it is the right way. Anyway, feels like we're getting close to some good designs. There isn't much difference between what we're discussing here. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, Sep 10, 2008 at 11:13 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Which signal should we use for the notification to the backend from >> WAL sender? The notable signals are already used. > > I'm using SIGUSR1, see src/backend/storage/ipc/imsg.c from Postgres-R, line > 232. That isn't is use for backends or the postmaster, AFAIK. Umm... backends already use SIGUSR1. PostgresMain() sets up a signal handler for SIGUSR1 as follows. pqsignal(SIGUSR1, CatchupInterruptHandler); Which signal should the WAL sender send to backends? >> Or, since a backend don't need to wait on select() unlike WAL sender, >> ISTM that it's not so inconvenient to use a semaphore for that >> notification. > > They probably could, but not the WAL sender. Yes, since the WAL sender waits on select(), it's convenient to use a signal for notification *from backends to the WAL sender*, I think, too. Best regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
"Fujii Masao" <masao.fujii@gmail.com> writes: > Which signal should WAL sender send to backends? Sooner or later we shall have to bite the bullet and set up a multiplexing system to transmit multiple event types to backends with just one signal. We already did it for signals to the postmaster. regards, tom lane
Hi, Fujii Masao wrote: > Umm... backends have already used SIGUSR1. PostgresMain() sets up a signal > handler for SIGUSR1 as follows. Uh.. right. Thanks for pointing that out. Maybe just use SIGPIPE for now? > Yes, since WAL sender waits on select(), it's convenient to use signal > for the notification *from backends to WAL sender*, I think too. ..and I'd say you'd better use the same for WAL sender to backend communication, just for the sake of simplicity (and thus maintainability). Regards Markus Wanner
Hi, Tom Lane wrote: > Sooner or later we shall have to bite the bullet and set up a > multiplexing system to transmit multiple event types to backends with > just one signal. We already did it for signals to the postmaster. Agreed. However, it's non-trivial if you want reliable queues (i.e. no message skipped, as can happen with signals) for varying message sizes. My imessages stuff is certainly not perfect yet. But it works to some extent and provides exactly that functionality. However, I'd be happy to work on improving it, if other projects start using it as well. Anybody else interested? Use cases within Postgres itself as of now? Regards Markus Wanner
On Thu, Sep 11, 2008 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > >> My sequence covers several cases : >> >> * There is no missing WAL file. >> * There is a lot of missing WAL file. > > This is the likely case for any medium+ sized database. I'm sorry, but I could not understand what you mean. > >> * There are missing history files. Failover always generates the gap >> of >> history file because TLI is incremented when archive recovery is >> completed. > > Yes, but failover doesn't happen while we are configuring replication, > it can only happen after we have configured replication. It would be > theoretically possible to take a copy from one server and then try to > synchronise with a 3rd copy of the same server, but that seems perverse > and bug prone. So I advise that we only allow replication when the > timeline of the standby matches the timeline of the master, having it as > an explicit check. Umm... my explanation seems to have been unclear :( Here is the case which I'm assuming. 1) Replication is configured, i.e. the master and the slave work fine. 2) The master goes down, then failover happens. When the slave becomes the master, the TLI is incremented, and a new history file is generated. 3) In order to catch up with the new master, the server which was originally the master needs the missing history file, because at this point there is a gap in TLI between the two servers. I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, configuration after failover always needs to take a base backup from the new master. That's unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files from the master to the slave. >> In your design, does not initial setup block the master? >> Does your design cover above-mentioned case? > > The way I described it does not block the master. It does defer the > point at which we can start using synchronous replication, so perhaps > that is your objection. I think it is acceptable: good food takes time > to cook. Yes. I understood your design. > IMHO it will be confusing to be transferring both old and new data at > the same time from master to slave. We will have two different processes > sending and two different processes receiving. You'll need to work > through about four times as many failure modes, all of which will need > testing. Diagnosing problems in it via the log hurts my head just > thinking about it. ISTM that will severely impact the initial robustness > of the software for this feature. Perhaps in time it is the right way. In my procedure, old WAL files are copied by the admin using scp, rsync or another external tool. So, I don't think that my procedure makes the problem more difficult. Since there are many setup cases, we should not leave all procedures to postgres, I think. > Anyway, feels like we're getting close to some good designs. There isn't > much difference between what we're discussing here. Yes. Thank you for your great ideas. -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.

Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.

I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Markus Wanner <markus@bluegap.ch> writes:
> Tom Lane wrote:
>> Sooner or later we shall have to bite the bullet and set up a multiplexing system to transmit multiple event types to backends with just one signal. We already did it for signals to the postmaster.
>
> Agreed. However, it's non-trivial if you want reliable queues (i.e. no messages skipped, as can happen with signals) for varying message sizes.

No, that's not what I had in mind at all, just the ability to deliver one of a specified set of event notifications --- ie, get around the fact that Unix only gives us two user-definable signal types. For signals sent from other backends, it'd be sufficient to put a bitmask field into PGPROC entries, which the sender could OR bits into before sending the one "real" signal event (either SIGUSR1 or SIGUSR2).

I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here. (Also, I gather that the Windows implementation could already support a bunch more signal types without much trouble.)

regards, tom lane
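To make the bitmask idea a bit more concrete, here is a minimal single-process sketch. The PROCSIG_* reason names are invented for illustration, and a plain global stands in for the shared-memory PGPROC field that a real implementation would use:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Illustrative "reason" bits; the names are made up for this sketch. */
    #define PROCSIG_WAL_SEND   (1 << 0)
    #define PROCSIG_CATCHUP    (1 << 1)

    /*
     * Stand-in for a field in the recipient's PGPROC entry.  In a real
     * implementation it would live in shared memory and be written by other
     * processes (with an atomic OR or under a spinlock); a plain global keeps
     * this sketch runnable as a single process.
     */
    static volatile sig_atomic_t pending_reasons = 0;

    static void
    handle_sigusr1(int signo)
    {
        (void) signo;
        /* Real code would do no work here; the process inspects and clears
         * pending_reasons later, at a safe point. */
    }

    /* Sender side: OR in a reason bit, then send the one "real" signal. */
    static void
    send_multiplexed(pid_t pid, int reason)
    {
        pending_reasons |= reason;
        kill(pid, SIGUSR1);
    }

    int
    main(void)
    {
        signal(SIGUSR1, handle_sigusr1);

        /* Pretend some other backend asks us to do WAL-send work. */
        send_multiplexed(getpid(), PROCSIG_WAL_SEND);

        if (pending_reasons & PROCSIG_WAL_SEND)
            printf("woken for a WAL send request\n");
        if (pending_reasons & PROCSIG_CATCHUP)
            printf("woken for catchup\n");
        pending_reasons = 0;        /* clear after acting on the bits */
        return 0;
    }

The only signal that actually travels is SIGUSR1; the bitmask tells the recipient why it was signalled.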
Hi,

Tom Lane wrote:
> No, that's not what I had in mind at all, just the ability to deliver one of a specified set of event notifications --- ie, get around the fact that Unix only gives us two user-definable signal types.

Ah, okay. And I already thought you'd like imessages :-(

> For signals sent from other backends, it'd be sufficient to put a bitmask field into PGPROC entries, which the sender could OR bits into before sending the one "real" signal event (either SIGUSR1 or SIGUSR2).

That might work for expanding the number of available signals, yes.

Regards

Markus Wanner
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here.

Such as signals coming from the postmaster? Isn't that where most of them come from though?

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here.
>
> Such as signals coming from the postmaster? Isn't that where most of them come from though?

Uh.. no, such as signals *going to* the postmaster. That's where we already have such a multiplexer in place, but not the other way around IIRC.

Regards

Markus Wanner
On Thu, 2008-09-11 at 18:17 +0300, Heikki Linnakangas wrote:
> Fujii Masao wrote:
>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>
> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>
> I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.

If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 00:03 +0900, Fujii Masao wrote:
> In my procedure, old WAL files are copied by the admin using scp, rsync or another external tool. So, I don't think that my procedure makes the problem more difficult. Since there are many setup cases, we should not leave all procedures to postgres, I think.

So the procedure is:

1. Startup WALReceiver to begin receiving WAL
2. Do some manual stuff
3. Initiate recovery

So either:

* WALReceiver is not started by the postmaster. I don't think it's acceptable that WALReceiver is not under the postmaster. You haven't reduced the number of failure modes by doing that, you've just swept the problem under the carpet and pretended it's not Postgres' problem.

* Postgres startup requires some form of manual process, as an **intermediate** stage.

I don't think either of those is acceptable. It must just work.

Why not:

1. Same procedure as Warm Standby now
   a) WAL archiving to standby starts
   b) base backup

2. Startup standby, with an additional option to stream WAL. WALReceiver starts and connects to the primary. The primary issues a log switch. The archiver turns itself off after sending that last file. WALSender starts streaming the current WAL immediately after the log switch.

3. The startup process on the standby begins reading WAL from the point mentioned by backup_label. When it gets to the last logfile shipped by the primary's archiver, it switches to reading WAL files written by WALReceiver.

So all automatic. Uses existing code. Synchronous replication starts immediately. Also has the advantage that we do not get WAL bloat on the primary. Configuration is almost identical to the current Warm Standby, so little change for existing Postgres sysadmins.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
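For comparison, today's warm-standby setup is driven entirely by archive_command and restore_command; step 2 above would presumably add one more knob on the standby side. The streaming parameter below is purely hypothetical (only the first two settings exist today), and the hosts and paths are examples:

    # postgresql.conf on the primary (existing warm-standby configuration)
    archive_mode = on
    archive_command = 'scp %p standby:/var/lib/pgsql/walarchive/%f'

    # recovery.conf on the standby (existing)
    restore_command = 'cp /var/lib/pgsql/walarchive/%f %p'

    # hypothetical new setting for step 2 -- no such parameter exists yet
    stream_wal_from = 'host=primary port=5432'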
Simon Riggs wrote:
> If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

Depends on the transaction volume and database size, of course. It's actually not any different from the scenario where the slave goes offline for some reason. You have the same decision there of how long to keep the WAL files on the master, in case the slave wakes up. I think we'll need an option to specify a maximum for the number of WAL files to keep around. The DBA should set that to the size of the WAL drive, minus some safety factor.

> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>
> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

We have it for WAL shipping, in the form of the GUC "archive_command" :)

Why not add full_backup_command ?

-------------- Hannu
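To sketch the analogy, such a setting might look like the following. The parameter name and the %d placeholder are hypothetical (only archive_command exists today), and the rsync invocation is just one way to ship only changed blocks, as Simon suggests below:

    # hypothetical, by analogy with archive_command;
    # assume %d expands to the standby's data directory
    full_backup_command = 'rsync -a --delete master:/var/lib/pgsql/data/ %d/'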
On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>
> I think we'll need an option to specify a maximum for the number of WAL files to keep around. The DBA should set that to the size of the WAL drive, minus some safety factor.
>
>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>
> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 17:24 +0300, Hannu Krosing wrote:
> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>
> We have it for WAL shipping, in the form of the GUC "archive_command" :)
>
> Why not add full_backup_command ?

I see the current design is all master-push centered, i.e. the master is in control of everything WAL related. That makes it hard to create a slave which is simply pointed to the server and takes all its data from there...

Why not have a design where the slave is in control of its own data? I mean the slave could ask for the base files (possibly through a special function deployed on the master), then ask for the WAL stream and so on. That would easily let a slave cascade too, as it could relay the WAL stream and serve the base backup too... or have a special WAL repository software with the same interface as a normal master, but with a choice of base backups and WAL streams. Plus, a slave-in-control approach would also allow multiple slaves at the same time for a given master...

The way it would work would be something like:

* configure the slave with a postgres connection to the master;
* the slave will connect and set up some metadata on the master identifying itself and telling the master to keep the WAL needed by this slave, and also get some metadata about the master's details if needed;
* the slave will call a special function on the master and ask for the base backup to be streamed (potentially compressed with special knowledge of postgres internals);
* once the base backup is streamed, or possibly in parallel, ask for streaming of the WAL files;
* when the base backup is finished, start applying the WAL stream, which is cached in the meantime, while its streaming continues;
* keep the master updated about the state of the slave, so the master can know if it needs to keep the WAL files which were not yet streamed;
* in case of a network error, the slave connects again and starts to stream the WAL from where it left off;
* in case of an extended network outage, the master could decide to unsubscribe the slave when a certain timeout happens;
* when the slave finds itself unsubscribed after a longer disconnection, it could ask for a new base backup based on differences only... some kind of built-in rsync thingy;

The only downside of this approach is that the slave machine needs a full postgres superuser connection to the master. That could be a security problem in certain scenarios. The master-centric scenario needs a connection in the other direction, which might be seen as more secure, I don't know for sure...

Cheers, Csaba.
Simon Riggs wrote:
> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

Yes, what I meant is that it would be cool to have that functionality built-in, so that you wouldn't need to configure extra rsync scripts and authentication etc.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.
>
> Yes, what I meant is that it would be cool to have that functionality built-in, so that you wouldn't need to configure extra rsync scripts and authentication etc.

If this were a nice pluggable library I'd agree, but AFAIK it's not, and I don't see great value in reinventing the wheel.

cheers

andrew
Csaba Nagy wrote:
> Why not have a design where the slave is in control of its own data? I mean the slave could ask for the base files (possibly through a special function deployed on the master), then ask for the WAL stream and so on. That would easily let a slave cascade too, as it could relay the WAL stream and serve the base backup too... or have a special WAL repository software with the same interface as a normal master, but with a choice of base backups and WAL streams. Plus, a slave-in-control approach would also allow multiple slaves at the same time for a given master...

I totally agree with that.

> The only downside of this approach is that the slave machine needs a full postgres superuser connection to the master. That could be a security problem in certain scenarios.

I think the master-slave protocol needs to be separate from the normal FE/BE protocol, with commands like "send a new base backup", or "subscribe to new WAL that's generated". A master-slave connection isn't associated with any individual database, for example. We can keep the permissions required for establishing a master-slave connection different from superuser-ness. In particular, while the slave will be able to read all data from the whole cluster, by receiving it in the WAL and base backups, it doesn't need to be able to modify anything on the master.

> The master-centric scenario needs a connection in the other direction, which might be seen as more secure, I don't know for sure...

Which one initiates the connection, the master or the slave, is a different question. I believe we've all assumed that it's the slave that connects to the master, and I think that makes the most sense.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
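Purely to give the idea some shape, a hypothetical set of message types for such a protocol could look like this; none of these names exist anywhere, it is only a sketch of what a non-FE/BE replication protocol might carry:

    /* Hypothetical master-slave protocol messages -- illustration only. */
    #include <stdint.h>

    typedef enum ReplMsgType
    {
        REPL_MSG_IDENTIFY,          /* standby introduces itself */
        REPL_MSG_BASE_BACKUP_REQ,   /* "send a new base backup" */
        REPL_MSG_BASE_BACKUP_DATA,  /* chunk of base backup from the master */
        REPL_MSG_WAL_SUBSCRIBE,     /* "subscribe to new WAL", with a start LSN */
        REPL_MSG_WAL_DATA,          /* streamed WAL records */
        REPL_MSG_WAL_ACK            /* standby acknowledges its flushed LSN */
    } ReplMsgType;

    typedef struct ReplMsgHeader
    {
        ReplMsgType type;
        uint32_t    length;         /* length of the payload that follows */
    } ReplMsgHeader;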
Hi,

Andrew Dunstan wrote:
> If this were a nice pluggable library I'd agree, but AFAIK it's not, and I don't see great value in reinventing the wheel.

I certainly agree. However, I thought of it more like the archive_command, as proposed by Hannu. That way we don't need to reinvent any wheel and still the standby could trigger the base data synchronization itself.

Regards

Markus Wanner
On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
> Why not have a design where the slave is in control of its own data? I mean the slave...

The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?

It sounds cool, but it's not practical. I posted a workable suggestion today on another subthread.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 17:45 +0100, Simon Riggs wrote:
> On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
>> Why not have a design where the slave is in control of its own data? I mean the slave...
>
> The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?

As I understand it, Csaba meant that the slave would "bootstrap itself" by connecting to the master in some early phase of startup, requesting a physical filesystem-level copy of the data, then commencing the startup in Hot Standby mode.

If done that way, all the slave needs is a superuser-level connection to the master database. Of course this can also be done using a little hot-standby startup script on the slave, if shell access to the master is provided.

------------------ Hannu
Hannu Krosing wrote:
> On Fri, 2008-09-12 at 17:45 +0100, Simon Riggs wrote:
>> On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
>>> Why not have a design where the slave is in control of its own data? I mean the slave...
>>
>> The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?
>
> As I understand it, Csaba meant that the slave would "bootstrap itself" by connecting to the master in some early phase of startup, requesting a physical filesystem-level copy of the data, then commencing the startup in Hot Standby mode.

Interesting ... This doesn't seem all that difficult -- all you need is to start one connection to get the WAL stream and save it somewhere; meanwhile a second connection uses a combination of pg_file_read on the master + pg_file_write on the slave to copy the data files over. When this step is complete, recovery of the stored WAL commences.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Sep 12, 2008 at 7:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2008-09-11 at 18:17 +0300, Heikki Linnakangas wrote:
>> Fujii Masao wrote:
>>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>>
>> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>>
>> I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.
>
> If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

No. The original master doesn't need all the WAL files. It only needs the WAL file that its pg_control points to as the latest checkpoint location, plus the subsequent files.

> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

PITR doesn't always need a base backup. We can do PITR from the data files left just after a crash, if they aren't corrupted (i.e. it was not a media crash). Depending on the situation, most users would like to choose the setup procedure with the smaller impact on the cluster. They would choose the procedure without a base backup if there are only a few WAL files to be replayed. Meanwhile, they would use a base backup if the indispensable WAL files have already been deleted. But in that case, they might not take a new base backup and instead use an old one (e.g. taken two days before).

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 12, 2008 at 12:17 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>
> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.

Yes.

If the master fails after writing WAL to disk and before sending it to the slave, at least the latest WAL file would be inconsistent between the two servers. So, regardless of whether a base backup is used, the setup procedure needs to delete those inconsistent WAL files or overwrite them.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Simon Riggs wrote:
> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>>
>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>
> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

Why rsync? Just compare the LSNs ...

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
> Simon Riggs wrote:
>> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>>>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>>>
>>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>>
>> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.
>
> Why rsync? Just compare the LSNs ...

True, that's much better. Only works for data files, though, so we'll still need something else for clog etc. But the volume of the other stuff is much smaller, so I suppose we don't need to bother delta-compressing it.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
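To illustrate the LSN comparison, here is a standalone sketch that scans one relation file and reports which blocks are newer than a given cutoff. It assumes the standard 8 kB block size, that the page LSN is stored as two 32-bit values at the very start of each page header (the 8.3-era layout), and that the file is read on the host that wrote it; a real implementation would use the server's own page macros rather than re-deriving the format like this:

    /*
     * Standalone sketch: report which 8 kB blocks of a relation file have a
     * page LSN newer than a given cutoff.  Assumes the page LSN is the first
     * 8 bytes of the page header (xlogid, xrecoff as two uint32s).
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ 8192

    int
    main(int argc, char **argv)
    {
        if (argc != 4)
        {
            fprintf(stderr, "usage: %s datafile cutoff_xlogid cutoff_xrecoff\n", argv[0]);
            return 1;
        }

        FILE       *fp = fopen(argv[1], "rb");
        uint64_t    cutoff = ((uint64_t) strtoul(argv[2], NULL, 0) << 32) |
                             (uint64_t) strtoul(argv[3], NULL, 0);
        unsigned char page[BLCKSZ];
        long        blkno = 0;

        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            uint32_t    xlogid, xrecoff;

            memcpy(&xlogid, page, 4);       /* pd_lsn.xlogid */
            memcpy(&xrecoff, page + 4, 4);  /* pd_lsn.xrecoff */

            if (((((uint64_t) xlogid) << 32) | xrecoff) > cutoff)
                printf("block %ld changed (LSN %X/%X)\n",
                       blkno, (unsigned) xlogid, (unsigned) xrecoff);
            blkno++;
        }

        fclose(fp);
        return 0;
    }

Shipping would then only need to transfer the reported blocks, plus the non-data files such as clog that don't carry LSNs.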
Fujii Masao wrote:
> On Fri, Sep 12, 2008 at 12:17 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>
> Yes.
>
> If the master fails after writing WAL to disk and before sending it to the slave, at least the latest WAL file would be inconsistent between the two servers. So, regardless of whether a base backup is used, the setup procedure needs to delete those inconsistent WAL files or overwrite them.

And if you're unlucky, the changes in the latest WAL file might already have been flushed to the data files as well.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
> Why not:
>
> 1. Same procedure as Warm Standby now
>    a) WAL archiving to standby starts
>    b) base backup
>
> 2. Startup standby, with an additional option to stream WAL. WALReceiver starts and connects to the primary. The primary issues a log switch. The archiver turns itself off after sending that last file. WALSender starts streaming the current WAL immediately after the log switch.
>
> 3. The startup process on the standby begins reading WAL from the point mentioned by backup_label. When it gets to the last logfile shipped by the primary's archiver, it switches to reading WAL files written by WALReceiver.
>
> So all automatic. Uses existing code. Synchronous replication starts immediately. Also has the advantage that we do not get WAL bloat on the primary. Configuration is almost identical to the current Warm Standby, so little change for existing Postgres sysadmins.

I totally agree. Requiring the master to be down for a significant time to add a slave isn't going to keep people happy very long.

We have the technology now to allow warm standby slaves by using PITR, and it seems a similar system can be used to set up slaves, and for cases when the slave drops off and has to rejoin. The slave can use the existing 'restore_command' command to pull all the WAL files it needs, and then the slave needs to connect to the master and say it is ready for WAL. The master is going to need to send perhaps everything from the start of the current WAL file so the slave is sure to get all changes during the switch from 'restore_command' to network-passed WAL.

I can imagine the slave going in and out of network connectivity, as long as the required PITR files are still available.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-09-09 at 09:11 +0100, Simon Riggs wrote:
> This gives us the Group Commit feature also, even if we are not using replication. So we can drop the commit_delay stuff.
>
> XLogBackgroundFlush() processes a data page at a time if it can. That may not be the correct batch size for XLogBackgroundSend(), so we may need a tunable for the MTU. Under heavy load we need the Write and Send to act in a way that maximises throughput rather than minimises response time, as we do now.
>
> If wal_buffers overflows, we continue to hold WALInsertLock while we wait for WALWriter and WALSender to complete.
>
> We should increase the default wal_buffers to 64.
>
> After (or during) XLogInsert, backends will sleep in a proc queue, similar to LWLocks and protected by a spinlock. When preparing to write/send, the WAL process should read the proc at the *tail* of the queue to see what the next LogwrtRqst should be. Then it performs its action and wakes procs up starting with the head of the queue. We would add an LSN field into PGPROC, so WAL processes can check whether the backend should be woken. The LSN field can be accessed without spinlocks since it is only ever set by the backend itself and only read while the backend is sleeping. So we take the spinlock, find the tail, drop the spinlock, then read the LSN of the backend that (was) the tail.

I left off mentioning one other aspect of "Group Commit" behaviour that is possible with the above design.

If we use a proc queue, then we only wake up the *first* backend on the queue. That lets other WAL processes continue quickly. The reason for doing this is that the first backend can walk the commit queue collecting xids. When we update the ProcArray we can then update multiple backends' entries with a single request, rather than forcing all of the backends to queue up individually for the exclusive lock. When the first backend has updated the ProcArray, all of the updated backends will be released at once. Doing it that way will significantly reduce the number of exclusive lock requests for commits, which is the main source of contention on the ProcArray.

So that puts batch-setting behaviour in place for WALWriteLock and ProcArrayLock. And I'm submitting a patch for batch setting of clog entries around ClogControlLock. So we should get a scalability boost from all of this.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
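To make the queue mechanics a little more concrete, here is a self-contained sketch of the flush-then-release-up-to-LSN queue using plain pthreads. Everything in it is illustrative: the real version would use PGPROC entries, spinlocks and semaphores in shared memory, and the refinement where only the first backend is woken so it can batch the ProcArray updates is left out:

    /*
     * Standalone sketch of the proc-queue idea above, using pthreads in one
     * process instead of PGPROC entries, spinlocks and semaphores in shared
     * memory.  All names are illustrative only.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct WaitProc
    {
        uint64_t         lsn;       /* set only by the waiting backend itself */
        int              done;
        struct WaitProc *next;
    } WaitProc;

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
    static WaitProc *head = NULL, *tail = NULL;
    static uint64_t  flushed_lsn = 0;
    static volatile int queued = 0;     /* driver bookkeeping only */

    /* Backend side: enqueue myself and sleep until my LSN has been flushed. */
    static void
    wait_for_flush(WaitProc *me, uint64_t my_lsn)
    {
        me->lsn = my_lsn;
        me->done = 0;
        me->next = NULL;
        pthread_mutex_lock(&queue_lock);
        if (tail)
            tail->next = me;
        else
            head = me;
        tail = me;
        queued++;
        while (!me->done)
            pthread_cond_wait(&queue_cond, &queue_lock);
        pthread_mutex_unlock(&queue_lock);
    }

    /* WAL-writer side: flush up to the tail's request, then release from the head. */
    static void
    flush_and_wake(void)
    {
        pthread_mutex_lock(&queue_lock);
        if (tail == NULL)
        {
            pthread_mutex_unlock(&queue_lock);
            return;
        }
        uint64_t target = tail->lsn;        /* the next LogwrtRqst */
        pthread_mutex_unlock(&queue_lock);

        /* ... the actual WAL write/send/fsync up to 'target' goes here ... */
        flushed_lsn = target;

        pthread_mutex_lock(&queue_lock);
        while (head && head->lsn <= flushed_lsn)
        {
            head->done = 1;
            head = head->next;
        }
        if (head == NULL)
            tail = NULL;
        pthread_cond_broadcast(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    static void *
    backend_main(void *arg)
    {
        WaitProc me;
        uint64_t lsn = (uint64_t) (uintptr_t) arg;

        wait_for_flush(&me, lsn);
        printf("backend with commit LSN %llu released\n", (unsigned long long) lsn);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t backends[3];

        for (int i = 0; i < 3; i++)
            pthread_create(&backends[i], NULL, backend_main,
                           (void *) (uintptr_t) ((i + 1) * 100));

        /* crude driver: wait until all three have queued, then one group flush */
        while (queued < 3)
            usleep(1000);
        flush_and_wake();

        for (int i = 0; i < 3; i++)
            pthread_join(backends[i], NULL);
        return 0;
    }

One flush call releases all three waiters at once, which is the batching effect described above; the leader-walks-the-queue ProcArray optimisation would sit on top of this.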