Design for Synchronous Replication/ WAL Streaming - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Design for Synchronous Replication/ WAL Streaming |
Date | |
Msg-id | 1219115620.5343.896.camel@ebony.2ndQuadrant |
List | pgsql-hackers |
For various reasons others have not been able to discuss detailed designs in public. In an attempt to provide assistance with that I'm providing my design notes here - not hugely detailed, just rough sketches of how it can work. This may also help identify coordination points and help to avert the code equivalent of a traffic jam later in this release cycle.

Sync rep consists of 3 main parts:
* WAL sending
* WAL transmitting
* WAL receiving

WAL apply is essentially the same, so isn't discussed here.

WAL sending - would be achieved by having WAL writer issue calls to transmit data. Individual backends would perform XLogInsert() to insert a commit WAL record, then queue themselves up to wait for WAL writer to perform the transmit up to the desired LSN (according to parameter settings for synchronous_commit etc). The local WAL write and WAL transmit would be performed together by the WAL writer, who would then wake up backends once the log has been written as far as the requested LSN. Very similar code to LWlocks, but queued on LSN, not lock arrival. Should be possible to make queue in strict LSN order to avoid complexity on wake-up. This then provides Group Commit feature at same time as ensuring efficient WAL transmit.

WAL transmit - network layer is handled by plugin, as suggested by Itagaki/Koichi. Requirements are efficient transfer of WAL, similar configurability to other aspects of Postgres, including security. Various approaches possible
* direct connect using new protocol
* implement slight protocol changes into standard PostgreSQL client, similar to COPY streaming, just with slightly different initiation. Allows us to use same config, security options as now with postmaster handling initial connection.

Plugin architecture allows integration with various vendor supplied options. Hopefully Postgres gets working functionality as default.

WAL receiving - separate process on standby server.
Started by an option in recovery.conf to receive streaming WAL rather than use files. Separation of Startup process from WALReceiver process required to ensure fast response to incoming network packets without slowing down WAL apply, which needs to go fast to keep up with stream.

WALReceiver process would receive WAL and then write it to WAL buffers and also to disk in the normal WAL files. Data buffered in WAL buffers allows Startup process to read data within ReadRecord() from shared memory rather than from files, so minimising changes required for Startup process. Writing to WAL buffers also allows addition of a WAL bgreader process that can pre-fetch buffers required later for WAL apply. (That was a point of discussion previously, but it's not a huge part of the design and can be added as a performance feature fairly late, if we need it).

Data is written to disk to ensure the standby node can restart from last restartpoint if it should crash, re-reading all WAL files and then beginning to receive WAL from remote primary again. Files written and cleaned up in exactly same way as on normal server: keep last two restartpoints' worth of xlogs, then cleanup at restartpoint time.

Integration point between this and Hot Standby is around postmaster states and when the WALReceiver starts. That is the same time I expect the bgwriter to start, so I will submit patch in next few days to get that aspect sorted out.

If anybody is going to refactor xlog.c to avoid collisions, it had better happen in next couple of weeks. Probably has to be Tom that does this. Suggested splits:
* xlog stuff that happens in normal backends (some changes for WAL streaming)
* recovery architecture stuff StartupXlog etc, checkpoints
* redo apply (major changes for WAL streaming)
* xlog rmgr stuff

Also need to consider how the primary node acts when standby is not available. Should it hang, waiting a certain time for recovery, or should it continue to run in degraded mode? Probably another parameter.
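
To make the LSN-ordered wait queue from the WAL sending part a little more concrete, here is a minimal sketch. It is in Python purely for brevity and is only an illustration under stated assumptions: the names LsnWaitQueue, Waiter, enqueue, advance and sent_upto are made up for this example and are not PostgreSQL code, LSNs are plain integers, and the actual sleeping and waking of backends is left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Waiter:
    """Hypothetical queue entry: one backend waiting on its commit LSN."""
    lsn: int                                 # LSN of the backend's commit record
    backend_id: int = field(compare=False)   # not part of the ordering


class LsnWaitQueue:
    """Hypothetical wait queue ordered on LSN rather than on arrival."""

    def __init__(self):
        self._heap = []       # min-heap, so the head is always the lowest LSN
        self.sent_upto = 0    # highest LSN written locally and transmitted

    def enqueue(self, backend_id, lsn):
        """Called by a backend after inserting its commit record.

        Returns True if the backend has to wait, False if the log has
        already been written and transmitted past its LSN.
        """
        if lsn <= self.sent_upto:
            return False
        heapq.heappush(self._heap, Waiter(lsn, backend_id))
        return True

    def advance(self, new_lsn):
        """Called by the WAL writer after one write-and-transmit cycle.

        Returns the backends to wake, lowest LSN first. Because the queue
        is in LSN order, wake-up only ever looks at the head of the queue.
        """
        self.sent_upto = max(self.sent_upto, new_lsn)
        woken = []
        while self._heap and self._heap[0].lsn <= self.sent_upto:
            woken.append(heapq.heappop(self._heap).backend_id)
        return woken


queue = LsnWaitQueue()
queue.enqueue(1, 300)
queue.enqueue(2, 100)
queue.enqueue(3, 200)
print(queue.advance(250))   # backends 2 and 3 are released by a single cycle
print(queue.advance(400))   # backend 1 is released by the next cycle
```

One call to advance() releasing several backends at once is the Group Commit effect mentioned above. What happens to the waiters when the standby is not available (hang or carry on in degraded mode) is deliberately not modelled here.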

Anyway, all of the above is a strawman design to help everybody begin to understand how this might all fit together. No doubt there are other possible approaches. My personal concerns are that we minimise things that prevent various developers from working alongside each other on related features. So if the above design doesn't match what is being worked on, then at least let's examine where the integration points are, please.

I hope and expect others are working on the WAL streaming design and if something occurs to prevent that then I will provide time, singly or as part of a team, to ensure this happens for 8.4.

I'll be posting more design stuff over next few weeks on Hot Standby also.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support