Re: Built-in Raft replication - Mailing list pgsql-hackers
From | Hannu Krosing |
---|---|
Subject | Re: Built-in Raft replication |
Date | |
Msg-id | CAMT0RQS53KGPnygnKY129BaWs0k4R4fUuq=9jhdtViNo_7PJ6A@mail.gmail.com |
In response to | Re: Built-in Raft replication (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Built-in Raft replication |
List | pgsql-hackers |
On Wed, Apr 16, 2025 at 6:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
> > I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
>
> Hmm, OK. I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere. Still, that
> moves it from a large project to a really large project :-(

My understanding is that RAFT is a fancy name for what PostgreSQL is largely already doing - "electing" a leader and then doing all the changes through that leader (a.k.a. WAL streaming).

The thing that needs adding - and which makes it "RAFT" instead of just streaming replication with a failover - is what happens when the leader goes away and one of the replicas needs to become the new leader.

We have ways to do rollbacks and roll-forwards, but the main tricky part is "how do you know that you have not lost some changes". Here we must have some place to store the LSN at which the failover happened, so that we know to run pg_rewind if any lost hosts come back and want to rejoin.

And of course we need a way to communicate "who is the new leader" to clients, which needs new libpq functionality for "failover connection pools" that hide the failovers from old clients.

The RAFT protocol could be the provider of the "who is the current leader" info and optionally cache the LSN at which the switch happened.

> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

The main thing I would like to see in core is the ability to do clean *switchovers* (not failovers) by sending a magic WAL record with the message "hey node N, please take over the write node role", so that node N knows to self-promote and all other nodes know to start following N from the next WAL record on, either directly or via a cascading replica (a rough sketch of what such a record could carry follows below).

Why WAL? Because it is already sent to all replicas, and it guarantees continuity since it states very clearly at which LSN the write-head switch happens.

We should also be able to send this info to the client libraries currently connected to the writer, so that they can choose to switch to the new head.

The rest could easily be an extension. Mainly we want more than one "coordinator", which can be running on some or all of the nodes, and which agree on
- which node is the current leader
- at which LSN the switch happened (so that if a returning node discovers it has magically moved ahead, it knows to rewind to that LSN and then re-stream from the commonly agreed place)
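To make that "magic WAL record" a bit more concrete, here is a very rough sketch of what it could carry. None of these names exist today - the record type, the node-id fields and everything else are made up purely for illustration:

```c
#include "postgres.h"
#include "access/xlogdefs.h"        /* XLogRecPtr */
#include "datatype/timestamp.h"     /* TimestampTz */

/*
 * Purely hypothetical - no such WAL record exists today.  The old write
 * head would emit this as its last record, so every node streaming the
 * WAL (directly or via a cascading replica) sees the same switch point
 * at the same LSN.
 */
typedef struct xl_switchover
{
    uint64      old_node_id;    /* node that wrote this record */
    uint64      new_node_id;    /* "node N", asked to self-promote */
    XLogRecPtr  switch_lsn;     /* last LSN owned by the old write head */
    TimestampTz switch_time;    /* when the switchover was requested */
    bool        forced;         /* true when this is really a failover */
} xl_switchover;
```

A replica that sees new_node_id pointing at some other node would know to start following that node from the next record; node N itself would know to self-promote.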
It would also be nice to have some agreed, hopefully lightweight, notion of node identity, which we could then use for many things, including stamping it in WAL records to guarantee / verify that all the nodes have been on the same WAL stream all the time.

But regarding whether to use RAFT, I would just define a "coordinator API" and leave it up to the specific coordinator/consensus extension to decide how the consensus is achieved (a rough sketch of what I mean is at the end of this mail).

So to summarize:

# Core should provide

- a way to move the write role to a new node
  - for switchover, a WAL-based switchover
  - for failover, something similar which also writes the WAL record so all histories are synced
- a libpq message informing clients about the "new write head node"
- node IDs and more general cluster-awareness inside the PostgreSQL node (I had a shoutout about this in a recent pgconf.dev unconference talk)
- a new write-node field in WAL to track write-head movement
- an API for a joining node to find out which cluster it joins and the switchover history
  - in WAL it is always a switchover, maybe with some info saying that it was a forced switchover because we lost the old write head
  - if some lost node comes back, it may need to rewind or re-initialize if it finds out it had been following a lost timeline that is not fully part of the commonly agreed history

NOTE: switchovers in WAL would be very similar to timeline changes. I am not sure how much extra info is needed there.

# Extension can provide

- agreeing on a new leader node in case of failover
  - the protocol can be RAFT, PAXOS or "the DBA says so" :)
- sharing fresh info about the current leader and switch timelines (though this should more likely be in core)
- anything else ???

# External apps are (likely?) needed for

- setting up the cluster, provisioning machines / VMs
- setting up networking
- starting PostgreSQL servers
- spinning up and down clients
- communicating the current leader and replica set (could be done by DNS with agreed conventions)
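And to illustrate the kind of "coordinator API" I have in mind - again just a sketch with made-up names, nothing like this exists in core - a consensus extension would fill in a few callbacks and core would only consume the answers:

```c
#include "postgres.h"
#include "access/xlogdefs.h"        /* XLogRecPtr */

/*
 * Hypothetical coordinator API - not an existing interface.  The
 * extension decides how agreement is reached (RAFT, PAXOS or "the DBA
 * says so"); core only asks the questions.
 */
typedef struct ClusterCoordinator
{
    /* node ID of the current write head */
    uint64      (*get_leader) (void);

    /* LSN at which the last write-head switch happened */
    XLogRecPtr  (*get_switch_lsn) (void);

    /* agree on a new leader after the old write head is lost */
    uint64      (*elect_leader) (void);
} ClusterCoordinator;

/* a coordinator extension would register its implementation at load time */
extern void RegisterClusterCoordinator(const ClusterCoordinator *coordinator);
```

Everything behind those calls - heartbeats, quorums, terms, persistence - would stay in the extension, which is the whole point.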