Re: Built-in Raft replication - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Built-in Raft replication
Msg-id CAMT0RQS53KGPnygnKY129BaWs0k4R4fUuq=9jhdtViNo_7PJ6A@mail.gmail.com
In response to Re: Built-in Raft replication  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Apr 16, 2025 at 6:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
> > I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
>
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(

My understanding is that RAFT is a fancy name for what PostgreSQL is
largely already doing - "electing" a leader and then doing all the
changes through that leader (a.k.a. WAL streaming).

The thing that needs adding - and which makes it "RAFT" instead of
just a streaming replication with a failover - is what happens when
the leader goes away and one of the replicas needs to become a new
leader.

We have ways to do rollbacks and roll-forwards, but the main tricky
part is "how do you know that you have not lost some changes?", and
here we must have some place to store the info about at which LSN the
failover happened, so that we know to run pg_rewind if any lost hosts
come back and want to join.
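
As a minimal illustration of the decision a returning node would make
once that switch LSN is recorded in some agreed place (all names here
are hypothetical):

/* Hypothetical sketch: decide whether a returning node must be
 * rewound.  "switch_lsn" is the LSN at which the cluster agreed the
 * failover happened; "my_last_lsn" is how far the lost node got on
 * the old timeline before it disappeared.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;

static bool
needs_rewind(lsn_t my_last_lsn, lsn_t switch_lsn)
{
    /*
     * WAL written beyond the agreed switch point is not part of the
     * new leader's history, so the node must rewind (pg_rewind) back
     * to switch_lsn before re-joining; otherwise it can simply resume
     * streaming from where it left off.
     */
    return my_last_lsn > switch_lsn;
}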

And of course we need to have a way to communicate this "who is the
new leader" info to clients, which needs new libpq functionality of
"failover connection pools" which hide the failovers from old clients.

The RAFT protocol could be a provider of the "who is the current
leader" info and optionally cache the LSN at which the switch happened.

> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

The main thing I would like to see in core is the ability to do clean
*switchovers* (not failovers) by sending a magic WAL record with a
message "hey node N, please take over the write node role" over WAL, so
that node N knows to self-promote and all other nodes know to start
following N from the next WAL record, either directly or via cascading
replication.

Why WAL - because it is already sent to all replicas, and it
guarantees continuity as it very clearly states at what LSN the
write-head switch happens.
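
Purely as a hypothetical sketch of the information such a record
would have to carry (nothing like this exists today, the field names
are made up):

/* Hypothetical payload of a "write-head switch" WAL record.  This is
 * only an illustration of what the proposal says the record would
 * need to carry; none of it exists in PostgreSQL.
 */
#include <stdint.h>

typedef struct SwitchoverWalRecord
{
    uint64_t    switch_lsn;      /* LSN at which the old write head stops */
    uint32_t    old_leader_id;   /* node identity of the outgoing leader */
    uint32_t    new_leader_id;   /* node N that should self-promote */
    uint8_t     forced;          /* 0 = clean switchover, 1 = written
                                  * after the fact because the old
                                  * write head was lost */
} SwitchoverWalRecord;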

We should also be able to send this info to the client libraries
currently connected to the writer, so that they can choose to switch
to the new head.

The rest could be easily an extension.

Mainly we want more than one "coordinator", which can be running on
some or all of the nodes, and which agree on
- which node is the current leader
- at which LSN the switch happened (so if some node coming back
discovers that it has magically moved ahead, it knows to rewind to
that LSN and then re-stream from the commonly agreed place)

It would also be nice to have some agreed, hopefully lightweight,
notion of node identity, which we could then use for many things,
including stamping it in WAL records to guarantee / verify that all
the nodes have been on the same WAL stream all the time.


But regarding whether to use RAFT, I would just define a "coordinator
API" and leave it up to the specific coordinator/consensus extension
to decide how the consensus is achieved.
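
Just to make the shape concrete, a rough and entirely hypothetical
sketch of what such a coordinator API could look like (all names are
invented here for illustration):

/* Hypothetical "coordinator API": core asks the loaded consensus
 * extension (RAFT, PAXOS, or "the DBA says so") these questions and
 * reports switchovers back to it.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t NodeId;

typedef struct CoordinatorRoutines
{
    /* Which node is the current write head, as agreed by consensus? */
    NodeId      (*get_current_leader) (void);

    /* At which LSN did the last write-head switch happen? */
    uint64_t    (*get_last_switch_lsn) (void);

    /* Core reports that a clean switchover record has been written
     * to WAL, so the agreed state can be updated and shared. */
    void        (*record_switchover) (NodeId new_leader,
                                      uint64_t switch_lsn);

    /* Ask the coordinator to elect a new leader after the current
     * one is lost; returns false if no quorum can be reached. */
    bool        (*elect_new_leader) (NodeId *new_leader);
} CoordinatorRoutines;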


So to summarize:

# Core should provide

- a way to move the write role to a new node
  - for switchover, a WAL-based switchover record
  - for failover, something similar which also writes the WAL record
so all histories are synced
- a libpq message informing clients about "new write head node"
- node IDs and more general cluster-awareness inside the PostgreSQL
node (I had a shoutout about this in a recent pgconf.dev unconference
talk)
- a new write-node field in WAL to track write head movement
- API for a joining node to find out which cluster it joins and the
switchover history
  - in WAL it is always a switchover, maybe with some info saying that
it was a forced switchover because we lost the old write head
  - if some lost node comes back it may need to rewind or
re-initialize if it finds out it had been following a lost timeline
that is not fully part of the commonly agreed history

NOTE: switchovers in WAL would be very similar to timeline changes. I
am not sure how much extra info is needed there.

# Extension can provide
- agreeing on new leader node in case of failover
  - protocol can be RAFT, PAXOS or "the DBA says so" :)
- sharing fresh info about current leader and switch timelines (though
this should more likely be in core)
- anything else ???

# External apps are (likely?) needed for
- setting up the cluster, provisioning machines / VMs
- setting up networking
- starting PostgreSQL servers
- spinning up and down clients
- communicating current leader and replica set (could be done by DNS
with agreed conventions)


