Re: Built-in Raft replication - Mailing list pgsql-hackers

From	Yura Sokolov
Subject	Re: Built-in Raft replication
Date	April 17 12:50:58
Msg-id	a2341912-5bab-4e25-9277-68d67c3f7655@postgrespro.ru
In response to	Re: Built-in Raft replication (Hannu Krosing <hannuk@google.com>)
List	pgsql-hackers

17.04.2025 00:24, Hannu Krosing wrote:
> But regarding whether to use RAFT, I would just define a "coordinator
> API" and leave it up to the specific coordinator/consensus extension
> to decide how the consensus is achieved
> 
> 
> So to summarize:
> 
> # Core should provide
> 
> - a way to move to a new node,
>   - for switchover a WAL-based switchover
>   - for failover something similar which also writes the WAL record so
> all histories are synced
> - a libpq message informing clients about "new write head node"
> - node IDs and more general cluster-awareness inside the PostgreSQL
> node (I had a shoutout about this in a recent pgconf.dev unconference
> talk)
> - a new write-node field in WAL to track write head movement
> - API for a joining node to find out which cluster it joins and the
> switchover history
>   - in WAL it is always a switchover, maybe with some info saying that
> it was a forced switchover because we lost the old write head
>   - if some lost node comes back it may need to rewind or
> re-initialize if it finds out it had been following a lost timeline
> that is not fully part of

Great summary!

I'd like to add:
- the ability to switch to a specific Timeline (to match the Raft Term).
  Timeline numbering really should be coordinated.
  With the current mechanism it is not unusual for different replicas to
become leader in the same Timeline, and that must be forbidden at any cost.

- removal of the forced checkpoint at replica promotion, or making it run in
the background instead of blocking.
  This is a really big issue reported by our support.

- as a possible improvement, a WAL record of the "certainly replicated LSN",
i.e. the LSN known to be settled in the quorum's WALs. RAFT uses it as an
optimisation of server start/leader promotion: it reduces the amount of log
that has to be searched for the last such point.
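
To make the first and last points a bit more concrete, here is a rough
sketch in Python (for brevity only). Everything in it is hypothetical:
Node, switch_timeline and quorum_replicated_lsn are names made up for
illustration, LSNs are plain integers, and none of this is an existing
PostgreSQL API. It only shows the two rules: at most one leader per
Timeline, and the "certainly replicated LSN" as the highest LSN flushed
on a majority of nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical per-node state; not an existing PostgreSQL structure."""
    node_id: int
    timeline: int = 1                            # plays the role of the Raft Term
    leaders: dict = field(default_factory=dict)  # timeline -> leader node_id

    def switch_timeline(self, new_timeline: int, leader_id: int) -> bool:
        """Accept a promotion only if the Timeline has no other leader."""
        if new_timeline < self.timeline:
            return False                         # stale Timeline, reject
        known = self.leaders.get(new_timeline)
        if known is not None and known != leader_id:
            return False                         # second leader in same Timeline
        self.leaders[new_timeline] = leader_id
        self.timeline = new_timeline
        return True


def quorum_replicated_lsn(flushed_lsns: list[int]) -> int:
    """Highest LSN flushed by a majority of nodes (leader included)."""
    if not flushed_lsns:
        raise ValueError("need at least one node")
    majority = len(flushed_lsns) // 2 + 1
    # The majority-th largest value is flushed on at least `majority` nodes.
    return sorted(flushed_lsns, reverse=True)[majority - 1]
```

With three nodes that have flushed up to 100, 80 and 60, the quorum LSN
is 80; on start or promotion a node would only need to examine the log
after the last such recorded point.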

> NOTE: switchovers in WAL would be very similar to timeline changes. I
> am not sure how much extra info is needed there.
> 
> # Extension can provide
> - agreeing on new leader node in case of failover
>   - protocol can be RAFT, PAXOS or "the DBA says so" :)
> - sharing fresh info about current leader and switch timelines (though
> this should more likely be in core)
> - anything else ???
> 
> # External apps are (likely?) needed for
> - setting up cluster, provisioning machines / VMs
> - setting up networking
> - starting PostgreSQL servers.
> - spinning up and down clients,
> - communicating current leader and replica set (could be done by DNS
> with agreed conventions)

-- 
regards
Yura Sokolov aka funny-falcon
