Thread: Built-in Raft replication

Built-in Raft replication

From

Konstantin Osipov

Date:

14 April, 20:15:23

Hi,

I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.

Raft's advantage is that it unifies log replication, cluster
configuration/membership/topology management and initial state transfer 
into a single protocol. 

Currently the cluster configuration/topology is often managed by
Patroni, or similar tools, however, it seems there are certain
usability drawbacks with this approach: 

- it's a separate tool, requiring an external state provider like etcd;
  raft could store its configuration in system tables; this is
  also an observability improvement since everyone could look up 
  cluster state the same way as everything else

- same for watchdog; raft has a built-in failure detector that's
  configuration aware;

- flexible quorums; currently quorum size is a configurable; 
  with a built-in raft, extending the quorum could be a matter
  of starting a new node and pointing it to an existing cluster

Going forward I can see PostgreSQL providing transparent bouncing
on pg_wire level, given that Raft state is now part of the
system, so drivers and all cluster nodes could easily see where
the leader is. 

If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.

Thanks,

-- 
Konstantin Osipov

Re: Built-in Raft replication

From

Kirill Reshke

Date:

14 April, 20:44:32

On Mon, 14 Apr 2025 at 22:15, Konstantin Osipov <kostja.osipov@gmail.com> wrote:
>
> Hi,

Hi

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.
>

Just some thought on top of my mind, if you need my voice here:

I have a hard time believing the community will be positive about this
change in-core. It has more changes as contrib extension. In fact, if
we want a built-in consensus algorithm, Paxos is a better option,
because you can use postgresql as local crash-safe storage for single
decree paxos, just store your state (ballot number, last voice) in a
heap table.
OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already written parts of it (so, it is not appended
only, unlike WAL). If you have a production system which maintains two
kinds of logs with different semantics, it is a very hard system to
maintain..

There is actually a prod-ready (non open source) implementation of
RAFT as extension, called BiHA, by pgpro.

Just some thought on top of my mind, if you need my voice here.

-- 
Best regards,
Kirill Reshke

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

14 April, 22:00:22

* Kirill Reshke <reshkekirill@gmail.com> [25/04/14 20:48]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
> >
> 
> Just some thought on top of my mind, if you need my voice here:
> 
> I have a hard time believing the community will be positive about this
> change in-core. It has more changes as contrib extension. In fact, if
> we want a built-in consensus algorithm, Paxos is a better option,
> because you can use postgresql as local crash-safe storage for single
> decree paxos, just store your state (ballot number, last voice) in a
> heap table.

But Raft is a log replication algorithm, not a consensus
algorithm. It does use consensus, but that's for leader election.
Paxos could be used for log replication, but that would be
expensive. In fact etcd uses Raft, and etcd is used by Patroni. So
I completely lost your line of thought here.

> OTOH Raft needs to write its own log, and what's worse, it sometimes
> needs to remove already written parts of it (so, it is not appended
> only, unlike WAL). If you have a production system which maintains two
> kinds of logs with different semantics, it is a very hard system to
> maintain..

My proposal is exactly to replace (or rather, extend) the current
synchronous log replication with Raft. Entry removal is possible to
stack on top of append-only format, and production implementations
exist which do that.

So, no, it's a single log, and in fact the current WAL will do.

> There is actually a prod-ready (non open source) implementation of
> RAFT as extension, called BiHA, by pgpro.

My guess biha is an extension since a proprietary code is easier
to maintain that way. I'd rather say the fact that there is a
proprietary implementation out in the field confirms it could be a
good idea to have it in PostgreSQL trunk. 

In any case I'm interested in contributing to the trunk, not
building a proprietary module/fork.

-- 
Konstantin Osipov

Re: Built-in Raft replication

From

Yura Sokolov

Date:

15 April, 11:57:24

14.04.2025 20:44, Kirill Reshke пишет:
> OTOH Raft needs to write its own log, and what's worse, it sometimes
> needs to remove already written parts of it (so, it is not appended
> only, unlike WAL). If you have a production system which maintains two
> kinds of logs with different semantics, it is a very hard system to
> maintain..

Raft is log replication protocol which uses log position and term.
But... PostgreSQL already have log position and term in its WAL structure.
PostgreSQL's timeline is actually the Term.
Raft implementer needs just to correct rules for Term/Timeline switching:
- instead of "next TimeLine number is just increment of largest known
TimeLine number" it needs to be "next TimeLine number is the result of
Leader Election".

And yes, "it sometimes needs to remove already written parts of it".
But... It is exactly what every PostgreSQL's cluster manager software have
to do to join previous leader as a follower to new leader - pg_rewind.

So, PostgreSQL already have 70-90%% of Raft implementation details.
Raft doesn't have to be implemented in PostgreSQL.
Raft has to be finished!!!

PS: One of the biggest issues is forced snapshot on replica promotion. It
really slows down leader switch time. It looks like it is not really
needed, or some small workaround should be enough.

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Aleksander Alekseev

Date:

15 April, 13:20:11

Hi Konstantin,

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 Tb of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago) which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to recover from a backup if something happens to your primary
server. The common wisdom is if you can avoid building distributed
systems, don't build one.

Which brings the question if we want to maintain something like this
(which will include logic for cases when a node joins or leaves the
cluster, proxy server / service discovery for clients, test cases /
infrastructure for all this and also upgrading the cluster, docs, ...)
for a presumably view users which business doesn't fit in a single
server *and* they want an automatic failover (not the manual one)
*and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting personally I'm inclined to think that
it's better to invest community resources into something else.

-- 
Best regards,
Aleksander Alekseev

Re: Built-in Raft replication

From

Yura Sokolov

Date:

15 April, 14:01:45

15.04.2025 13:20, Aleksander Alekseev пишет:
> Hi Konstantin,
> 
>> I am considering starting work on implementing a built-in Raft
>> replication for PostgreSQL.
> 
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
> 
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 Tb of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago) which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to recover from a backup if something happens to your primary
> server. The common wisdom is if you can avoid building distributed
> systems, don't build one.
> 
> Which brings the question if we want to maintain something like this
> (which will include logic for cases when a node joins or leaves the
> cluster, proxy server / service discovery for clients, test cases /
> infrastructure for all this and also upgrading the cluster, docs, ...)
> for a presumably view users which business doesn't fit in a single
> server *and* they want an automatic failover (not the manual one)
> *and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.
> 
> Although the idea is tempting personally I'm inclined to think that
> it's better to invest community resources into something else.

Raft is not for "commodity hardware". It is for reliability.
Yes, it needs 3 servers instead of 2. It costs more than simple replication
with "manual" failover.

But if business needs high availability, it wouldn't rely on "manual"
failover. And if business relies on correctness, it would rely on any
solution which "automatically switches between two replicas". Because there
is no way to guarantee correctness with just two replicas. And many stories
of lost transactions with Patroni/Stolon already confirms this thesis.

CockroachDB/Neon - they are good solutions for distributed systems. But, as
you've said, many clients don't need distributed systems. They just need
reliable replication.

I've been working in a company which uses MongoDB (3.6 and up) as their
primary storage. And it seemed to me as "God Send". Everything just worked.
Replication was as reliable as one could imagine. It outlives several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream PostgreSQL will be as reliable as MongoDB without need of
external services.

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

15 April, 14:14:35

* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 12:02]:

> > OTOH Raft needs to write its own log, and what's worse, it sometimes
> > needs to remove already written parts of it (so, it is not appended
> > only, unlike WAL). If you have a production system which maintains two
> > kinds of logs with different semantics, it is a very hard system to
> > maintain..
> 
> Raft is log replication protocol which uses log position and term.
> But... PostgreSQL already have log position and term in its WAL structure.
> PostgreSQL's timeline is actually the Term.
> Raft implementer needs just to correct rules for Term/Timeline switching:
> - instead of "next TimeLine number is just increment of largest known
> TimeLine number" it needs to be "next TimeLine number is the result of
> Leader Election".
> 
> And yes, "it sometimes needs to remove already written parts of it".
> But... It is exactly what every PostgreSQL's cluster manager software have
> to do to join previous leader as a follower to new leader - pg_rewind.
> 
> So, PostgreSQL already have 70-90%% of Raft implementation details.
> Raft doesn't have to be implemented in PostgreSQL.
> Raft has to be finished!!!
> 
> PS: One of the biggest issues is forced snapshot on replica promotion. It
> really slows down leader switch time. It looks like it is not really
> needed, or some small workaround should be enough.

I'd say my pet peeve is storing the cluster topology (the so
called raft configuration) inside the database, not in an external
state provider. Agree on other points.

-- 
Konstantin Osipov

Re: Built-in Raft replication

From

Aleksander Alekseev

Date:

15 April, 14:15:36

Hi Yura,

> I've been working in a company which uses MongoDB (3.6 and up) as their
> primary storage. And it seemed to me as "God Send". Everything just worked.
> Replication was as reliable as one could imagine. It outlives several
> hardware incidents without manual intervention. It allowed cluster
> maintenance (software and hardware upgrades) without application downtime.
> I really dream PostgreSQL will be as reliable as MongoDB without need of
> external services.

I completely understand. I had exactly the same experience with
Stolon. Everything just worked. And the setup took like 5 minutes.

It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences which costs money. Or perhaps people are
just happy with Patroni. I'm not sure in which state Stolon is today.

-- 
Best regards,
Aleksander Alekseev

Re: Built-in Raft replication

From

Yura Sokolov

Date:

15 April, 14:20:24

15.04.2025 14:15, Aleksander Alekseev пишет:
> Hi Yura,
> 
>> I've been working in a company which uses MongoDB (3.6 and up) as their
>> primary storage. And it seemed to me as "God Send". Everything just worked.
>> Replication was as reliable as one could imagine. It outlives several
>> hardware incidents without manual intervention. It allowed cluster
>> maintenance (software and hardware upgrades) without application downtime.
>> I really dream PostgreSQL will be as reliable as MongoDB without need of
>> external services.
> 
> I completely understand. I had exactly the same experience with
> Stolon. Everything just worked. And the setup took like 5 minutes.
> 
> It's a pity this project doesn't seem to get as much attention as
> Patroni. Probably because attention requires traveling and presenting
> the project at conferences which costs money. Or perhaps people are
> just happy with Patroni. I'm not sure in which state Stolon is today.

But the key point: if PostgreSQL will be improved a bit, there will be no
need neither in Patroni, nor in Stolon. Isn't it great?

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

15 April, 14:23:14

* Aleksander Alekseev <aleksander@timescale.com> [25/04/15 13:20]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
> 
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
> 
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 Tb of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago) which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to recover from a backup if something happens to your primary
> server. The common wisdom is if you can avoid building distributed
> systems, don't build one.
> 
> Which brings the question if we want to maintain something like this
> (which will include logic for cases when a node joins or leaves the
> cluster, proxy server / service discovery for clients, test cases /
> infrastructure for all this and also upgrading the cluster, docs, ...)
> for a presumably view users which business doesn't fit in a single
> server *and* they want an automatic failover (not the manual one)
> *and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.
> 
> Although the idea is tempting personally I'm inclined to think that
> it's better to invest community resources into something else.

My personal take away from this as a community member would be
seamless coordinator failover in Greenplum and all of its forks
(CloudBerry, Greengage, synxdata, what not). I also imagine there
is a number of PostgreSQL derivatives that could benefit from
built-in transparent failover since it standardizes the solution
space.

-- 
Konstantin Osipov

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

15 April, 14:24:33

* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 14:02]:
> I've been working in a company which uses MongoDB (3.6 and up) as their
> primary storage. And it seemed to me as "God Send". Everything just worked.
> Replication was as reliable as one could imagine. It outlives several
> hardware incidents without manual intervention. It allowed cluster
> maintenance (software and hardware upgrades) without application downtime.
> I really dream PostgreSQL will be as reliable as MongoDB without need of
> external services.

thanks for pointing out mongodb, so built-in raft would help
ferretdb as well.

-- 
Konstantin Osipov

Re: Built-in Raft replication

From

Greg Sabino Mullane

Date:

15 April, 18:07:42

On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:

If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.

Putting aside the technical concerns about this specific idea, it's best to start by laying out a very detailed plan of what you would want to change, and what you see as the costs and benefits. It's also extremely helpful to think about developing this as an extension. If you get stuck due to extension limitations, propose additional hooks. If the hooks will not work, explain why.

Getting this into core is going to be a long, multi-year effort, in which people are going to be pushing back the entire time, so prepare yourself for that. My immediate retort is going to be: why would we add this if there are existing tools that already do the job just fine? Postgres has lots of tasks that it is happy to let other programs/OS subsystems/extensions/etc. handle instead.

Cheers,

Greg

Crunchy Data - https://www.crunchydata.com

Enterprise Postgres Software Products & Tech Support

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

15 April, 20:27:34

* Greg Sabino Mullane <htamfids@gmail.com> [25/04/15 18:08]:

> > If anyone is working on Raft already I'd be happy to discuss
> > the details. I am fairly new to the PostgreSQL hackers ecosystem
> > so cautious of starting work in isolation/knowing there is no
> > interest in accepting the feature into the trunk.
> >
> 
> Putting aside the technical concerns about this specific idea, it's best to
> start by laying out a very detailed plan of what you would want to change,
> and what you see as the costs and benefits. It's also extremely helpful to
> think about developing this as an extension. If you get stuck due to
> extension limitations, propose additional hooks. If the hooks will not
> work, explain why.
> 
> Getting this into core is going to be a long, multi-year effort, in which
> people are going to be pushing back the entire time, so prepare yourself
> for that. My immediate retort is going to be: why would we add this if
> there are existing tools that already do the job just fine? Postgres has
> lots of tasks that it is happy to let other programs/OS
> subsystems/extensions/etc. handle instead.

I had hoped I explained why external state providers can not
provide the same seamless UX as built-in ones. The key idea is to
have a built-in configuration management, so that adding and
removing replicas does not require changes in multiple disjoint
parts of the installation (server configurations, proxies,
clients).

I understand and accept that it's a multi-year effort, but I do
not accept the retort - my main point is that external tools
are not a replacement, and I'd like to reach consensus on that.

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Nikolay Samokhvalov

Date:

16 April, 01:27:56

On Tue, Apr 15, 2025 at 8:08 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:

On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:
If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.

Putting aside the technical concerns about this specific idea, it's best to start by laying out a very detailed plan of what you would want to change, and what you see as the costs and benefits. It's also extremely helpful to think about developing this as an extension. If you get stuck due to extension limitations, propose additional hooks. If the hooks will not work, explain why.

This is exactly what I wanted to write as well. The idea is great. At the same time, I think, consensus on many decisions will be extremely hard to reach, so this project has a high risk of being very long. Unless it's an extension, at least in the beginning.

Nik

Re: Built-in Raft replication

From

Tom Lane

Date:

16 April, 02:19:42

Nikolay Samokhvalov <nik@postgres.ai> writes:
> This is exactly what I wanted to write as well. The idea is great. At the
> same time, I think, consensus on many decisions will be extremely hard to
> reach, so this project has a high risk of being very long. Unless it's an
> extension, at least in the beginning.

Yeah.  The two questions you'd have to get past to get this into PG
core are:

1. Why can't it be an extension?  (You claimed it would work more
seamlessly in core, but I don't think you've made a proven case.)

2. Why depend on Raft rather than some other project?

Longtime PG developers are going to be particularly hard on point 2,
because we have a track record now of outliving outside projects
that we thought we could rely on.  One example here is the Snowball
stemmer; while its upstream isn't quite dead, it's twitching only
feebly, and seems to have a bus factor of 1.  Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream.  And then there's libxml2.  And uuid-ossp.
And autoconf.  And various documentation toolchains.  Need I go on?

The great advantage of implementing an outside dependency in an
extension is that if the depended-on project dies, we can say a few
words of mourning and move on.  It's a lot harder to walk away from
in-core features.

            regards, tom lane

Re: Built-in Raft replication

From

Andrey Borodin

Date:

16 April, 07:07:28

> On 16 Apr 2025, at 04:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.

I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.

IMO to better understand what is proposed we need some more description of proposed systems. How the new system will be
configured?initdb and what than? How new node joins cluster? What is running pg_rewind when necessary? 

Some time ago Peter E proposed to be able to start replication atop of empty directory, so that initial sync would be
morestraightforward. And also Heikki proposed to remove archive race condition when choosing new timeline. I think this
stepsare gradual movement in the same direction. 

My view is what Konstantin wants is automatic replication topology management. For some reason this technology is
calledHA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to
providesome fault-tolerance properties. I'd start to design from here, not from Raft paper. 

Best regards, Andrey Borodin.

Re: Built-in Raft replication

From

Tom Lane

Date:

16 April, 07:26:53

Andrey Borodin <x4mmm@yandex-team.ru> writes:
> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.

Hmm, OK.  I thought that the proposal involved relying on some existing
code, but re-reading the thread that was said nowhere.  Still, that
moves it from a large project to a really large project :-(

I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.

            regards, tom lane

Re: Built-in Raft replication

From

Ashutosh Bapat

Date:

16 April, 07:33:15

On Wed, Apr 16, 2025 at 9:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> My view is what Konstantin wants is automatic replication topology management. For some reason this technology is
calledHA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to
providesome fault-tolerance properties. I'd start to design from here, not from Raft paper. 
>

In my experience, the load of managing hundreds of replicas which all
participate in RAFT protocol becomes more than regular transaction
load. So making every replica a RAFT participant will affect the
ability to deploy hundreds of replica. We may build an extension which
has a similar role in PostgreSQL world as zookeeper in Hadoop. It can
be then used for other distributed systems as well - like shared
nothing clusters based on FDW. There's already a proposal to bring
CREATE SERVER to the world of logical replication - so I see these two
worlds uniting in future. The way I imagine it is some PostgreSQL
instances, which have this extension installed, will act as a RAFT
cluster (similar to Zookeeper ensemble or etcd cluster). The
distributed system based on logical replication or FDW or both will
use this ensemble to manage its shared state. The same ensemble can be
shared across multiple distributed clusters if it has scaling
capabilities.

--
Best Wishes,
Ashutosh Bapat

Re: Built-in Raft replication

From

Andrey Borodin

Date:

16 April, 07:58:55

> On 16 Apr 2025, at 09:33, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> In my experience, the load of managing hundreds of replicas which all
> participate in RAFT protocol becomes more than regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replica.

No need to make all standbys voting. And no need to make plain topology. pg_consul is using 2/3 or 3/5 HA groups, and
cascadesall others from HA group. 
Existing tools already solve the original problem, Konstantin is just proposing to solve it in some standard “official”
way.

> We may build an extension which
> has a similar role in PostgreSQL world as zookeeper in Hadoop.

Patroni, pg_consul and others already use zookeeper, etcd and similar systems for consensus.
Is it any better as extension than as etcd?

> It can
> be then used for other distributed systems as well - like shared
> nothing clusters based on FDW.

I didn’t get FDW analogy. Why other distributed systems should choose Postgres extension over Zookeeper?

> There's already a proposal to bring
> CREATE SERVER to the world of logical replication - so I see these two
> worlds uniting in future.

Again, I’m lost here. Which two worlds?

> The way I imagine it is some PostgreSQL
> instances, which have this extension installed, will act as a RAFT
> cluster (similar to Zookeeper ensemble or etcd cluster).

That’s exactly what is proposed here.

> The
> distributed system based on logical replication or FDW or both will
> use this ensemble to manage its shared state. The same ensemble can be
> shared across multiple distributed clusters if it has scaling
> capabilities.

Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to coordinate
pg_consuls.

Actually, scalability is opposite to topic of this thread. Let me explain.
Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is
proposingto shorten this loop and make Postgres use its build-in automatic failover. 

So, existing tooling allows you to have 3 hosts for DCS, with majority of 2 hosts able to elect new leader in case of
failover.
And you can have only 2 hosts for Postgres - Primary and Standby. You can have 2 big Postgres machines with 64 CPUs.
And3 one-CPU hosts for Zookeper\etcd. 

If you use build-in failover you have to resort to 3 big Postgres machines because you need 2/3 majority. Of course,
youcan install MySQL-stype arbiter - host that had no real PGDATA, only participates in voting. But this is a solution
toproblem induced by built-in autofailover. 

Best regards, Andrey Borodin.

Re: Built-in Raft replication

From

Andrey Borodin

Date:

16 April, 08:24:48

> On 16 Apr 2025, at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
>
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(
>
> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

I think I can provide some reasons why it cannot be neither extension, nor any part running within postmaster reign.

1. When joining cluster, there’s not PGDATA to run postmaster on top of it.

2. After failover, old Primary node must rejoin cluster by running pg_rewind and following timeline switch.

The system in hand must be able to manipulate with PGDATA without starting Postgres.

My question to Konstantin is Why wouldn’t you just add Raft to Patroni? Is there a reason why something like Patroni is
notin core and noone rushes to get it in? 
Everyone is using it, or system like it.

Best regards, Andrey Borodin.

Re: Built-in Raft replication

From

Kirill Reshke

Date:

16 April, 08:39:07

On Wed, 16 Apr 2025 at 10:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I think I can provide some reasons why it cannot be neither extension, nor any part running within postmaster reign.
>
> 1. When joining cluster, there’s not PGDATA to run postmaster on top of it.

You can join the cluster on pg_basebackup of its master; So I dont get
why this is an anti-extension restriction.

> 2. After failover, old Primary node must rejoin cluster by running pg_rewind and following timeline switch.

You can run bash from extension, what's the point?

> The system in hand must be able to manipulate with PGDATA without starting Postgres.

--
Best regards,
Kirill Reshke

Re: Built-in Raft replication

From

Andrey Borodin

Date:

16 April, 08:44:13


> On 16 Apr 2025, at 10:39, Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> You can run bash from extension, what's the point?

You cannot run bash that will stop backend running bash.


Best regards, Andrey Borodin.

Re: Built-in Raft replication

From

Ashutosh Bapat

Date:

16 April, 09:18:05

On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > We may build an extension which
> > has a similar role in PostgreSQL world as zookeeper in Hadoop.
>
> Patroni, pg_consul and others already use zookeeper, etcd and similar systems for consensus.
> Is it any better as extension than as etcd?

I feel so. An extension runs from within a postgresql process, uses
the same protocol as PostgreSQL whereas etcd is another process and
another protocol.

>
> > It can
> > be then used for other distributed systems as well - like shared
> > nothing clusters based on FDW.
>
> I didn’t get FDW analogy. Why other distributed systems should choose Postgres extension over Zookeeper?

By other distributed systems I mean PostgreSQL distributed systems -
FDW based native sharding or native replication or a system which uses
both.

>
> > There's already a proposal to bring
> > CREATE SERVER to the world of logical replication - so I see these two
> > worlds uniting in future.
>
> Again, I’m lost here. Which two worlds?

Logical replication and FDW based native sharding.

>
> > The
> > distributed system based on logical replication or FDW or both will
> > use this ensemble to manage its shared state. The same ensemble can be
> > shared across multiple distributed clusters if it has scaling
> > capabilities.
>
> Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to
coordinatepg_consuls. 
>
> Actually, scalability is opposite to topic of this thread. Let me explain.
> Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is
proposingto shorten this loop and make Postgres use its build-in automatic failover. 
>
> So, existing tooling allows you to have 3 hosts for DCS, with majority of 2 hosts able to elect new leader in case of
failover.
> And you can have only 2 hosts for Postgres - Primary and Standby. You can have 2 big Postgres machines with 64 CPUs.
And3 one-CPU hosts for Zookeper\etcd. 
>
> If you use build-in failover you have to resort to 3 big Postgres machines because you need 2/3 majority. Of course,
youcan install MySQL-stype arbiter - host that had no real PGDATA, only participates in voting. But this is a solution
toproblem induced by built-in autofailover. 

Users find it a waste of resources to deploy 3 big PostgreSQL
instances just for HA where 2 suffice even if they deploy 3
lightweight DCS instances. Having only some of the nodes act as DCS
and others purely PostgreSQL nodes will reduce waste of resources.

--
Best Wishes,
Ashutosh Bapat

Re: Built-in Raft replication

From

Andrey Borodin

Date:

16 April, 09:27:13

> On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> Having only some of the nodes act as DCS
> and others purely PostgreSQL nodes will reduce waste of resources.

But typically you need more DCS nodes than PostgreSQL nodes. Did you mean
“Having only some of nodes act as PostgreSQL and others purely DCS nodes will reduce waste of resources”?

Best regards, Andrey Borodin.

Re: Built-in Raft replication

From

Ashutosh Bapat

Date:

16 April, 09:45:44

On Wed, Apr 16, 2025 at 11:57 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
> > On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Having only some of the nodes act as DCS
> > and others purely PostgreSQL nodes will reduce waste of resources.
>
> But typically you need more DCS nodes than PostgreSQL nodes. Did you mean

In a small HA setup this might be true. But not when there are many
replicas. But ...

> “Having only some nodes act as PostgreSQL and others purely DCS nodes will reduce waste of resources”?

I mean, whatever the setup may be one shouldn't require to deploy a
big PostgreSQL server just because DCS needs majority.

--
Best Wishes,
Ashutosh Bapat

Re: Built-in Raft replication

From

Michael Banck

Date:

16 April, 10:50:45

Hi,

On Wed, Apr 16, 2025 at 10:24:48AM +0500, Andrey Borodin wrote:
> I think I can provide some reasons why it cannot be neither extension,
> nor any part running within postmaster reign.
> 
> 1. When joining cluster, there’s not PGDATA to run postmaster on top
> of it.
> 
> 2. After failover, old Primary node must rejoin cluster by running
> pg_rewind and following timeline switch.
> 
> The system in hand must be able to manipulate with PGDATA without
> starting Postgres.

Yeah, while you could maybe implement some/all of the RAFT protocol in
an extension, actually building something useful on top with regards to
high availability or distributed whatever does not look feasible.

> My question to Konstantin is Why wouldn’t you just add Raft to
> Patroni?

Patroni can use pysyncobj, which is a Python implementation of RAFT, so
then you do not need an external RAFT provider like etcd, consul or
zookeeper. However, it is deemed deprecated by the Patroni authors due
to being difficult to debug when it breaks.

I guess a better Python implementation of RAFT for Patroni to use or
Patroni to implement it itself would help, but I believe nobody is
working on the latter right now, nor has any plans to do so. And there
also does not seem to be anybody working on a better pysyncobj.

> Is there a reason why something like Patroni is not in core and noone
> rushes to get it in?  Everyone is using it, or system like it.

Well, Patroni is written in Python, for starters.  It also does a lot
more than just leader election / cluster config. So I think nobody
seriously thought about proposing to put Patroni into core so far.

I guess the current proposal tries to be a step into the "something like
Patroni in core" if you tilt your head a little. It's just that the
whole thing would be a really big step for Postgres, maybe similar to
deciding we want in-core replication way back when.

Michael

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 12:47:00

* Tom Lane <tgl@sss.pgh.pa.us> [25/04/16 11:05]:
> Nikolay Samokhvalov <nik@postgres.ai> writes:
> > This is exactly what I wanted to write as well. The idea is great. At the
> > same time, I think, consensus on many decisions will be extremely hard to
> > reach, so this project has a high risk of being very long. Unless it's an
> > extension, at least in the beginning.
> 
> Yeah.  The two questions you'd have to get past to get this into PG
> core are:
> 
> 1. Why can't it be an extension?  (You claimed it would work more
> seamlessly in core, but I don't think you've made a proven case.)

I think this can be best addressed when the discussion moves on to
an architecture design record, where the UX and implementation
details are outlined. I'm sure there can be a lot of bike-shedding
on that part. For now I merely wanted to know if:
- maybe there is a reason this will never be accepted
- maybe someone is already working on this.

From the replies I sense that while there is quite a bit of
scepticism about it ever making its way into the trunk, generally
there is no aversion to it. If my understanding is right,
it's a decent start.

> 2. Why depend on Raft rather than some other project?
> 
> Longtime PG developers are going to be particularly hard on point 2,
> because we have a track record now of outliving outside projects
> that we thought we could rely on.  One example here is the Snowball
> stemmer; while its upstream isn't quite dead, it's twitching only
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.  And then there's libxml2.  And uuid-ossp.
> And autoconf.  And various documentation toolchains.  Need I go on?
> 
> The great advantage of implementing an outside dependency in an
> extension is that if the depended-on project dies, we can say a few
> words of mourning and move on.  It's a lot harder to walk away from
> in-core features.

Raft is an algorithm, not a library. For a quick start the project
could use an existing library - I'd pick tidb's raft-rs, which
happens to be implemented in Rust, but going forward I'd guess the
community will want to have a plain C implementation. 

There is a plethora of C implementations out there, but in my
somewhat educated opinion none are good enough for PostgreSQL
standards or purposes: ideally the protocol should be fully
isolated from storage and transport and extensively tested,
randomized & injection tests being a priority. Most of C
implementation I've seen are built by enthusiasts as a
self-education projects.

So at some point the project will need its own Raft
implementation. Good news is that the design of Raft internals
has been fairly well polished in all of the various
implementations in many different programming languages, so 
it should be a fairly straightforward job.

Regarding the maintenance, since its first publishing back in ~2010
the protocol stabilized quite a bit. The core of the protocol
doesn't get many changes, I'd say nearly no changes, and it's also 
noticeable in implementations, e.g. etcd-raft, raft-rs from tikv, etc
don't get many new commits nowadays.

Now a more broad question is whether or not Raft is an optimal
long term solution for log replication? Generally Raft is
leader-based, so in theory it could be replaced with a leader-less
protocol - e.g. FastPaxos, EPaxos, and newer
developments on top of those. To the best of my understanding all 
leader-less algorithms which provide a single round-trip commit cost
require co-designing the transaction and replication
layer - which may be a way more intrusive change than adding raft
on top of the existing synchronous replication in PostgreSQL.

Given that Raft already provides an amortized single-round-trip
commit time, and the goal is simplicity of UX and unification,
I'd say it's wise to wait and see for the leader-less approaches
to mature.

At the end of the day, there is always a trade-off of trying to 
do something today and waiting for perfection, but in case of Raft
in my personal opinion the balance is just right.

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 12:53:09

* Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> [25/04/16 11:06]:
> > My view is what Konstantin wants is automatic replication topology management. For some reason this technology is
calledHA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to
providesome fault-tolerance properties. I'd start to design from here, not from Raft paper.
 
> >
> In my experience, the load of managing hundreds of replicas which all
> participate in RAFT protocol becomes more than regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replica.

I think this experience needs to be detailed out. There are
implementations in the field that are less efficient than others.

Early etcd-raft didn't have pre-voting and had "bastardized" 
(their own definition) implementation of configuration changes
which didn't use joint consensus.

Then there is a liveness issue if leader election is implemented
in a straightforward way in large clusters. But this is addressed:
scaling up the randomized election timeout with the cluster size,
converting most of participants to non-voters in large clusters. 

Raft replication, again, if implemented in a naive way, would
require a O(outstanding transaction) * number of replicas amount of
RAM. But that doesn't have to be naive.

To sum up, I am not aware of any principal limitations in this
area.

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 12:58:32

* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
> > 
> > Hmm, OK.  I thought that the proposal involved relying on some existing
> > code, but re-reading the thread that was said nowhere.  Still, that
> > moves it from a large project to a really large project :-(
> > 
> > I continue to think that it'd be best to try to implement it as
> > an extension, at least up till the point of finding show-stopping
> > reasons why it cannot be that.
> 
> I think I can provide some reasons why it cannot be neither extension, nor any part running within postmaster reign.
> 
> 1. When joining cluster, there’s not PGDATA to run postmaster on top of it.
> 
> 2. After failover, old Primary node must rejoin cluster by running pg_rewind and following timeline switch.
> 
> The system in hand must be able to manipulate with PGDATA without starting Postgres.
> 
> My question to Konstantin is Why wouldn’t you just add Raft to Patroni? Is there a reason why something like Patroni
isnot in core and noone rushes to get it in?

> Everyone is using it, or system like it.

Raft uses the same WAL to store configuration change records as is used
for commit records. This is at the core of the correctness of the
algorithm. This is also my biggest concern with correctness of
Patroni - but to the best of my knowledge 's 90%+ of
use cases of Patroni use a "fixed" quorum size, that's defined at
start of the deployment and never/rarely changes.
Contrast to that being able to a replica to the quorum at any
time, and all it takes is just starting this replica and pointing
it at the existing cluster. This greatly simplifies UX.

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 13:02:00

* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > You can run bash from extension, what's the point?
> 
> You cannot run bash that will stop backend running bash.

You're right there is a chicken and egg problem when you add 
Raft to an existing project, and rebootstrap 
becomes a trick, but it's a plumbing trick.

The new member needs to generate and persist a globally unique
identifier as the first step. Later it can 
reintroduce itself to the cluster given this identifier
can be preserved in the new incarnation (popen + fork).

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Alastair Turner

Date:

16 April, 15:53:23

On Wed, 16 Apr 2025 at 07:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> If you use build-in failover you have to resort to 3 big Postgres machines because you need 2/3 majority. Of course, you can install MySQL-stype arbiter - host that had no real PGDATA, only participates in voting. But this is a solution to problem induced by built-in autofailover.

Users find it a waste of resources to deploy 3 big PostgreSQL
instances just for HA where 2 suffice even if they deploy 3
lightweight DCS instances. Having only some of the nodes act as DCS
and others purely PostgreSQL nodes will reduce waste of resources.

The experience of other projects/products with automated failover based on quorum shows that this is a critical issue for adoption. In the In-memory Data Grid space (Coherence, Geode/GemFire) the question of how to ensure that some nodes didn't carry any data comes up early in many architecture discussions. When RabbitMQ shipped their Quorum Queues feature, the first and hardest area of pushback was around all nodes hosting message content.

It's not just about the requirement for compute resources, it's also about bandwidth and latency. Many large organisations have, for historical reasons, pairs of data centres with very good point-to-point connectivity. As the requirement for quorum witnesses has come up for all sorts of things, including storage arrays, they have built arbiter/witness sites at branches, colocation providers or even on the public cloud. More than not holding user data or processing queries, the arbiter can't even be sent the replication stream for the user data in the database, it just won't fit down the pipe.

Which feels like a very difficult requirement to meet if the replication model for all data is being changed to a quorum model.

Regards
Alastair

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 17:07:36

* Alastair Turner <minion@decodable.me> [25/04/16 15:58]:

> > > If you use build-in failover you have to resort to 3 big Postgres
> > machines because you need 2/3 majority. Of course, you can install
> > MySQL-stype arbiter - host that had no real PGDATA, only participates in
> > voting. But this is a solution to problem induced by built-in autofailover.
> >
> > Users find it a waste of resources to deploy 3 big PostgreSQL
> > instances just for HA where 2 suffice even if they deploy 3
> > lightweight DCS instances. Having only some of the nodes act as DCS
> > and others purely PostgreSQL nodes will reduce waste of resources.
> >
> > The experience of other projects/products with automated failover based on
> quorum shows that this is a critical issue for adoption. In the In-memory
> Data Grid space (Coherence, Geode/GemFire) the question of how to ensure
> that some nodes didn't carry any data comes up early in many architecture
> discussions. When RabbitMQ shipped their Quorum Queues feature, the first
> and hardest area of pushback was around all nodes hosting message content.
> 
> It's not just about the requirement for compute resources, it's also about
> bandwidth and latency. Many large organisations have, for historical
> reasons, pairs of data centres with very good point-to-point connectivity.
> As the requirement for quorum witnesses has come up for all sorts of
> things, including storage arrays, they have built arbiter/witness sites at
> branches, colocation providers or even on the public cloud. More than not
> holding user data or processing queries, the arbiter can't even be sent the
> replication stream for the user data in the database, it just won't fit
> down the pipe.
> 
> Which feels like a very difficult requirement to meet if the replication
> model for all data is being changed to a quorum model.

I agree master/replica deployment layouts are very popular and are
not going to directly benefit from raft. They'll still work, but no
automation will be available, just like today with Patroni.

However, if the storage cost is an argument, then the logical path is to
disaggregate storage/compute altogether, i.e. use projects like
neon.

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Yura Sokolov

Date:

16 April, 17:29:09

16.04.2025 08:24, Andrey Borodin пишет:
> 2. After failover, old Primary node must rejoin cluster by running pg_rewind and following timeline switch.

It is really do-able: BiHA already does it. And BiHA runs as a child
process of postmaster, ie both postmaster and BiHA doesn't restart when
PostgreSQL needs to rewind and restart.

Yes, there are non-trivial amount of changes made into postmaster
machinery. But it is doable.

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Yura Sokolov

Date:

16 April, 17:35:27

16.04.2025 07:58, Andrey Borodin пишет:
> Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to
coordinatepg_consuls.
 
> 
> Actually, scalability is opposite to topic of this thread. Let me explain.
> Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is
proposingto shorten this loop and make Postgres use its build-in automatic failover.
 
> 
> So, existing tooling allows you to have 3 hosts for DCS, with majority of 2 hosts able to elect new leader in case of
failover.
> And you can have only 2 hosts for Postgres - Primary and Standby. You can have 2 big Postgres machines with 64 CPUs.
And3 one-CPU hosts for Zookeper\etcd.
 
> 
> If you use build-in failover you have to resort to 3 big Postgres machines because you need 2/3 majority. Of course,
youcan install MySQL-stype arbiter - host that had no real PGDATA, only participates in voting. But this is a solution
toproblem induced by built-in autofailover.
 

Arbiter can store WAL without (almost) any data. Then it is not only for
voting, but also for reliability as almost full featured third server.

Certainly, it may become only "read-only master" - just to replicate WAL's
tail it has and commit it by commiting record in new term/timeline. Then it
should give leadership to other replica immediately.

This idea is not a fantasy. BiHA does it.

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Greg Sabino Mullane

Date:

16 April, 22:29:05

On Wed, Apr 16, 2025 at 2:18 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

Users find it a waste of resources to deploy 3 big PostgreSQL instances just for HA where 2 suffice even if they deploy 3 lightweight DCS instances. Having only some of the nodes act as DCS and others purely PostgreSQL nodes will reduce waste of resources.

A big problem is that putting your DCS into Postgres means your whole system is now super-sensitive to IO/WAL-streaming issues, and a busy database doing database stuff is going to start affecting the DCS stuff. With three lightweight DCS servers, you don't really need to worry about how stressed the database servers are. In that way, I feel etcd et al. are adhering to the unix philosophy of "do one thing, and do it well."

Cheers,

Greg

Crunchy Data - https://www.crunchydata.com

Enterprise Postgres Software Products & Tech Support

Re: Built-in Raft replication

From

Konstantin Osipov

Date:

16 April, 22:35:34

* Greg Sabino Mullane <htamfids@gmail.com> [25/04/16 22:33]:
> > Users find it a waste of resources to deploy 3 big PostgreSQL instances
> > just for HA where 2 suffice even if they deploy 3 lightweight DCS
> > instances. Having only some of the nodes act as DCS and others purely
> > PostgreSQL nodes will reduce waste of resources.
> >
> 
> A big problem is that putting your DCS into Postgres means your whole
> system is now super-sensitive to IO/WAL-streaming issues, and a busy
> database doing database stuff is going to start affecting the DCS stuff.

Affecting in what way? Do you have a scenario in mind where an
external state provider would act differently (better)? 

> With three lightweight DCS servers, you don't really need to worry about
> how stressed the database servers are. In that way, I feel etcd et al. are
> adhering to the unix philosophy of "do one thing, and do it well."

-- 
Konstantin Osipov, Moscow, Russia

Re: Built-in Raft replication

From

Hannu Krosing

Date:

17 April, 00:24:09

On Wed, Apr 16, 2025 at 6:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
> > I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
>
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(

My understanding is that RAFT is a fancy name for what PostgreSQL is
largely already doing - "electing" a leader and then doing all the
changes through that leader (a.k.a. WAL streaming)

The thing that needs adding - and which makes it "RAFT" instead of
just a streaming replication with a failover - is what happens when
the leader goes away and one of the replicas needs to become a new
leader.

We have ways to do rollbacks and roll-forwards but the main tricky
part is "how do you know that you have not lost some changes" and here
we must have some place to store the info about at which LSN the
failover happened, so that we know to run pg_rewind if any losts hosts
come back and want to join.

And of course we need to have a way to communicate this "who is the
new leader" to clients which needs new libpgq functionality of
"failover connection pools" which hide the failovers from old clients.

The RAFT protocol could be a provider of "who is current leader info"
and optionally cache the LSN the switch happened.

> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

The main thing I would like to see in core is ability to do clean
*switchovers* (not failovers) by sending a magic WAL record with a
message "hey node N, please take over the write node role" over WAL so
that node N knows to self-promote and all other nodes know to start
following N starting  from the next WAL record either directly or

Why WAL - because it is already is sent to all replicas, it guarantees
continuity as it very clearly states at what LSN the write-head-switch
happens.

We also should be able to send this info to the client libraries
currently connected to the writer. so that they can choose to switch
to the new head.

The rest could be easily an extension.

Mainly we want more than one "coordinators" which can be running in
some or all of the nodes.and who agree on
- which node is current leader
- at which LSN the switch happened (so if some node coming back
discovers that it has magicall moved ahead it knows to rewind to that
LSN and then re-stream it from commonly agreed place.

 It would also be nice to have some agreed, hopefully lightweight,
notion of node identity, which we could then use for many things,
including stamping it in WAL records to guarantee / verify that all
the nodes have been on the same WAL stream all the time

But regarding weather to use RAFT I would just define a "coordinator
API" and leave it up to the specific coordinator/consensus extension
to decide how the consensus is achieved

So to summarize:

# Core should provide

- way tomove to new node,
  - for switchover a WAL-based switchover
  - for failover something similar which also writes the WAL record so
all histories are synced
- a libpq message informing clients about "new write head node"
- node IDs and more general c;luster-awareness inside the PostgreSQL
node (I had a shoutout about this in a recent pgconf.dev unconference
talk)
- a new write-node field in WAL to track write head movement
- API for a joining node to find out which cluster it joins and the
switchover history
  - in WAL it is always switchover, maybe with some info saying that
it was a forces switchover because we lost old write head
  - if some lost node comes back it may need to rewind or
re-initialize if it finds out it had been following a lost timeline
that is not fully part of

NOTE: switchovers in WAL would be very similar to timeline changes. I
am not sure how much extra info is needed there.

# Extension can provide
- agreeing on new leader node in case of failover
  - protocol can be RAFT, PAXOS or "the DBA says so" :)
- sharing fresh info about current leader and switch timelines (though
this should more likely be in core)
- anything else ???

# external apps is (likely?) needed for
- setting up cluster, provisioning machines / VMs
- setting up networking
- starting PostgreSQL servers.
- spinning up and down clients,
- communicating current leader and replica set (could be done by DNS
with agreed conventions)

Re: Built-in Raft replication

From

Alastair Turner

Date:

17 April, 00:45:45

Hi Konstantin

On Wed, 16 Apr 2025 at 15:07, Konstantin Osipov <kostja.osipov@gmail.com> wrote:

* Alastair Turner <minion@decodable.me> [25/04/16 15:58]:

> > > If you use build-in failover you have to resort to 3 big Postgres
> > machines because you need 2/3 majority. Of course, you can install
> > MySQL-stype arbiter - host that had no real PGDATA, only participates in
> > voting. But this is a solution to problem induced by built-in autofailover.
> >
> > Users find it a waste of resources to deploy 3 big PostgreSQL
> > instances just for HA where 2 suffice even if they deploy 3
> > lightweight DCS instances. Having only some of the nodes act as DCS
> > and others purely PostgreSQL nodes will reduce waste of resources.
> >
> > The experience of other projects/products with automated failover based on
> quorum shows that this is a critical issue for adoption. In the In-memory
> Data Grid space (Coherence, Geode/GemFire) the question of how to ensure
> that some nodes didn't carry any data comes up early in many architecture
> discussions. When RabbitMQ shipped their Quorum Queues feature, the first
> and hardest area of pushback was around all nodes hosting message content.
>
> It's not just about the requirement for compute resources, it's also about
> bandwidth and latency. Many large organisations have, for historical
> reasons, pairs of data centres with very good point-to-point connectivity.
> As the requirement for quorum witnesses has come up for all sorts of
> things, including storage arrays, they have built arbiter/witness sites at
> branches, colocation providers or even on the public cloud. More than not
> holding user data or processing queries, the arbiter can't even be sent the
> replication stream for the user data in the database, it just won't fit
> down the pipe.
>
> Which feels like a very difficult requirement to meet if the replication
> model for all data is being changed to a quorum model.

I agree master/replica deployment layouts are very popular and are
not going to directly benefit from raft. They'll still work, but no
automation will be available, just like today with Patroni.

Users of Patroni and etcd setups can get automation for two-site primary/replica pairs by putting a third etcd node on a third site. Which only requires moving the membership/leadership data to the arbiter site, not all database activity.

However, if the storage cost is an argument, then the logical path is to
disaggregate storage/compute altogether, i.e. use projects like
neon.

The issue is not generally storage, but network. There may simply not be enough bandwidth available to transmit the whole WAL to the arbiter site.

Many on-premises IT setups have this limitation in some form.

If your proposal would leave these large, traditional user organisations (which account for thousands of Postgres HA pairs or DR pairs) doing what they currently do with wraparound tooling like Patroni, and create a new, in core, option for balanced 3, 5, 7... member groups, then I don't think it's worth doing.

Regards,

Alastair

Re: Built-in Raft replication

From

Yura Sokolov

Date:

17 April, 12:50:58

17.04.2025 00:24, Hannu Krosing пишет:
> But regarding weather to use RAFT I would just define a "coordinator
> API" and leave it up to the specific coordinator/consensus extension
> to decide how the consensus is achieved
> 
> 
> So to summarize:
> 
> # Core should provide
> 
> - way tomove to new node,
>   - for switchover a WAL-based switchover
>   - for failover something similar which also writes the WAL record so
> all histories are synced
> - a libpq message informing clients about "new write head node"
> - node IDs and more general c;luster-awareness inside the PostgreSQL
> node (I had a shoutout about this in a recent pgconf.dev unconference
> talk)
> - a new write-node field in WAL to track write head movement
> - API for a joining node to find out which cluster it joins and the
> switchover history
>   - in WAL it is always switchover, maybe with some info saying that
> it was a forces switchover because we lost old write head
>   - if some lost node comes back it may need to rewind or
> re-initialize if it finds out it had been following a lost timeline
> that is not fully part of

Great summary!

I'd like to add:
- ability to switch to specific Timeline (to match with Raft Term).
  Timeline numbering is really better to be coordinated.
  It is not so unusual when different replicas become leader in the same
Timeline with current mechanism, and it should be forbidden at any cost.

- remove of forced checkpoint at replica promotion, or make it background
instead of blocking.
  It is really huge issue reported by our support.

- as possible improvement, WAL record of "certainly replicated LSN" - ie,
LSN known to settled in quorum's WALs. RAFT uses it as an optimisation of
server start/leader promotion: it reduces amount of log to search last such
point.

> NOTE: switchovers in WAL would be very similar to timeline changes. I
> am not sure how much extra info is needed there.
> 
> # Extension can provide
> - agreeing on new leader node in case of failover
>   - protocol can be RAFT, PAXOS or "the DBA says so" :)
> - sharing fresh info about current leader and switch timelines (though
> this should more likely be in core)
> - anything else ???
> 
> # external apps is (likely?) needed for
> - setting up cluster, provisioning machines / VMs
> - setting up networking
> - starting PostgreSQL servers.
> - spinning up and down clients,
> - communicating current leader and replica set (could be done by DNS
> with agreed conventions)

-- 
regards
Yura Sokolov aka funny-falcon

Re: Built-in Raft replication

From

Jim Nasby

Date:

23 April, 19:48:46

On Apr 16, 2025, at 2:29 PM, Greg Sabino Mullane <htamfids@gmail.com> wrote:

On Wed, Apr 16, 2025 at 2:18 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Users find it a waste of resources to deploy 3 big PostgreSQL instances just for HA where 2 suffice even if they deploy 3 lightweight DCS instances. Having only some of the nodes act as DCS and others purely PostgreSQL nodes will reduce waste of resources.

A big problem is that putting your DCS into Postgres means your whole system is now super-sensitive to IO/WAL-streaming issues, and a busy database doing database stuff is going to start affecting the DCS stuff. With three lightweight DCS servers, you don't really need to worry about how stressed the database servers are. In that way, I feel etcd et al. are adhering to the unix philosophy of "do one thing, and do it well.”

… unless we added multiple WAL streams. That would allow for splitting WAL traffic across multiple devices as well as providing better support for configurations that don’t replicate the entire cluster. The current situation where delayed replication of a single table mandates retention of all the WAL for the entire cluster is less than ideal.

Re: Built-in Raft replication

From

Devrim Gündüz

Date:

23 April, 20:01:05

Hi,

On Wed, 2025-04-23 at 11:48 -0500, Jim Nasby wrote:
> unless we added multiple WAL streams. That would allow for splitting
> WAL traffic across multiple devices as well as providing better
> support for configurations that don’t replicate the entire cluster.
> The current situation where delayed replication of a single table
> mandates retention of all the WAL for the entire cluster is less than
> ideal.

I think the problem is handling the stream of global objects. Having
separate stream for each database would be awesome as long as it can
deal with the "global stream".

Regards,
--
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
BlueSky: @devrim.gunduz.org , @gunduz.org

Attachment

signature.asc

Re: Built-in Raft replication

From

Jim Nasby

Date:

29 April, 23:16:07

I've always assumed there'd have to be at least one global stream, if for no other purpose than to be the source of truth about transaction commit ordering (though, I was thinking of supporting multiple streams for one database). Presumably the same could be used for shared objects. Or perhaps shared objects just get their own stream. Either way, having a master commit record that points at LSNs of various other streams is what I'd been thinking.

On Wed, Apr 23, 2025 at 12:01 PM Devrim Gündüz <devrim@gunduz.org> wrote:

Hi,

On Wed, 2025-04-23 at 11:48 -0500, Jim Nasby wrote:
> unless we added multiple WAL streams. That would allow for splitting
> WAL traffic across multiple devices as well as providing better
> support for configurations that don’t replicate the entire cluster.
> The current situation where delayed replication of a single table
> mandates retention of all the WAL for the entire cluster is less than
> ideal.

I think the problem is handling the stream of global objects. Having
separate stream for each database would be awesome as long as it can
deal with the "global stream".

Regards,
--
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
BlueSky: @devrim.gunduz.org , @gunduz.org