Thread: Built-in Raft replication

Built-in Raft replication

From: Konstantin Osipov
Hi,

I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.

Raft's advantage is that it unifies log replication, cluster
configuration/membership/topology management and initial state transfer 
into a single protocol. 

Currently the cluster configuration/topology is often managed by
Patroni or similar tools; however, it seems there are certain
usability drawbacks to this approach:

- it's a separate tool, requiring an external state provider like etcd;
  Raft could store its configuration in system tables; this is
  also an observability improvement, since everyone could look up
  cluster state the same way as everything else

- the same goes for the watchdog; Raft has a built-in failure detector
  that's configuration-aware;

- flexible quorums; currently the quorum size is a configuration setting;
  with built-in Raft, extending the quorum could be a matter
  of starting a new node and pointing it at an existing cluster

Going forward, I can see PostgreSQL providing transparent bouncing
at the pg_wire level: with Raft state being part of the system,
drivers and all cluster nodes could easily see where the leader is.
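
As a rough illustration of the UX I have in mind (a sketch only: the
pg_raft_state view below is invented and merely stands for the kind of
system catalog this proposal would add; the libpq calls are real):

#include <stdio.h>
#include <libpq-fe.h>

static PGconn *
connect_to_leader(const char *any_node_conninfo)
{
    PGconn   *conn = PQconnectdb(any_node_conninfo);
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return conn;

    /* ask any node where the leader is, via the hypothetical view */
    res = PQexec(conn, "SELECT leader_host, leader_port FROM pg_raft_state");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
    {
        char conninfo[256];

        snprintf(conninfo, sizeof(conninfo), "host=%s port=%s",
                 PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
        PQclear(res);
        PQfinish(conn);
        return PQconnectdb(conninfo);   /* reconnect to the leader */
    }
    PQclear(res);
    return conn;        /* already on the leader, or state unknown */
}

With something like that, any client can land on the leader without
an external proxy or DCS lookup.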

If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
cautious about starting work in isolation, or without knowing whether
there is interest in accepting the feature into the trunk.

Thanks,

-- 
Konstantin Osipov



Re: Built-in Raft replication

From: Kirill Reshke
On Mon, 14 Apr 2025 at 22:15, Konstantin Osipov <kostja.osipov@gmail.com> wrote:
>
> Hi,

Hi

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.
>

Just some thoughts off the top of my head, if you need my voice here:

I have a hard time believing the community will be positive about this
change in-core. It has better chances as a contrib extension. In fact, if
we want a built-in consensus algorithm, Paxos is a better option,
because you can use PostgreSQL as local crash-safe storage for
single-decree Paxos: just store your state (ballot number, last vote) in a
heap table.

OTOH Raft needs to write its own log and, what's worse, it sometimes
needs to remove already-written parts of it (so it is not append-only,
unlike the WAL). If you have a production system which maintains two
kinds of logs with different semantics, it is a very hard system to
maintain.
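
Roughly, the receiver side from the Raft paper looks like this (a
simplified sketch in C; the types are invented for illustration, this
is not code from any real implementation):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t term;
    /* payload omitted */
} RaftEntry;

typedef struct
{
    RaftEntry *entries;  /* preallocated; capacity checks omitted */
    uint64_t   len;      /* number of entries; indexes are 1-based */
} RaftLog;

/* Reject on a gap or prev-term mismatch; on a conflicting entry,
 * discard our suffix (the non-append-only part); then append. */
static bool
append_entries(RaftLog *log, uint64_t prev_index, uint64_t prev_term,
               const RaftEntry *new_entries, uint64_t n)
{
    if (prev_index > log->len ||
        (prev_index > 0 && log->entries[prev_index - 1].term != prev_term))
        return false;

    for (uint64_t i = 0; i < n; i++)
    {
        uint64_t idx = prev_index + 1 + i;

        if (idx <= log->len && log->entries[idx - 1].term != new_entries[i].term)
            log->len = idx - 1;               /* truncate conflicting suffix */
        if (idx > log->len)
            log->entries[log->len++] = new_entries[i];    /* plain append */
    }
    return true;
}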

There is actually a production-ready (non-open-source) implementation of
Raft as an extension, called BiHA, by pgpro.

-- 
Best regards,
Kirill Reshke



Re: Built-in Raft replication

From: Konstantin Osipov
* Kirill Reshke <reshkekirill@gmail.com> [25/04/14 20:48]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
> >
> 
> Just some thoughts off the top of my head, if you need my voice here:
> 
> I have a hard time believing the community will be positive about this
> change in-core. It has better chances as a contrib extension. In fact, if
> we want a built-in consensus algorithm, Paxos is a better option,
> because you can use PostgreSQL as local crash-safe storage for
> single-decree Paxos: just store your state (ballot number, last vote) in a
> heap table.

But Raft is a log replication algorithm, not a consensus
algorithm. It does use consensus, but only for leader election.
Paxos could be used for log replication, but that would be
expensive. In fact, etcd uses Raft, and etcd is used by Patroni -
so I completely lost your line of thought here.

> OTOH Raft needs to write its own log and, what's worse, it sometimes
> needs to remove already-written parts of it (so it is not append-only,
> unlike the WAL). If you have a production system which maintains two
> kinds of logs with different semantics, it is a very hard system to
> maintain.

My proposal is exactly to replace (or rather, extend) the current
synchronous log replication with Raft. Entry removal can be stacked
on top of an append-only format, and production implementations
exist which do that.

So, no, it's a single log, and in fact the current WAL will do.
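
A sketch of what I mean by stacking removal on top of an append-only
file (record framing, checksums and fsync elided; this describes no
particular existing implementation):

#include <stdint.h>

typedef enum
{
    REC_ENTRY,      /* a replicated log entry */
    REC_TRUNCATE    /* logically discards everything after 'index' */
} RecType;

typedef struct
{
    RecType  type;
    uint64_t index;
    uint64_t term;
} Rec;

/* Recovery scans the file in append order; the latest REC_TRUNCATE
 * wins, so old bytes are never rewritten in place. */
static uint64_t
effective_end(const Rec *recs, uint64_t n)
{
    uint64_t end = 0;

    for (uint64_t i = 0; i < n; i++)
    {
        if (recs[i].type == REC_TRUNCATE)
            end = recs[i].index;        /* logical rewind */
        else if (recs[i].index > end)
            end = recs[i].index;        /* entry extends the log */
    }
    return end;
}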

> There is actually a production-ready (non-open-source) implementation of
> Raft as an extension, called BiHA, by pgpro.

My guess is that BiHA is an extension because proprietary code is easier
to maintain that way. I'd rather say the fact that there is a
proprietary implementation out in the field confirms it could be a
good idea to have this in the PostgreSQL trunk.

In any case I'm interested in contributing to the trunk, not
building a proprietary module/fork.

-- 
Konstantin Osipov



Re: Built-in Raft replication

From: Yura Sokolov
14.04.2025 20:44, Kirill Reshke writes:
> OTOH Raft needs to write its own log and, what's worse, it sometimes
> needs to remove already-written parts of it (so it is not append-only,
> unlike the WAL). If you have a production system which maintains two
> kinds of logs with different semantics, it is a very hard system to
> maintain.

Raft is a log replication protocol which uses a log position and a term.
But... PostgreSQL already has a log position and a term in its WAL structure:
PostgreSQL's timeline is actually the Term.
A Raft implementer just needs to correct the rules for Term/Timeline switching:
- instead of "the next TimeLine number is just an increment of the largest
known TimeLine number", it needs to be "the next TimeLine number is the
result of Leader Election".

And yes, "it sometimes needs to remove already written parts of it".
But... it is exactly what every PostgreSQL cluster manager has to do to
rejoin the previous leader as a follower of the new leader: pg_rewind.

So, PostgreSQL already has 70-90% of the Raft implementation details.
Raft doesn't have to be implemented in PostgreSQL.
Raft has to be finished!

PS: One of the biggest issues is forced snapshot on replica promotion. It
really slows down leader switch time. It looks like it is not really
needed, or some small workaround should be enough.

-- 
regards
Yura Sokolov aka funny-falcon



Re: Built-in Raft replication

From: Aleksander Alekseev
Hi Konstantin,

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 TB of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago) which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to recover from a backup if something happens to your primary
server. The common wisdom is if you can avoid building distributed
systems, don't build one.

Which brings the question of whether we want to maintain something like this
(which will include logic for cases when a node joins or leaves the
cluster, proxy server / service discovery for clients, test cases /
infrastructure for all this, and also upgrading the cluster, docs, ...)
for a presumably few users whose business doesn't fit in a single
server *and* who want automatic failover (not the manual kind)
*and* who don't use Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting, personally I'm inclined to think that
it's better to invest community resources into something else.

-- 
Best regards,
Aleksander Alekseev



Re: Built-in Raft replication

From: Yura Sokolov
15.04.2025 13:20, Aleksander Alekseev writes:
> Hi Konstantin,
> 
>> I am considering starting work on implementing a built-in Raft
>> replication for PostgreSQL.
> 
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
> 
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 Tb of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago) which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to recover from a backup if something happens to your primary
> server. The common wisdom is if you can avoid building distributed
> systems, don't build one.
> 
> Which brings the question of whether we want to maintain something like this
> (which will include logic for cases when a node joins or leaves the
> cluster, proxy server / service discovery for clients, test cases /
> infrastructure for all this, and also upgrading the cluster, docs, ...)
> for a presumably few users whose business doesn't fit in a single
> server *and* who want automatic failover (not the manual kind)
> *and* who don't use Patroni/Stolon/CockroachDB/Neon/... already.
> 
> Although the idea is tempting personally I'm inclined to think that
> it's better to invest community resources into something else.

Raft is not for "commodity hardware". It is for reliability.
Yes, it needs 3 servers instead of 2. It costs more than simple replication
with "manual" failover.

But if a business needs high availability, it wouldn't rely on "manual"
failover. And if a business relies on correctness, it wouldn't rely on any
solution which "automatically switches between two replicas", because there
is no way to guarantee correctness with just two replicas. And many stories
of lost transactions with Patroni/Stolon already confirm this thesis.

CockroachDB/Neon - they are good solutions for distributed systems. But, as
you've said, many clients don't need distributed systems. They just need
reliable replication.

I've been working in a company which uses MongoDB (3.6 and up) as its
primary storage, and it seemed to me a godsend. Everything just worked.
Replication was as reliable as one could imagine. It outlived several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream that PostgreSQL will be as reliable as MongoDB without
needing external services.

-- 
regards
Yura Sokolov aka funny-falcon



Re: Built-in Raft replication

From: Konstantin Osipov
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 12:02]:

> > OTOH Raft needs to write its own log and, what's worse, it sometimes
> > needs to remove already-written parts of it (so it is not append-only,
> > unlike the WAL). If you have a production system which maintains two
> > kinds of logs with different semantics, it is a very hard system to
> > maintain.
> 
> Raft is a log replication protocol which uses a log position and a term.
> But... PostgreSQL already has a log position and a term in its WAL structure:
> PostgreSQL's timeline is actually the Term.
> A Raft implementer just needs to correct the rules for Term/Timeline switching:
> - instead of "the next TimeLine number is just an increment of the largest
> known TimeLine number", it needs to be "the next TimeLine number is the
> result of Leader Election".
> 
> And yes, "it sometimes needs to remove already written parts of it".
> But... it is exactly what every PostgreSQL cluster manager has to do to
> rejoin the previous leader as a follower of the new leader: pg_rewind.
> 
> So, PostgreSQL already has 70-90% of the Raft implementation details.
> Raft doesn't have to be implemented in PostgreSQL.
> Raft has to be finished!
> 
> PS: One of the biggest issues is forced snapshot on replica promotion. It
> really slows down leader switch time. It looks like it is not really
> needed, or some small workaround should be enough.

I'd say my pet peeve is storing the cluster topology (the so-called
Raft configuration) inside the database, not in an external
state provider. Agreed on the other points.

-- 
Konstantin Osipov



Re: Built-in Raft replication

From: Aleksander Alekseev
Hi Yura,

> I've been working in a company which uses MongoDB (3.6 and up) as its
> primary storage, and it seemed to me a godsend. Everything just worked.
> Replication was as reliable as one could imagine. It outlived several
> hardware incidents without manual intervention. It allowed cluster
> maintenance (software and hardware upgrades) without application downtime.
> I really dream that PostgreSQL will be as reliable as MongoDB without
> needing external services.

I completely understand. I had exactly the same experience with
Stolon. Everything just worked. And the setup took like 5 minutes.

It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences, which costs money. Or perhaps people are
just happy with Patroni. I'm not sure what state Stolon is in today.

-- 
Best regards,
Aleksander Alekseev



Re: Built-in Raft replication

From: Yura Sokolov
15.04.2025 14:15, Aleksander Alekseev writes:
> Hi Yura,
> 
>> I've been working in a company which uses MongoDB (3.6 and up) as its
>> primary storage, and it seemed to me a godsend. Everything just worked.
>> Replication was as reliable as one could imagine. It outlived several
>> hardware incidents without manual intervention. It allowed cluster
>> maintenance (software and hardware upgrades) without application downtime.
>> I really dream that PostgreSQL will be as reliable as MongoDB without
>> needing external services.
> 
> I completely understand. I had exactly the same experience with
> Stolon. Everything just worked. And the setup took like 5 minutes.
> 
> It's a pity this project doesn't seem to get as much attention as
> Patroni. Probably because attention requires traveling and presenting
> the project at conferences which costs money. Or perhaps people are
> just happy with Patroni. I'm not sure in which state Stolon is today.

But the key point: if PostgreSQL is improved a bit, there will be no
need for either Patroni or Stolon. Isn't that great?

-- 
regards
Yura Sokolov aka funny-falcon



Re: Built-in Raft replication

From: Konstantin Osipov
* Aleksander Alekseev <aleksander@timescale.com> [25/04/15 13:20]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
> 
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
> 
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 Tb of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago) which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to recover from a backup if something happens to your primary
> server. The common wisdom is if you can avoid building distributed
> systems, don't build one.
> 
> Which brings the question of whether we want to maintain something like this
> (which will include logic for cases when a node joins or leaves the
> cluster, proxy server / service discovery for clients, test cases /
> infrastructure for all this, and also upgrading the cluster, docs, ...)
> for a presumably few users whose business doesn't fit in a single
> server *and* who want automatic failover (not the manual kind)
> *and* who don't use Patroni/Stolon/CockroachDB/Neon/... already.
> 
> Although the idea is tempting personally I'm inclined to think that
> it's better to invest community resources into something else.

My personal takeaway from this as a community member would be
seamless coordinator failover in Greenplum and all of its forks
(CloudBerry, Greengage, synxdata, what not). I also imagine there
are a number of PostgreSQL derivatives that could benefit from
built-in transparent failover, since it standardizes the solution
space.

-- 
Konstantin Osipov



Re: Built-in Raft replication

From: Konstantin Osipov
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 14:02]:
> I've been working in a company which uses MongoDB (3.6 and up) as its
> primary storage, and it seemed to me a godsend. Everything just worked.
> Replication was as reliable as one could imagine. It outlived several
> hardware incidents without manual intervention. It allowed cluster
> maintenance (software and hardware upgrades) without application downtime.
> I really dream that PostgreSQL will be as reliable as MongoDB without
> needing external services.

Thanks for pointing out MongoDB - built-in Raft would help
FerretDB as well.

-- 
Konstantin Osipov



Re: Built-in Raft replication

From: Greg Sabino Mullane
On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:
> If anyone is working on Raft already I'd be happy to discuss
> the details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
> cautious about starting work in isolation, or without knowing whether
> there is interest in accepting the feature into the trunk.

Putting aside the technical concerns about this specific idea, it's best to start by laying out a very detailed plan of what you would want to change, and what you see as the costs and benefits. It's also extremely helpful to think about developing this as an extension. If you get stuck due to extension limitations, propose additional hooks. If the hooks will not work, explain why.

Getting this into core is going to be a long, multi-year effort, in which people are going to be pushing back the entire time, so prepare yourself for that. My immediate retort is going to be: why would we add this if there are existing tools that already do the job just fine? Postgres has lots of tasks that it is happy to let other programs/OS subsystems/extensions/etc. handle instead.
 
Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Built-in Raft replication

From: Konstantin Osipov
* Greg Sabino Mullane <htamfids@gmail.com> [25/04/15 18:08]:

> > If anyone is working on Raft already I'd be happy to discuss
> > the details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
> > cautious about starting work in isolation, or without knowing whether
> > there is interest in accepting the feature into the trunk.
> >
> 
> Putting aside the technical concerns about this specific idea, it's best to
> start by laying out a very detailed plan of what you would want to change,
> and what you see as the costs and benefits. It's also extremely helpful to
> think about developing this as an extension. If you get stuck due to
> extension limitations, propose additional hooks. If the hooks will not
> work, explain why.
> 
> Getting this into core is going to be a long, multi-year effort, in which
> people are going to be pushing back the entire time, so prepare yourself
> for that. My immediate retort is going to be: why would we add this if
> there are existing tools that already do the job just fine? Postgres has
> lots of tasks that it is happy to let other programs/OS
> subsystems/extensions/etc. handle instead.

I had hoped I explained why external state providers cannot
provide the same seamless UX as built-in ones. The key idea is to
have built-in configuration management, so that adding and
removing replicas does not require changes in multiple disjoint
parts of the installation (server configurations, proxies,
clients).

I understand and accept that it's a multi-year effort, but I do
not accept the retort - my main point is that external tools
are not a replacement, and I'd like to reach consensus on that.

-- 
Konstantin Osipov, Moscow, Russia



Re: Built-in Raft replication

From: Nikolay Samokhvalov
On Tue, Apr 15, 2025 at 8:08 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:
> > If anyone is working on Raft already I'd be happy to discuss
> > the details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
> > cautious about starting work in isolation, or without knowing whether
> > there is interest in accepting the feature into the trunk.
>
> Putting aside the technical concerns about this specific idea, it's best to start by laying out a very detailed plan of what you would want to change, and what you see as the costs and benefits. It's also extremely helpful to think about developing this as an extension. If you get stuck due to extension limitations, propose additional hooks. If the hooks will not work, explain why.

This is exactly what I wanted to write as well. The idea is great. At the same time, I think, consensus on many decisions will be extremely hard to reach, so this project has a high risk of being very long. Unless it's an extension, at least in the beginning.

Nik

Re: Built-in Raft replication

From: Tom Lane
Nikolay Samokhvalov <nik@postgres.ai> writes:
> This is exactly what I wanted to write as well. The idea is great. At the
> same time, I think, consensus on many decisions will be extremely hard to
> reach, so this project has a high risk of being very long. Unless it's an
> extension, at least in the beginning.

Yeah.  The two questions you'd have to get past to get this into PG
core are:

1. Why can't it be an extension?  (You claimed it would work more
seamlessly in core, but I don't think you've made a proven case.)

2. Why depend on Raft rather than some other project?

Longtime PG developers are going to be particularly hard on point 2,
because we have a track record now of outliving outside projects
that we thought we could rely on.  One example here is the Snowball
stemmer; while its upstream isn't quite dead, it's twitching only
feebly, and seems to have a bus factor of 1.  Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream.  And then there's libxml2.  And uuid-ossp.
And autoconf.  And various documentation toolchains.  Need I go on?

The great advantage of implementing an outside dependency in an
extension is that if the depended-on project dies, we can say a few
words of mourning and move on.  It's a lot harder to walk away from
in-core features.

            regards, tom lane



Re: Built-in Raft replication

From: Andrey Borodin

> On 16 Apr 2025, at 04:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.

I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.

IMO, to better understand what is proposed, we need some more description of the proposed system. How will the new
system be configured? initdb and then what? How does a new node join the cluster? Who runs pg_rewind when necessary?

Some time ago Peter E proposed to make it possible to start replication on top of an empty directory, so that the
initial sync would be more straightforward. And Heikki proposed to remove the archive race condition when choosing a
new timeline. I think these steps are a gradual movement in the same direction.

My view is that what Konstantin wants is automatic replication topology management. For some reason this technology is
called HA, DCS, Raft, Paxos and many other scary words. But basically it manages the primary_conninfo of some nodes
to provide some fault-tolerance properties. I'd start to design from here, not from the Raft paper.


Best regards, Andrey Borodin.


Re: Built-in Raft replication

From: Tom Lane
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.

Hmm, OK.  I thought that the proposal involved relying on some existing
code, but re-reading the thread that was said nowhere.  Still, that
moves it from a large project to a really large project :-(

I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.

            regards, tom lane



Re: Built-in Raft replication

From: Ashutosh Bapat
On Wed, Apr 16, 2025 at 9:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> My view is that what Konstantin wants is automatic replication topology management. For some reason this technology is
> called HA, DCS, Raft, Paxos and many other scary words. But basically it manages the primary_conninfo of some nodes
> to provide some fault-tolerance properties. I'd start to design from here, not from the Raft paper.
>

In my experience, the load of managing hundreds of replicas which all
participate in the RAFT protocol becomes more than the regular transaction
load. So making every replica a RAFT participant will affect the
ability to deploy hundreds of replicas. We may build an extension which
has a similar role in the PostgreSQL world as ZooKeeper has in Hadoop. It
can then be used for other distributed systems as well - like
shared-nothing clusters based on FDW. There's already a proposal to bring
CREATE SERVER to the world of logical replication - so I see these two
worlds uniting in the future. The way I imagine it is that some PostgreSQL
instances, which have this extension installed, will act as a RAFT
cluster (similar to a ZooKeeper ensemble or an etcd cluster). The
distributed system based on logical replication or FDW or both will
use this ensemble to manage its shared state. The same ensemble can be
shared across multiple distributed clusters if it has scaling
capabilities.

--
Best Wishes,
Ashutosh Bapat



Re: Built-in Raft replication

From: Andrey Borodin

> On 16 Apr 2025, at 09:33, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> In my experience, the load of managing hundreds of replicas which all
> participate in the RAFT protocol becomes more than the regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replicas.

No need to make all standbys voters, and no need for a plain topology. pg_consul uses 2/3 or 3/5 HA groups, and
cascades all other standbys from the HA group.
Existing tools already solve the original problem; Konstantin is just proposing to solve it in some standard "official"
way.

> We may build an extension which
> has a similar role in PostgreSQL world as zookeeper in Hadoop.

Patroni, pg_consul and others already use ZooKeeper, etcd and similar systems for consensus.
Is it any better as an extension than as etcd?

> It can
> be then used for other distributed systems as well - like shared
> nothing clusters based on FDW.

I didn't get the FDW analogy. Why should other distributed systems choose a Postgres extension over ZooKeeper?

> There's already a proposal to bring
> CREATE SERVER to the world of logical replication - so I see these two
> worlds uniting in future.

Again, I’m lost here. Which two worlds?

> The way I imagine it is some PostgreSQL
> instances, which have this extension installed, will act as a RAFT
> cluster (similar to Zookeeper ensemble or etcd cluster).

That’s exactly what is proposed here.

> The
> distributed system based on logical replication or FDW or both will
> use this ensemble to manage its shared state. The same ensemble can be
> shared across multiple distributed clusters if it has scaling
> capabilities.

Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to coordinate
pg_consuls.

Actually, scalability is the opposite of the topic of this thread. Let me explain.
Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is
proposing to shorten this loop and make Postgres use its own built-in automatic failover.

So, existing tooling allows you to have 3 hosts for the DCS, with a majority of 2 hosts able to elect a new leader in
case of failover.
And you can have only 2 hosts for Postgres - primary and standby. You can have 2 big Postgres machines with 64 CPUs,
and 3 one-CPU hosts for ZooKeeper/etcd.

If you use built-in failover, you have to resort to 3 big Postgres machines because you need a 2/3 majority. Of course,
you can install a MySQL-style arbiter - a host that has no real PGDATA and only participates in voting. But this is a
solution to a problem induced by built-in autofailover.


Best regards, Andrey Borodin.


Re: Built-in Raft replication

From: Andrey Borodin

> On 16 Apr 2025, at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
>
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(
>
> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

I think I can provide some reasons why it can be neither an extension nor any part running within the postmaster's reign.

1. When joining a cluster, there's no PGDATA to run postmaster on top of.

2. After failover, the old primary node must rejoin the cluster by running pg_rewind and following the timeline switch.

The system in hand must be able to manipulate PGDATA without starting Postgres.

My question to Konstantin is: why wouldn't you just add Raft to Patroni? Is there a reason why something like Patroni is
not in core and no one rushes to get it in?
Everyone is using it, or a system like it.


Best regards, Andrey Borodin.


Re: Built-in Raft replication

From: Kirill Reshke
On Wed, 16 Apr 2025 at 10:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I think I can provide some reasons why it can be neither an extension nor any part running within the postmaster's reign.
>
> 1. When joining a cluster, there's no PGDATA to run postmaster on top of.

You can join the cluster from a pg_basebackup of its master, so I don't
get why this is an anti-extension restriction.

> 2. After failover, the old primary node must rejoin the cluster by running pg_rewind and following the timeline switch.

You can run bash from an extension; what's the point?

> The system in hand must be able to manipulate PGDATA without starting Postgres.

--
Best regards,
Kirill Reshke



Re: Built-in Raft replication

From
Andrey Borodin
Date:

> On 16 Apr 2025, at 10:39, Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> You can run bash from extension, what's the point?

You cannot run bash that will stop the backend that is running it.


Best regards, Andrey Borodin.



Re: Built-in Raft replication

From: Ashutosh Bapat
On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > We may build an extension which
> > has a similar role in PostgreSQL world as zookeeper in Hadoop.
>
> Patroni, pg_consul and others already use zookeeper, etcd and similar systems for consensus.
> Is it any better as extension than as etcd?

I feel so. An extension runs within a PostgreSQL process and uses
the same protocol as PostgreSQL, whereas etcd is another process and
another protocol.

>
> > It can
> > be then used for other distributed systems as well - like shared
> > nothing clusters based on FDW.
>
> I didn’t get FDW analogy. Why other distributed systems should choose Postgres extension over Zookeeper?

By other distributed systems I mean PostgreSQL distributed systems -
FDW-based native sharding, native replication, or a system which uses
both.

>
> > There's already a proposal to bring
> > CREATE SERVER to the world of logical replication - so I see these two
> > worlds uniting in future.
>
> Again, I’m lost here. Which two worlds?

Logical replication and FDW based native sharding.

>
> > The
> > distributed system based on logical replication or FDW or both will
> > use this ensemble to manage its shared state. The same ensemble can be
> > shared across multiple distributed clusters if it has scaling
> > capabilities.
>
> Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to
> coordinate pg_consuls.
>
> Actually, scalability is the opposite of the topic of this thread. Let me explain.
> Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is
> proposing to shorten this loop and make Postgres use its own built-in automatic failover.
>
> So, existing tooling allows you to have 3 hosts for the DCS, with a majority of 2 hosts able to elect a new leader in
> case of failover.
> And you can have only 2 hosts for Postgres - primary and standby. You can have 2 big Postgres machines with 64 CPUs,
> and 3 one-CPU hosts for ZooKeeper/etcd.
>
> If you use built-in failover, you have to resort to 3 big Postgres machines because you need a 2/3 majority. Of course,
> you can install a MySQL-style arbiter - a host that has no real PGDATA and only participates in voting. But this is a
> solution to a problem induced by built-in autofailover.

Users find it a waste of resources to deploy 3 big PostgreSQL
instances just for HA where 2 suffice, even if they deploy 3
lightweight DCS instances. Having only some of the nodes act as DCS
and others purely PostgreSQL nodes will reduce the waste of resources.

--
Best Wishes,
Ashutosh Bapat



Re: Built-in Raft replication

From
Andrey Borodin
Date:

> On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> Having only some of the nodes act as DCS
> and others purely PostgreSQL nodes will reduce waste of resources.

But typically you need more DCS nodes than PostgreSQL nodes. Did you mean
"Having only some of the nodes act as PostgreSQL and others purely DCS nodes will reduce waste of resources"?


Best regards, Andrey Borodin.


Re: Built-in Raft replication

From
Ashutosh Bapat
Date:
On Wed, Apr 16, 2025 at 11:57 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
> > On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Having only some of the nodes act as DCS
> > and others purely PostgreSQL nodes will reduce waste of resources.
>
> But typically you need more DCS nodes than PostgreSQL nodes. Did you mean

In a small HA setup this might be true. But not when there are many
replicas. But ...

> “Having only some nodes act as PostgreSQL and others purely DCS nodes will reduce waste of resources”?

I mean, whatever the setup may be, one shouldn't be required to deploy
a big PostgreSQL server just because the DCS needs a majority.

--
Best Wishes,
Ashutosh Bapat



Re: Built-in Raft replication

From
Michael Banck
Date:
Hi,

On Wed, Apr 16, 2025 at 10:24:48AM +0500, Andrey Borodin wrote:
> I think I can provide some reasons why it can be neither an extension
> nor any part running within the postmaster's reign.
> 
> 1. When joining a cluster, there's no PGDATA to run postmaster on top
> of.
> 
> 2. After failover, the old primary node must rejoin the cluster by
> running pg_rewind and following the timeline switch.
> 
> The system in hand must be able to manipulate PGDATA without
> starting Postgres.

Yeah, while you could maybe implement some/all of the RAFT protocol in
an extension, actually building something useful on top of it with regard
to high availability or distributed whatever does not look feasible.
 
> My question to Konstantin is: why wouldn't you just add Raft to
> Patroni?

Patroni can use pysyncobj, which is a Python implementation of RAFT, so
then you do not need an external RAFT provider like etcd, consul or
zookeeper. However, it is deemed deprecated by the Patroni authors due
to being difficult to debug when it breaks.

I guess a better Python implementation of RAFT for Patroni to use, or
for Patroni to implement it itself, would help, but I believe nobody is
working on the latter right now, nor has any plans to do so. And there
also does not seem to be anybody working on a better pysyncobj.

> Is there a reason why something like Patroni is not in core and no one
> rushes to get it in?  Everyone is using it, or a system like it.

Well, Patroni is written in Python, for starters.  It also does a lot
more than just leader election / cluster config. So I think nobody has
seriously thought about proposing to put Patroni into core so far.

I guess the current proposal tries to be a step toward "something like
Patroni in core" if you tilt your head a little. It's just that the
whole thing would be a really big step for Postgres, maybe similar to
deciding we wanted in-core replication way back when.


Michael



Re: Built-in Raft replication

From
Konstantin Osipov
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [25/04/16 11:05]:
> Nikolay Samokhvalov <nik@postgres.ai> writes:
> > This is exactly what I wanted to write as well. The idea is great. At the
> > same time, I think, consensus on many decisions will be extremely hard to
> > reach, so this project has a high risk of being very long. Unless it's an
> > extension, at least in the beginning.
> 
> Yeah.  The two questions you'd have to get past to get this into PG
> core are:
> 
> 1. Why can't it be an extension?  (You claimed it would work more
> seamlessly in core, but I don't think you've made a proven case.)

I think this can best be addressed when the discussion moves on to
an architecture design record, where the UX and implementation
details are outlined. I'm sure there can be a lot of bike-shedding
on that part. For now I merely wanted to know:
- whether there is a reason this will never be accepted;
- whether someone is already working on this.

From the replies I sense that while there is quite a bit of
scepticism about it ever making its way into the trunk, generally
there is no aversion to it. If my understanding is right,
it's a decent start.

> 2. Why depend on Raft rather than some other project?
> 
> Longtime PG developers are going to be particularly hard on point 2,
> because we have a track record now of outliving outside projects
> that we thought we could rely on.  One example here is the Snowball
> stemmer; while its upstream isn't quite dead, it's twitching only
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.  And then there's libxml2.  And uuid-ossp.
> And autoconf.  And various documentation toolchains.  Need I go on?
> 
> The great advantage of implementing an outside dependency in an
> extension is that if the depended-on project dies, we can say a few
> words of mourning and move on.  It's a lot harder to walk away from
> in-core features.

Raft is an algorithm, not a library. For a quick start the project
could use an existing library - I'd pick tikv's raft-rs, which
happens to be implemented in Rust - but going forward I'd guess the
community will want to have a plain C implementation.

There is a plethora of C implementations out there, but in my
somewhat educated opinion none are good enough for PostgreSQL's
standards or purposes: ideally the protocol should be fully
isolated from storage and transport, and extensively tested, with
randomized & fault-injection tests being a priority. Most of the C
implementations I've seen are built by enthusiasts as
self-education projects.
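
To show what I mean by isolation (an illustrative sketch, not an
actual API proposal): the protocol core should only see callback
tables, so a randomized test harness can drive it deterministically
with no disks or sockets involved.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    bool (*append)(void *st, uint64_t index, uint64_t term,
                   const void *data, size_t len);
    bool (*truncate_suffix)(void *st, uint64_t from_index);
    bool (*save_hard_state)(void *st, uint64_t term, uint64_t voted_for);
    void *st;
} RaftStorage;

typedef struct
{
    void (*send)(void *tp, uint64_t to_node, const void *msg, size_t len);
    void *tp;
} RaftTransport;

typedef struct RaftNode RaftNode;       /* opaque protocol state */

/* The core is pure: it consumes messages and ticks, and produces
 * effects only through the two callback tables, so tests can plug in
 * in-memory fakes and inject faults deterministically. */
extern RaftNode *raft_create(uint64_t self_id,
                             const RaftStorage *storage,
                             const RaftTransport *transport);
extern void raft_step(RaftNode *node, const void *msg, size_t len);
extern void raft_tick(RaftNode *node);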

So at some point the project will need its own Raft
implementation. The good news is that the design of Raft internals
has been fairly well polished across the various
implementations in many different programming languages, so
it should be a fairly straightforward job.

Regarding maintenance: since its first publication back in ~2013
the protocol has stabilized quite a bit. The core of the protocol
doesn't get many changes - I'd say nearly none - and that's also
noticeable in the implementations; e.g. etcd-raft and raft-rs from tikv
don't get many new commits nowadays.

Now, a broader question is whether Raft is an optimal
long-term solution for log replication. Generally Raft is
leader-based, so in theory it could be replaced with a leaderless
protocol - e.g. Fast Paxos, EPaxos, and newer
developments on top of those. To the best of my understanding, all
leaderless algorithms which provide a single-round-trip commit cost
require co-designing the transaction and replication
layers - which may be a far more intrusive change than adding Raft
on top of the existing synchronous replication in PostgreSQL.

Given that Raft already provides an amortized single-round-trip
commit time, and the goal is simplicity of UX and unification,
I'd say it's wise to wait for the leaderless approaches
to mature.

At the end of the day, there is always a trade-off between doing
something today and waiting for perfection, but in the case of Raft,
in my personal opinion, the balance is just right.

-- 
Konstantin Osipov, Moscow, Russia



Re: Built-in Raft replication

From: Konstantin Osipov
* Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> [25/04/16 11:06]:
> > My view is that what Konstantin wants is automatic replication topology management. For some reason this technology
> > is called HA, DCS, Raft, Paxos and many other scary words. But basically it manages the primary_conninfo of some
> > nodes to provide some fault-tolerance properties. I'd start to design from here, not from the Raft paper.
> >
> In my experience, the load of managing hundreds of replicas which all
> participate in the RAFT protocol becomes more than the regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replicas.

I think this experience needs to be spelled out in more detail. There
are implementations in the field that are less efficient than others.

Early etcd-raft didn't have pre-voting and had a "bastardized"
(their own word) implementation of configuration changes
which didn't use joint consensus.

Then there is a liveness issue if leader election is implemented
in a straightforward way in large clusters. But this is addressed by
scaling up the randomized election timeout with the cluster size and
by converting most participants to non-voters in large clusters.
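
For example (a sketch with arbitrary constants, not a recommendation):

#include <stdint.h>
#include <stdlib.h>

/* Election timeout that grows with the number of voters; randomized
 * so that competing candidates rarely collide. */
static uint32_t
election_timeout_ms(uint32_t n_voters)
{
    uint32_t base = 150 + 50 * n_voters;

    return base + (uint32_t) (rand() % base);   /* in [base, 2*base) */
}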

Raft replication, again, if implemented in a naive way, would
require O(outstanding transactions * number of replicas) RAM.
But it doesn't have to be naive.
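
A non-naive layout keeps one copy of the log (the WAL itself) plus a
small cursor per follower; the commit index falls out of the cursors.
A sketch (all names invented):

#include <stdint.h>
#include <stdlib.h>

#define MAX_FOLLOWERS 64

typedef struct
{
    uint64_t next_index;    /* next log index to send */
    uint64_t match_index;   /* highest index known to be replicated */
} FollowerCursor;

typedef struct
{
    FollowerCursor followers[MAX_FOLLOWERS];    /* O(replicas) memory */
    uint32_t       n_followers;
} ReplicationState;

static int
cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x < y) - (x > y);
}

/* Commit index = the highest index present on a majority, i.e. the
 * median of everyone's progress; no per-transaction state at all. */
static uint64_t
commit_index(const ReplicationState *rs, uint64_t leader_last)
{
    uint64_t acked[MAX_FOLLOWERS + 1];
    uint32_t n = 0;

    acked[n++] = leader_last;                   /* the leader itself */
    for (uint32_t i = 0; i < rs->n_followers; i++)
        acked[n++] = rs->followers[i].match_index;

    qsort(acked, n, sizeof(uint64_t), cmp_desc);
    return acked[n / 2];
}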

To sum up, I am not aware of any fundamental limitations in this
area.

-- 
Konstantin Osipov, Moscow, Russia



Re: Built-in Raft replication

From
Konstantin Osipov
Date:
* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
> > 
> > Hmm, OK.  I thought that the proposal involved relying on some existing
> > code, but re-reading the thread that was said nowhere.  Still, that
> > moves it from a large project to a really large project :-(
> > 
> > I continue to think that it'd be best to try to implement it as
> > an extension, at least up till the point of finding show-stopping
> > reasons why it cannot be that.
> 
> I think I can provide some reasons why it can be neither an extension nor any part running within the postmaster's reign.
> 
> 1. When joining a cluster, there's no PGDATA to run postmaster on top of.
> 
> 2. After failover, the old primary node must rejoin the cluster by running pg_rewind and following the timeline switch.
> 
> The system in hand must be able to manipulate PGDATA without starting Postgres.
> 
> My question to Konstantin is: why wouldn't you just add Raft to Patroni? Is there a reason why something like Patroni
> is not in core and no one rushes to get it in?
> Everyone is using it, or a system like it.

Raft uses the same WAL to store configuration change records as is used
for commit records. This is at the core of the correctness of the
algorithm. This is also my biggest concern with the correctness of
Patroni - but to the best of my knowledge 90%+ of
Patroni use cases use a "fixed" quorum size that is defined at the
start of the deployment and never/rarely changes.
Contrast that with being able to add a replica to the quorum at any
time: all it takes is starting the replica and pointing
it at the existing cluster. This greatly simplifies the UX.
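
To illustrate the point (the record layout is invented, it is not
anyone's actual format): membership changes ride the very same
ordered log as commits, and a node adopts the new quorum rules as
soon as the configuration record is appended, not when it commits -
that's the Raft rule that keeps majorities overlapping during the
change.

#include <stdint.h>

typedef enum
{
    LOG_COMMIT,         /* ordinary transaction commit record */
    LOG_ADD_VOTER,      /* configuration change records ride ... */
    LOG_REMOVE_VOTER    /* ... the same ordered log as commits */
} LogRecType;

typedef struct
{
    LogRecType type;
    uint64_t   term;
    uint64_t   index;
    uint64_t   node_id;     /* used by the config-change records */
} LogRec;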

-- 
Konstantin Osipov, Moscow, Russia



Re: Built-in Raft replication

From
Konstantin Osipov
Date:
* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > You can run bash from an extension; what's the point?
> 
> You cannot run bash that will stop the backend that is running it.

You're right, there is a chicken-and-egg problem when you add
Raft to an existing project, and re-bootstrap
becomes a trick - but it's a plumbing trick.

The new member needs to generate and persist a globally unique
identifier as the first step. Later it can
reintroduce itself to the cluster, provided this identifier
is preserved in the new incarnation (popen + fork).
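
A sketch of that first step (the path and id format are made up; a
real implementation would also fsync the file and its directory):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Load the instance id if this is a restart of an existing joiner;
 * otherwise generate one and persist it before first contact. */
static int
load_or_create_instance_id(const char *path, char *id, size_t idlen)
{
    FILE *f = fopen(path, "r");

    if (f != NULL && fgets(id, (int) idlen, f) != NULL)
    {
        fclose(f);
        return 0;               /* rejoining: reuse the existing id */
    }
    if (f != NULL)
        fclose(f);

    srand((unsigned) time(NULL));
    snprintf(id, idlen, "%08x-%08x",
             (unsigned) rand(), (unsigned) rand());

    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fputs(id, f);
    fflush(f);                  /* a real version would fsync here */
    fclose(f);
    return 0;
}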

-- 
Konstantin Osipov, Moscow, Russia