Thread: Built-in Raft replication
Hi,

I am considering starting work on implementing a built-in Raft
replication for PostgreSQL. Raft's advantage is that it unifies log
replication, cluster configuration/membership/topology management and
initial state transfer into a single protocol.

Currently the cluster configuration/topology is often managed by
Patroni or similar tools; however, there seem to be certain usability
drawbacks with this approach:

- it's a separate tool, requiring an external state provider like etcd;
  Raft could store its configuration in system tables, which is also an
  observability improvement, since everyone could look up cluster state
  the same way as everything else;
- same for the watchdog; Raft has a built-in failure detector that is
  configuration-aware;
- flexible quorums; currently the quorum size is configurable; with
  built-in Raft, extending the quorum could be a matter of starting a
  new node and pointing it at an existing cluster.

Going forward I can see PostgreSQL providing transparent bouncing at
the pg_wire level: given that Raft state is now part of the system,
drivers and all cluster nodes could easily see where the leader is.

If anyone is working on Raft already I'd be happy to discuss the
details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
cautious of starting work in isolation or knowing there is no interest
in accepting the feature into the trunk.

Thanks,

--
Konstantin Osipov
On Mon, 14 Apr 2025 at 22:15, Konstantin Osipov <kostja.osipov@gmail.com> wrote:
>
> Hi,

Hi

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.

Just some thoughts off the top of my mind, if you need my voice here:

I have a hard time believing the community will be positive about this
change in-core. It has more chances as a contrib extension. In fact, if
we want a built-in consensus algorithm, Paxos is a better option,
because you can use PostgreSQL as local crash-safe storage for
single-decree Paxos: just store your state (ballot number, last vote)
in a heap table.

OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already written parts of it (so it is not append-only,
unlike WAL). If you have a production system which maintains two kinds
of logs with different semantics, it is a very hard system to maintain.

There is actually a prod-ready (non open source) implementation of Raft
as an extension, called BiHA, by pgpro.

--
Best regards,
Kirill Reshke
* Kirill Reshke <reshkekirill@gmail.com> [25/04/14 20:48]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
>
> Just some thoughts off the top of my mind, if you need my voice here:
>
> I have a hard time believing the community will be positive about this
> change in-core. It has more chances as a contrib extension. In fact, if
> we want a built-in consensus algorithm, Paxos is a better option,
> because you can use PostgreSQL as local crash-safe storage for
> single-decree Paxos: just store your state (ballot number, last vote)
> in a heap table.

But Raft is a log replication algorithm, not a consensus algorithm. It
does use consensus, but only for leader election. Paxos could be used
for log replication, but that would be expensive. In fact etcd uses
Raft, and etcd is used by Patroni. So I completely lost your line of
thought here.

> OTOH Raft needs to write its own log, and what's worse, it sometimes
> needs to remove already written parts of it (so it is not append-only,
> unlike WAL). If you have a production system which maintains two kinds
> of logs with different semantics, it is a very hard system to maintain.

My proposal is exactly to replace (or rather, extend) the current
synchronous log replication with Raft. Entry removal can be stacked on
top of an append-only format, and production implementations exist
which do that. So, no, it's a single log, and in fact the current WAL
will do.

> There is actually a prod-ready (non open source) implementation of Raft
> as an extension, called BiHA, by pgpro.

My guess is that BiHA is an extension because proprietary code is
easier to maintain that way. I'd rather say the fact that there is a
proprietary implementation out in the field confirms it could be a good
idea to have it in the PostgreSQL trunk. In any case I'm interested in
contributing to the trunk, not building a proprietary module/fork.

--
Konstantin Osipov
14.04.2025 20:44, Kirill Reshke wrote:
> OTOH Raft needs to write its own log, and what's worse, it sometimes
> needs to remove already written parts of it (so it is not append-only,
> unlike WAL). If you have a production system which maintains two kinds
> of logs with different semantics, it is a very hard system to maintain.

Raft is a log replication protocol which uses a log position and a
term. But... PostgreSQL already has a log position and a term in its
WAL structure: PostgreSQL's timeline is actually the term. A Raft
implementer just needs to correct the rules for term/timeline
switching: instead of "the next TimeLine number is an increment of the
largest known TimeLine number" it needs to be "the next TimeLine number
is the result of leader election".

And yes, "it sometimes needs to remove already written parts of it".
But... that is exactly what every PostgreSQL cluster manager has to do
to join the previous leader as a follower of the new leader -
pg_rewind.

So, PostgreSQL already has 70-90% of the Raft implementation details.
Raft doesn't have to be implemented in PostgreSQL. Raft has to be
finished!!!

PS: One of the biggest issues is the forced snapshot on replica
promotion. It really slows down leader switch time. It looks like it is
not really needed, or some small workaround should be enough.

--
regards
Yura Sokolov aka funny-falcon
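To make the term/timeline parallel concrete, here is a rough
illustrative sketch in C (not PostgreSQL code; the names and the
majority check are hypothetical) of how the timeline selection rule
would differ if timelines doubled as Raft terms:

    #include <stdint.h>

    typedef uint32_t TimeLineID;

    /* Today: promotion takes the highest timeline known from the archive
     * and local pg_wal, and simply increments it. */
    TimeLineID
    next_timeline_current(TimeLineID newest_known_tli)
    {
        return newest_known_tli + 1;
    }

    /* With Raft-style rules: a candidate proposes term = current + 1 and
     * may only start using it after a majority of voters has granted a
     * vote for that term; otherwise the old term (timeline) stays. */
    TimeLineID
    next_timeline_raft(TimeLineID current_term, int votes_granted, int n_voters)
    {
        TimeLineID  candidate_term = current_term + 1;

        if (votes_granted * 2 > n_voters)
            return candidate_term;  /* election won */
        return current_term;        /* election lost or still pending */
    }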
Hi Konstantin,

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 TB of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago), which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to have to recover from a backup if something happens to your
primary server. The common wisdom is: if you can avoid building a
distributed system, don't build one.

Which brings the question of whether we want to maintain something like
this (which will include logic for when a node joins or leaves the
cluster, a proxy server / service discovery for clients, test cases and
infrastructure for all this, plus upgrading the cluster, docs, ...) for
a presumably few users whose business doesn't fit on a single server
*and* who want automatic failover (not manual) *and* who don't use
Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting, personally I'm inclined to think that
it's better to invest community resources into something else.

--
Best regards,
Aleksander Alekseev
15.04.2025 13:20, Aleksander Alekseev wrote:
> Hi Konstantin,
>
>> I am considering starting work on implementing a built-in Raft
>> replication for PostgreSQL.
>
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
>
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 TB of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago), which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to have to recover from a backup if something happens to your
> primary server. The common wisdom is: if you can avoid building a
> distributed system, don't build one.
>
> Which brings the question of whether we want to maintain something like
> this (which will include logic for when a node joins or leaves the
> cluster, a proxy server / service discovery for clients, test cases and
> infrastructure for all this, plus upgrading the cluster, docs, ...) for
> a presumably few users whose business doesn't fit on a single server
> *and* who want automatic failover (not manual) *and* who don't use
> Patroni/Stolon/CockroachDB/Neon/... already.
>
> Although the idea is tempting, personally I'm inclined to think that
> it's better to invest community resources into something else.

Raft is not for "commodity hardware". It is for reliability.

Yes, it needs 3 servers instead of 2. It costs more than simple
replication with "manual" failover. But if a business needs high
availability, it wouldn't rely on "manual" failover. And if a business
relies on correctness, it wouldn't rely on any solution which
"automatically switches between two replicas", because there is no way
to guarantee correctness with just two replicas. And many stories of
lost transactions with Patroni/Stolon already confirm this thesis.

CockroachDB/Neon - they are good solutions for distributed systems.
But, as you've said, many clients don't need distributed systems. They
just need reliable replication.

I've been working in a company which uses MongoDB (3.6 and up) as its
primary storage, and it seemed to me like a godsend. Everything just
worked. Replication was as reliable as one could imagine. It outlived
several hardware incidents without manual intervention. It allowed
cluster maintenance (software and hardware upgrades) without
application downtime.

I really dream that PostgreSQL will be as reliable as MongoDB without
the need for external services.

--
regards
Yura Sokolov aka funny-falcon
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 12:02]:
> > OTOH Raft needs to write its own log, and what's worse, it sometimes
> > needs to remove already written parts of it (so it is not append-only,
> > unlike WAL). If you have a production system which maintains two kinds
> > of logs with different semantics, it is a very hard system to maintain.
>
> Raft is a log replication protocol which uses a log position and a
> term. But... PostgreSQL already has a log position and a term in its
> WAL structure: PostgreSQL's timeline is actually the term. A Raft
> implementer just needs to correct the rules for term/timeline
> switching: instead of "the next TimeLine number is an increment of the
> largest known TimeLine number" it needs to be "the next TimeLine number
> is the result of leader election".
>
> And yes, "it sometimes needs to remove already written parts of it".
> But... that is exactly what every PostgreSQL cluster manager has to do
> to join the previous leader as a follower of the new leader -
> pg_rewind.
>
> So, PostgreSQL already has 70-90% of the Raft implementation details.
> Raft doesn't have to be implemented in PostgreSQL. Raft has to be
> finished!!!
>
> PS: One of the biggest issues is the forced snapshot on replica
> promotion. It really slows down leader switch time. It looks like it is
> not really needed, or some small workaround should be enough.

I'd say my pet peeve is storing the cluster topology (the so-called
Raft configuration) inside the database, not in an external state
provider. Agree on other points.

--
Konstantin Osipov
Hi Yura,

> I've been working in a company which uses MongoDB (3.6 and up) as its
> primary storage, and it seemed to me like a godsend. Everything just
> worked. Replication was as reliable as one could imagine. It outlived
> several hardware incidents without manual intervention. It allowed
> cluster maintenance (software and hardware upgrades) without
> application downtime. I really dream that PostgreSQL will be as
> reliable as MongoDB without the need for external services.

I completely understand. I had exactly the same experience with Stolon.
Everything just worked. And the setup took like 5 minutes.

It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences, which costs money. Or perhaps people are
just happy with Patroni. I'm not sure what state Stolon is in today.

--
Best regards,
Aleksander Alekseev
15.04.2025 14:15, Aleksander Alekseev wrote:
> Hi Yura,
>
>> I've been working in a company which uses MongoDB (3.6 and up) as its
>> primary storage, and it seemed to me like a godsend. Everything just
>> worked. Replication was as reliable as one could imagine. It outlived
>> several hardware incidents without manual intervention. It allowed
>> cluster maintenance (software and hardware upgrades) without
>> application downtime. I really dream that PostgreSQL will be as
>> reliable as MongoDB without the need for external services.
>
> I completely understand. I had exactly the same experience with Stolon.
> Everything just worked. And the setup took like 5 minutes.
>
> It's a pity this project doesn't seem to get as much attention as
> Patroni. Probably because attention requires traveling and presenting
> the project at conferences, which costs money. Or perhaps people are
> just happy with Patroni. I'm not sure what state Stolon is in today.

But the key point: if PostgreSQL is improved a bit, there will be no
need for either Patroni or Stolon. Isn't that great?

--
regards
Yura Sokolov aka funny-falcon
* Aleksander Alekseev <aleksander@timescale.com> [25/04/15 13:20]:
> > I am considering starting work on implementing a built-in Raft
> > replication for PostgreSQL.
>
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
>
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 TB of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago), which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to have to recover from a backup if something happens to your
> primary server. The common wisdom is: if you can avoid building a
> distributed system, don't build one.
>
> Which brings the question of whether we want to maintain something like
> this (which will include logic for when a node joins or leaves the
> cluster, a proxy server / service discovery for clients, test cases and
> infrastructure for all this, plus upgrading the cluster, docs, ...) for
> a presumably few users whose business doesn't fit on a single server
> *and* who want automatic failover (not manual) *and* who don't use
> Patroni/Stolon/CockroachDB/Neon/... already.
>
> Although the idea is tempting, personally I'm inclined to think that
> it's better to invest community resources into something else.

My personal takeaway from this as a community member would be seamless
coordinator failover in Greenplum and all of its forks (CloudBerry,
Greengage, synxdata, what not). I also imagine there are a number of
PostgreSQL derivatives that could benefit from built-in transparent
failover, since it standardizes the solution space.

--
Konstantin Osipov
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 14:02]:
> I've been working in a company which uses MongoDB (3.6 and up) as its
> primary storage, and it seemed to me like a godsend. Everything just
> worked. Replication was as reliable as one could imagine. It outlived
> several hardware incidents without manual intervention. It allowed
> cluster maintenance (software and hardware upgrades) without
> application downtime. I really dream that PostgreSQL will be as
> reliable as MongoDB without the need for external services.

Thanks for pointing out MongoDB - built-in Raft would help FerretDB as
well.

--
Konstantin Osipov
On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:
> If anyone is working on Raft already I'd be happy to discuss the
> details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
> cautious of starting work in isolation or knowing there is no interest
> in accepting the feature into the trunk.
Putting aside the technical concerns about this specific idea, it's best to start by laying out a very detailed plan of what you would want to change, and what you see as the costs and benefits. It's also extremely helpful to think about developing this as an extension. If you get stuck due to extension limitations, propose additional hooks. If the hooks will not work, explain why.
Getting this into core is going to be a long, multi-year effort, in which people are going to be pushing back the entire time, so prepare yourself for that. My immediate retort is going to be: why would we add this if there are existing tools that already do the job just fine? Postgres has lots of tasks that it is happy to let other programs/OS subsystems/extensions/etc. handle instead.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
* Greg Sabino Mullane <htamfids@gmail.com> [25/04/15 18:08]:
> > If anyone is working on Raft already I'd be happy to discuss the
> > details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
> > cautious of starting work in isolation or knowing there is no interest
> > in accepting the feature into the trunk.
>
> Putting aside the technical concerns about this specific idea, it's
> best to start by laying out a very detailed plan of what you would want
> to change, and what you see as the costs and benefits. It's also
> extremely helpful to think about developing this as an extension. If
> you get stuck due to extension limitations, propose additional hooks.
> If the hooks will not work, explain why.
>
> Getting this into core is going to be a long, multi-year effort, in
> which people are going to be pushing back the entire time, so prepare
> yourself for that. My immediate retort is going to be: why would we add
> this if there are existing tools that already do the job just fine?
> Postgres has lots of tasks that it is happy to let other programs/OS
> subsystems/extensions/etc. handle instead.

I had hoped I explained why external state providers cannot provide the
same seamless UX as built-in ones. The key idea is to have built-in
configuration management, so that adding and removing replicas does not
require changes in multiple disjoint parts of the installation (server
configurations, proxies, clients).

I understand and accept that it's a multi-year effort, but I do not
accept the retort - my main point is that external tools are not a
replacement, and I'd like to reach consensus on that.

--
Konstantin Osipov, Moscow, Russia
On Tue, Apr 15, 2025 at 8:08 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com> wrote:
>> If anyone is working on Raft already I'd be happy to discuss the
>> details. I am fairly new to the PostgreSQL hackers ecosystem, so I am
>> cautious of starting work in isolation or knowing there is no interest
>> in accepting the feature into the trunk.
>
> Putting aside the technical concerns about this specific idea, it's
> best to start by laying out a very detailed plan of what you would want
> to change, and what you see as the costs and benefits. It's also
> extremely helpful to think about developing this as an extension. If
> you get stuck due to extension limitations, propose additional hooks.
> If the hooks will not work, explain why.
This is exactly what I wanted to write as well. The idea is great. At the same time, I think, consensus on many decisions will be extremely hard to reach, so this project has a high risk of being very long. Unless it's an extension, at least in the beginning.
Nik
Nikolay Samokhvalov <nik@postgres.ai> writes:
> This is exactly what I wanted to write as well. The idea is great. At
> the same time, I think, consensus on many decisions will be extremely
> hard to reach, so this project has a high risk of being very long.
> Unless it's an extension, at least in the beginning.

Yeah. The two questions you'd have to get past to get this into PG
core are:

1. Why can't it be an extension? (You claimed it would work more
seamlessly in core, but I don't think you've made a proven case.)

2. Why depend on Raft rather than some other project?

Longtime PG developers are going to be particularly hard on point 2,
because we have a track record now of outliving outside projects
that we thought we could rely on. One example here is the Snowball
stemmer; while its upstream isn't quite dead, it's twitching only
feebly, and seems to have a bus factor of 1. Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream. And then there's libxml2. And uuid-ossp.
And autoconf. And various documentation toolchains. Need I go on?

The great advantage of implementing an outside dependency in an
extension is that if the depended-on project dies, we can say a few
words of mourning and move on. It's a lot harder to walk away from
in-core features.

			regards, tom lane
> On 16 Apr 2025, at 04:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> feebly, and seems to have a bus factor of 1. Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.

I think that's what Konstantin is proposing: to have our own Raft
implementation, without dependencies.

IMO, to better understand what is proposed, we need some more
description of the proposed system. How will the new system be
configured? initdb and what then? How does a new node join the cluster?
What runs pg_rewind when necessary?

Some time ago Peter E proposed being able to start replication on top
of an empty directory, so that the initial sync would be more
straightforward. And Heikki proposed removing the archive race
condition when choosing a new timeline. I think these steps are gradual
movement in the same direction.

My view is that what Konstantin wants is automatic replication topology
management. For some reason this technology is called HA, DCS, Raft,
Paxos and many other scary words. But basically it manages
primary_conninfo of some nodes to provide some fault-tolerance
properties. I'd start the design from here, not from the Raft paper.

Best regards, Andrey Borodin.
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> I think that's what Konstantin is proposing: to have our own Raft
> implementation, without dependencies.

Hmm, OK. I thought that the proposal involved relying on some existing
code, but re-reading the thread, that was said nowhere. Still, that
moves it from a large project to a really large project :-(

I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.

			regards, tom lane
On Wed, Apr 16, 2025 at 9:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> My view is that what Konstantin wants is automatic replication topology
> management. For some reason this technology is called HA, DCS, Raft,
> Paxos and many other scary words. But basically it manages
> primary_conninfo of some nodes to provide some fault-tolerance
> properties. I'd start the design from here, not from the Raft paper.

In my experience, the load of managing hundreds of replicas which all
participate in the Raft protocol becomes more than the regular
transaction load. So making every replica a Raft participant will
affect the ability to deploy hundreds of replicas.

We may build an extension which has a similar role in the PostgreSQL
world as ZooKeeper has in Hadoop. It can then be used for other
distributed systems as well - like shared-nothing clusters based on
FDW. There's already a proposal to bring CREATE SERVER to the world of
logical replication, so I see these two worlds uniting in the future.

The way I imagine it is that some PostgreSQL instances, which have this
extension installed, will act as a Raft cluster (similar to a ZooKeeper
ensemble or an etcd cluster). A distributed system based on logical
replication or FDW or both will use this ensemble to manage its shared
state. The same ensemble can be shared across multiple distributed
clusters if it has scaling capabilities.

--
Best Wishes,
Ashutosh Bapat
> On 16 Apr 2025, at 09:33, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> In my experience, the load of managing hundreds of replicas which all
> participate in the Raft protocol becomes more than the regular
> transaction load. So making every replica a Raft participant will
> affect the ability to deploy hundreds of replicas.

No need to make all standbys voting, and no need to make the topology
flat. pg_consul uses 2/3 or 3/5 HA groups and cascades all other nodes
from the HA group. Existing tools already solve the original problem;
Konstantin is just proposing to solve it in some standard "official"
way.

> We may build an extension which has a similar role in the PostgreSQL
> world as ZooKeeper has in Hadoop.

Patroni, pg_consul and others already use ZooKeeper, etcd and similar
systems for consensus. Is it any better as an extension than as etcd?

> It can then be used for other distributed systems as well - like
> shared-nothing clusters based on FDW.

I didn't get the FDW analogy. Why would other distributed systems
choose a Postgres extension over ZooKeeper?

> There's already a proposal to bring CREATE SERVER to the world of
> logical replication, so I see these two worlds uniting in the future.

Again, I'm lost here. Which two worlds?

> The way I imagine it is that some PostgreSQL instances, which have this
> extension installed, will act as a Raft cluster (similar to a ZooKeeper
> ensemble or an etcd cluster).

That's exactly what is proposed here.

> A distributed system based on logical replication or FDW or both will
> use this ensemble to manage its shared state. The same ensemble can be
> shared across multiple distributed clusters if it has scaling
> capabilities.

Yes, shared DCS are common these days. AFAIK, we use one ZooKeeper
instance per hundred Postgres clusters to coordinate pg_consuls.

Actually, scalability is the opposite of the topic of this thread. Let
me explain. Currently, Postgres automatic failover tools rely on
databases with built-in automatic failover. Konstantin is proposing to
shorten this loop and make Postgres use its own built-in automatic
failover.

So, existing tooling allows you to have 3 hosts for the DCS, with a
majority of 2 hosts able to elect a new leader in case of failover, and
only 2 hosts for Postgres - primary and standby. You can have 2 big
Postgres machines with 64 CPUs and 3 one-CPU hosts for ZooKeeper/etcd.

If you use built-in failover, you have to resort to 3 big Postgres
machines, because you need a 2/3 majority. Of course, you can install a
MySQL-style arbiter - a host that has no real PGDATA and only
participates in voting. But this is a solution to a problem induced by
built-in autofailover.

Best regards, Andrey Borodin.
> On 16 Apr 2025, at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> I think that's what Konstantin is proposing: to have our own Raft
>> implementation, without dependencies.
>
> Hmm, OK. I thought that the proposal involved relying on some existing
> code, but re-reading the thread, that was said nowhere. Still, that
> moves it from a large project to a really large project :-(
>
> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

I think I can provide some reasons why it can be neither an extension
nor any part running within the postmaster's reign:

1. When joining the cluster, there's no PGDATA to run postmaster on top
of.

2. After failover, the old primary node must rejoin the cluster by
running pg_rewind and following the timeline switch.

The system in hand must be able to manipulate PGDATA without starting
Postgres.

My question to Konstantin is: why wouldn't you just add Raft to
Patroni? Is there a reason why something like Patroni is not in core
and no one rushes to get it in? Everyone is using it, or a system like
it.

Best regards, Andrey Borodin.
On Wed, 16 Apr 2025 at 10:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I think I can provide some reasons why it can be neither an extension
> nor any part running within the postmaster's reign:
>
> 1. When joining the cluster, there's no PGDATA to run postmaster on top
> of.

You can join the cluster from a pg_basebackup of its master, so I don't
get why this is an anti-extension restriction.

> 2. After failover, the old primary node must rejoin the cluster by
> running pg_rewind and following the timeline switch.

You can run bash from an extension, so what's the point?

> The system in hand must be able to manipulate PGDATA without starting
> Postgres.

--
Best regards,
Kirill Reshke
> On 16 Apr 2025, at 10:39, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> You can run bash from an extension, so what's the point?

You cannot run bash that will stop the backend running that bash.

Best regards, Andrey Borodin.
On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > We may build an extension which has a similar role in the PostgreSQL
> > world as ZooKeeper has in Hadoop.
>
> Patroni, pg_consul and others already use ZooKeeper, etcd and similar
> systems for consensus. Is it any better as an extension than as etcd?

I feel so. An extension runs within a PostgreSQL process and uses the
same protocol as PostgreSQL, whereas etcd is another process and
another protocol.

> > It can then be used for other distributed systems as well - like
> > shared-nothing clusters based on FDW.
>
> I didn't get the FDW analogy. Why would other distributed systems
> choose a Postgres extension over ZooKeeper?

By other distributed systems I mean PostgreSQL distributed systems -
FDW-based native sharding or native replication, or a system which uses
both.

> > There's already a proposal to bring CREATE SERVER to the world of
> > logical replication, so I see these two worlds uniting in the future.
>
> Again, I'm lost here. Which two worlds?

Logical replication and FDW-based native sharding.

> > A distributed system based on logical replication or FDW or both will
> > use this ensemble to manage its shared state. The same ensemble can be
> > shared across multiple distributed clusters if it has scaling
> > capabilities.
>
> Yes, shared DCS are common these days. AFAIK, we use one ZooKeeper
> instance per hundred Postgres clusters to coordinate pg_consuls.
>
> Actually, scalability is the opposite of the topic of this thread. Let
> me explain. Currently, Postgres automatic failover tools rely on
> databases with built-in automatic failover. Konstantin is proposing to
> shorten this loop and make Postgres use its own built-in automatic
> failover.
>
> So, existing tooling allows you to have 3 hosts for the DCS, with a
> majority of 2 hosts able to elect a new leader in case of failover, and
> only 2 hosts for Postgres - primary and standby. You can have 2 big
> Postgres machines with 64 CPUs and 3 one-CPU hosts for ZooKeeper/etcd.
>
> If you use built-in failover, you have to resort to 3 big Postgres
> machines, because you need a 2/3 majority. Of course, you can install a
> MySQL-style arbiter - a host that has no real PGDATA and only
> participates in voting. But this is a solution to a problem induced by
> built-in autofailover.

Users find it a waste of resources to deploy 3 big PostgreSQL instances
just for HA where 2 suffice, even if they deploy 3 lightweight DCS
instances. Having only some of the nodes act as DCS and others purely
as PostgreSQL nodes will reduce the waste of resources.

--
Best Wishes,
Ashutosh Bapat
> On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> Having only some of the nodes act as DCS and others purely as
> PostgreSQL nodes will reduce the waste of resources.

But typically you need more DCS nodes than PostgreSQL nodes. Did you
mean "having only some of the nodes act as PostgreSQL and others purely
as DCS nodes will reduce the waste of resources"?

Best regards, Andrey Borodin.
On Wed, Apr 16, 2025 at 11:57 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 16 Apr 2025, at 11:18, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Having only some of the nodes act as DCS and others purely as
> > PostgreSQL nodes will reduce the waste of resources.
>
> But typically you need more DCS nodes than PostgreSQL nodes. Did you mean

In a small HA setup this might be true, but not when there are many
replicas. But ...

> "having only some of the nodes act as PostgreSQL and others purely as
> DCS nodes will reduce the waste of resources"?

I mean, whatever the setup may be, one shouldn't be required to deploy
a big PostgreSQL server just because the DCS needs a majority.

--
Best Wishes,
Ashutosh Bapat
Hi,

On Wed, Apr 16, 2025 at 10:24:48AM +0500, Andrey Borodin wrote:
> I think I can provide some reasons why it can be neither an extension
> nor any part running within the postmaster's reign:
>
> 1. When joining the cluster, there's no PGDATA to run postmaster on top
> of.
>
> 2. After failover, the old primary node must rejoin the cluster by
> running pg_rewind and following the timeline switch.
>
> The system in hand must be able to manipulate PGDATA without starting
> Postgres.

Yeah, while you could maybe implement some/all of the Raft protocol in
an extension, actually building something useful on top with regards to
high availability or distributed whatever does not look feasible.

> My question to Konstantin is: why wouldn't you just add Raft to
> Patroni?

Patroni can use pysyncobj, which is a Python implementation of Raft, so
you then do not need an external Raft provider like etcd, Consul or
ZooKeeper. However, it is deemed deprecated by the Patroni authors due
to being difficult to debug when it breaks.

I guess a better Python implementation of Raft for Patroni to use, or
Patroni implementing it itself, would help, but I believe nobody is
working on the latter right now, nor has any plans to do so. And there
also does not seem to be anybody working on a better pysyncobj.

> Is there a reason why something like Patroni is not in core and no one
> rushes to get it in? Everyone is using it, or a system like it.

Well, Patroni is written in Python, for starters. It also does a lot
more than just leader election / cluster config. So I think nobody has
seriously thought about proposing to put Patroni into core so far.

I guess the current proposal tries to be a step towards "something like
Patroni in core", if you tilt your head a little. It's just that the
whole thing would be a really big step for Postgres, maybe similar to
deciding we want in-core replication way back when.


Michael
* Tom Lane <tgl@sss.pgh.pa.us> [25/04/16 11:05]:
> Nikolay Samokhvalov <nik@postgres.ai> writes:
> > This is exactly what I wanted to write as well. The idea is great. At
> > the same time, I think, consensus on many decisions will be extremely
> > hard to reach, so this project has a high risk of being very long.
> > Unless it's an extension, at least in the beginning.
>
> Yeah. The two questions you'd have to get past to get this into PG
> core are:
>
> 1. Why can't it be an extension? (You claimed it would work more
> seamlessly in core, but I don't think you've made a proven case.)

I think this can best be addressed when the discussion moves on to an
architecture design record, where the UX and implementation details are
outlined. I'm sure there can be a lot of bike-shedding on that part.
For now I merely wanted to know:
- maybe there is a reason this will never be accepted;
- maybe someone is already working on this.

From the replies I sense that while there is quite a bit of scepticism
about it ever making its way into the trunk, generally there is no
aversion to it. If my understanding is right, it's a decent start.

> 2. Why depend on Raft rather than some other project?
>
> Longtime PG developers are going to be particularly hard on point 2,
> because we have a track record now of outliving outside projects
> that we thought we could rely on. One example here is the Snowball
> stemmer; while its upstream isn't quite dead, it's twitching only
> feebly, and seems to have a bus factor of 1. Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream. And then there's libxml2. And uuid-ossp.
> And autoconf. And various documentation toolchains. Need I go on?
>
> The great advantage of implementing an outside dependency in an
> extension is that if the depended-on project dies, we can say a few
> words of mourning and move on. It's a lot harder to walk away from
> in-core features.

Raft is an algorithm, not a library. For a quick start the project
could use an existing library - I'd pick TiKV's raft-rs, which happens
to be implemented in Rust - but going forward I'd guess the community
will want to have a plain C implementation.

There is a plethora of C implementations out there, but in my somewhat
educated opinion none are good enough for PostgreSQL's standards or
purposes: ideally the protocol should be fully isolated from storage
and transport and extensively tested, with randomized & injection tests
being a priority. Most of the C implementations I've seen are built by
enthusiasts as self-education projects. So at some point the project
will need its own Raft implementation. The good news is that the design
of Raft internals has been polished fairly well across the various
implementations in many different programming languages, so it should
be a fairly straightforward job.

Regarding maintenance: since its first publication back in ~2013 the
protocol has stabilized quite a bit. The core of the protocol doesn't
get many changes - I'd say nearly no changes - and that's also
noticeable in the implementations; e.g. etcd-raft, raft-rs from TiKV,
etc. don't get many new commits nowadays.

Now a broader question is whether Raft is an optimal long-term solution
for log replication. Generally Raft is leader-based, so in theory it
could be replaced with a leaderless protocol - e.g. Fast Paxos, EPaxos,
and newer developments on top of those.
To the best of my understanding, all leaderless algorithms which
provide a single-round-trip commit cost require co-designing the
transaction and replication layers - which may be a far more intrusive
change than adding Raft on top of the existing synchronous replication
in PostgreSQL. Given that Raft already provides an amortized
single-round-trip commit time, and the goal is simplicity of UX and
unification, I'd say it's wise to wait and see whether the leaderless
approaches mature.

At the end of the day, there is always a trade-off between trying to do
something today and waiting for perfection, but in the case of Raft, in
my personal opinion, the balance is just right.

--
Konstantin Osipov, Moscow, Russia
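To illustrate the point above about isolating the protocol from storage
and transport, here is a minimal sketch of what such an interface could
look like (all names are hypothetical; this is not an existing API):
the core state machine only consumes and produces messages, while
durable writes and networking are callbacks supplied by the host, which
is what makes randomized and fault-injection testing of the protocol
practical.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A protocol message; the core never touches sockets itself. */
    typedef struct RaftMsg
    {
        uint64_t    term;
        uint64_t    log_index;
        int         from;
        int         to;
        const void *payload;
        size_t      payload_len;
    } RaftMsg;

    /* Host-provided callbacks: durability and networking live outside
     * the protocol, so the state machine stays deterministic. */
    typedef struct RaftIO
    {
        bool        (*persist_hard_state) (uint64_t term, int voted_for);
        bool        (*append_entries) (const void *entries, size_t nentries);
        void        (*send) (const RaftMsg *msg);
    } RaftIO;

    /* The pure protocol step: consume one inbound message and emit
     * outbound messages and persistence requests only via callbacks. */
    void raft_step(void *raft_state, const RaftMsg *in, const RaftIO *io);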
* Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> [25/04/16 11:06]:
> > My view is that what Konstantin wants is automatic replication topology
> > management. For some reason this technology is called HA, DCS, Raft,
> > Paxos and many other scary words. But basically it manages
> > primary_conninfo of some nodes to provide some fault-tolerance
> > properties. I'd start the design from here, not from the Raft paper.
>
> In my experience, the load of managing hundreds of replicas which all
> participate in the Raft protocol becomes more than the regular
> transaction load. So making every replica a Raft participant will
> affect the ability to deploy hundreds of replicas.

I think this experience needs to be detailed. There are implementations
in the field that are less efficient than others. Early etcd-raft
didn't have pre-voting and had a "bastardized" (their own definition)
implementation of configuration changes which didn't use joint
consensus.

Then there is a liveness issue if leader election is implemented in a
straightforward way in large clusters. But this is addressed by scaling
up the randomized election timeout with the cluster size and by
converting most of the participants to non-voters in large clusters.

Raft replication, again, if implemented in a naive way, would require
O(outstanding transactions * number of replicas) RAM. But it doesn't
have to be naive.

To sum up, I am not aware of any principal limitations in this area.

--
Konstantin Osipov, Moscow, Russia
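The liveness mitigation mentioned above - scaling up the randomized
election timeout with the cluster size - can be illustrated with a toy
sketch (the constants are made up, not taken from any implementation):

    #include <stdlib.h>

    /* Widen the randomized election-timeout window as the voter set
     * grows, so large clusters are less likely to produce dueling
     * candidates after a leader failure. */
    int
    election_timeout_ms(int n_voters)
    {
        int     base_ms = 150;                      /* lower bound */
        int     spread_ms = 150 + 50 * n_voters;    /* grows with cluster size */

        return base_ms + rand() % spread_ms;
    }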
* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> I think that's what Konstantin is proposing: to have our own Raft
> >> implementation, without dependencies.
> >
> > Hmm, OK. I thought that the proposal involved relying on some existing
> > code, but re-reading the thread, that was said nowhere. Still, that
> > moves it from a large project to a really large project :-(
> >
> > I continue to think that it'd be best to try to implement it as
> > an extension, at least up till the point of finding show-stopping
> > reasons why it cannot be that.
>
> I think I can provide some reasons why it can be neither an extension
> nor any part running within the postmaster's reign:
>
> 1. When joining the cluster, there's no PGDATA to run postmaster on top
> of.
>
> 2. After failover, the old primary node must rejoin the cluster by
> running pg_rewind and following the timeline switch.
>
> The system in hand must be able to manipulate PGDATA without starting
> Postgres.
>
> My question to Konstantin is: why wouldn't you just add Raft to
> Patroni? Is there a reason why something like Patroni is not in core
> and no one rushes to get it in? Everyone is using it, or a system like
> it.

Raft uses the same WAL to store configuration change records as is used
for commit records. This is at the core of the correctness of the
algorithm. This is also my biggest concern with the correctness of
Patroni - but to the best of my knowledge 90%+ of Patroni use cases use
a "fixed" quorum size that is defined at the start of the deployment
and never/rarely changes.

Contrast that with being able to add a replica to the quorum at any
time, where all it takes is starting the replica and pointing it at the
existing cluster. This greatly simplifies the UX.

--
Konstantin Osipov, Moscow, Russia
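For illustration, a sketch of what "configuration changes live in the
same log as commit records" means in Raft terms (the names are invented
for the example; in the proposal the shared log would simply be the
WAL):

    #include <stdint.h>

    /* Every replicated entry, whether data or a membership change,
     * goes through one ordered log; a configuration change is
     * committed exactly like a data entry, which is what the
     * correctness argument relies on. */
    typedef enum RaftEntryType
    {
        RAFT_ENTRY_DATA,            /* ordinary payload, e.g. a WAL record */
        RAFT_ENTRY_CONF_CHANGE      /* add/remove voter, joint-consensus step */
    } RaftEntryType;

    typedef struct RaftLogEntry
    {
        uint64_t        term;       /* term of the appending leader */
        uint64_t        index;      /* position in the single shared log */
        RaftEntryType   type;
        /* payload follows */
    } RaftLogEntry;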
* Andrey Borodin <x4mmm@yandex-team.ru> [25/04/16 11:06]:
> > You can run bash from an extension, so what's the point?
>
> You cannot run bash that will stop the backend running that bash.

You're right, there is a chicken-and-egg problem when you add Raft to
an existing project, and re-bootstrap becomes a trick - but it's a
plumbing trick. The new member needs to generate and persist a globally
unique identifier as the first step. Later it can reintroduce itself to
the cluster, given that this identifier can be preserved in the new
incarnation (popen + fork).

--
Konstantin Osipov, Moscow, Russia
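A small sketch of the re-bootstrap step described above (the file name,
format and function are hypothetical): mint an identifier once, keep it
outside the data directory that is about to be rebuilt, and present the
same identity when the node reintroduces itself to the cluster.

    #include <stdio.h>
    #include <time.h>

    /* Load a previously persisted node id, or create one on first
     * start.  In practice this would be a UUID stored outside PGDATA
     * so that pg_basebackup/pg_rewind cannot wipe it. */
    int
    load_or_create_node_id(const char *path, char *id, size_t idlen)
    {
        FILE   *f = fopen(path, "r");

        if (f != NULL)
        {
            if (fgets(id, (int) idlen, f) == NULL)
                id[0] = '\0';
            fclose(f);
            return 0;
        }

        /* first start: generate something unique (a UUID in practice) */
        snprintf(id, idlen, "node-%ld", (long) time(NULL));
        f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs(id, f);
        fclose(f);
        return 0;
    }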