Re: Horizontal scalability/sharding - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Horizontal scalability/sharding
Msg-id: CA+TgmoZuCo5qwFPtZhTFLyjU0LZ7zscMqpMkRvaWpCpWo=_Q9Q@mail.gmail.com
In response to: Re: Horizontal scalability/sharding (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Horizontal scalability/sharding; Re: Horizontal scalability/sharding
List: pgsql-hackers
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
> 2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it. Let me make a few observations:

1. None of this stuff matters very much when the data is strictly read-only. You don't lose any data because you made enough copies at some point in the distant past to ensure that you wouldn't. You replace a lost node by taking a new copy. Nodes never need to catch up because there are no changes happening. To bring up a new node, you make a copy of an existing node (which doesn't change in the meantime). So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data. Your system is low-availability, but you just don't care for whatever reason. The issue arises when you have multiple copies of the data, and the data is being changed. Now you have to worry about the copies getting out of sync with each other, especially when failures happen.

3. IIUC, Postgres-XC handles this problem by reducing at least volatile functions, maybe all functions, to constants. Then it generates an SQL statement to be sent to the data node to make the appropriate change. If there's more than one copy of the data, we send a separate copy of the SQL statement to every node.
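To make the statement-multicast model concrete, here is a toy Python sketch of the approach just described: volatile expressions are assumed to have been reduced to constants beforehand, and the same statement text is then sent to every copy of the data. The class and function names are illustrative only, not Postgres-XC APIs; the point is to show how one unreachable node leaves the surviving copies out of sync.

```python
# Toy sketch of statement-multicast replication (names are hypothetical,
# not Postgres-XC APIs). Each replica applies its own copy of the SQL text.

class Node:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.statements = []          # log of statements applied on this node

    def execute(self, sql):
        if not self.online:
            raise ConnectionError(f"{self.name} is down")
        self.statements.append(sql)

def multicast_write(replicas, sql):
    """Send the same statement to every replica. If any node is
    unreachable mid-multicast, the copies diverge -- the weakness
    discussed in the surrounding text."""
    for node in replicas:
        node.execute(sql)             # raises if a node is unreachable

replicas = [Node("data1"), Node("data2")]
# Volatile functions must be reduced to constants *before* multicast;
# otherwise now()/random() would evaluate differently on each copy.
multicast_write(replicas, "INSERT INTO t VALUES (42, '2015-09-02')")

replicas[1].online = False
try:
    multicast_write(replicas, "DELETE FROM t WHERE id = 42")
except ConnectionError:
    pass                              # data1 applied it, data2 did not

assert replicas[0].statements != replicas[1].statements   # copies diverged
```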
I'm not sure exactly what happens if some of those nodes are not available, but I don't think it's anything good. Fundamentally, this model doesn't allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication, which might be either synchronous or asynchronous. When you modify one copy of the data, that change will then be replicated to all other nodes. If you are OK with eventual consistency, this replication can be asynchronous, and nodes that are off-line will catch up when they are back on-line. If you are not OK with that, then you must replicate synchronously to every node before transaction commit; or at least you must replicate synchronously to every node that is currently on-line. This presents some challenges: logical decoding currently can't replicate transactions that are still in progress - replication starts when the transaction commits. Also, we don't have any way for synchronous replication to wait for multiple nodes. But in theory those seem like limitations that can be lifted. Also, the GTM needs to be aware that this stuff is happening, or it will DTWT. That, too, seems like a problem that can be solved.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
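The asynchronous, eventually consistent option in point 4 can be sketched with a shared change log and one replay cursor per node: a node that was off-line simply resumes applying changes from its cursor when it comes back. This is a minimal illustration only, assuming an ordered log of committed changes; none of these names are real PostgreSQL logical-decoding APIs.

```python
# Minimal sketch of asynchronous catch-up replication (hypothetical names,
# not PostgreSQL APIs): a shared ordered log of committed changes, and a
# per-replica cursor marking the next log position to apply.

class ChangeLog:
    def __init__(self):
        self.entries = []             # ordered, committed changes only

    def append(self, change):
        self.entries.append(change)

class Replica:
    def __init__(self, log):
        self.log = log
        self.cursor = 0               # next log position to apply
        self.state = []               # changes applied on this replica

    def catch_up(self):
        """Apply every change committed since this replica last replayed.
        An off-line node just calls this when it rejoins."""
        while self.cursor < len(self.log.entries):
            self.state.append(self.log.entries[self.cursor])
            self.cursor += 1

log = ChangeLog()
a, b = Replica(log), Replica(log)

log.append("INSERT r1")
a.catch_up()                          # b is "off-line" for this write
log.append("INSERT r2")
a.catch_up()
b.catch_up()                          # b rejoins and replays both changes
assert a.state == b.state == ["INSERT r1", "INSERT r2"]
```

The synchronous variant would instead block commit until every on-line node (or some quorum) acknowledged the change; the cursor mechanism stays the same, only the wait policy differs.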