Re: Horizontal scalability/sharding - Mailing list pgsql-hackers
From: Josh Berkus
Subject: Re: Horizontal scalability/sharding
Date:
Msg-id: 55E74808.6050609@agliodbs.com
In response to: Horizontal scalability/sharding (Bruce Momjian <bruce@momjian.us>)
Responses:
  Re: Horizontal scalability/sharding
  Re: Horizontal scalability/sharding
  Re: Horizontal scalability/sharding
List: pgsql-hackers
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>> 2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
>
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
>
> 1. None of this stuff matters very much when the data is strictly
> read-only.

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason.

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing
and removing from replication any node which repeatedly fails to sync,
which would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
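For illustration only, the kind of lag check a connection node could poll
on the writing node would look roughly like this (column and function
names as of 9.4/9.5; this is a sketch, not a worked-out design):

    -- Run on the writing node: how far behind is each streaming replica?
    SELECT application_name,
           state,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 replay_location) AS replay_lag_bytes
    FROM pg_stat_replication;

In the synchronous model any nonzero replay_lag_bytes would count as
"fallen behind at all"; in the asynchronous model you'd compare it
against some configured threshold (16MB, say -- the number is made up)
before pulling the replica out of read routing.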
> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com