Re: Horizontal scalability/sharding - Mailing list pgsql-hackers
From: Josh Berkus
Subject: Re: Horizontal scalability/sharding
Date:
Msg-id: 55E62205.7000808@agliodbs.com
In response to: Horizontal scalability/sharding (Bruce Momjian <bruce@momjian.us>)
Responses: Re: Horizontal scalability/sharding
List: pgsql-hackers
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
> Hi,
>
> On 09/01/2015 09:19 PM, Josh Berkus wrote:
>> Other way around, that is, having replication standbys as the only
>> method of redundancy requires either high data loss or high latency
>> for all writes.
>
> I haven't said that. I said that we should allow that topology, not
> that it should be the only method of redundancy.

Ah, OK, I didn't understand you.  Of course I'm in favor of supporting
both methods of redundancy if we can.

>> In the case of sync rep, we are required to wait for at least double
>> network lag time in order to do a single write ... making
>> write-scalability quite difficult.
>
> Which assumes that latency (or rather the increase due to syncrep) is
> a problem for the use case. Which may be the case for many use cases,
> but certainly is not a problem for many BI/DWH use cases performing
> mostly large batch loads. In those cases the network bandwidth may be
> quite an important resource.

I'll argue that BI/DW is the least interesting use case for mainstream
PostgreSQL, because there are production-quality forks which already do
this (mostly proprietary, but we can work on that).  We really need a
solution which works for OLTP.

> For example assume that there are just two shards in two separate
> data centers, connected by a link with limited bandwidth. Now, let's
> assume you always keep a local replica for failover. So you have
> A1+A2 in DC1, B1+B2 in DC2. If you're in DC1, then writing data to B1
> means you also have to write data to B2 and wait for it. So either
> you send the data to each node separately (consuming 2x the
> bandwidth), or send it to B1 and let it propagate to B2 e.g. through
> sync rep.
>
> So while you may be right in single-DC deployments, with multi-DC
> deployments the situation is quite different - not only that the
> network bandwidth is not unlimited, but because latencies within a DC
> may be a fraction of latencies between the locations (to the extent
> that the increase due to syncrep may be just noise). So the local
> replication may actually be way faster.

I'm not seeing how the above is any better with syncrep than with
shard copying.

> I can imagine forwarding the data between B1 and B2 even with a
> purely sharding solution, but at that point you effectively
> re-implemented syncrep.

Not really; the mechanism is different and the behavior is different.
One critical deficiency of binary syncrep is that you can't do
round-robin redundancy at all: every redundant node has to be an exact
mirror of another node.  In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of a node failure you're not
dumping the full load onto one other server.
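To make that concrete, here is a rough sketch of the kind of placement
I mean -- plain Python, with every name (place_shards, node1, etc.)
invented purely for illustration, not a proposal for how the sharding
layer would actually be configured:

    # Illustrative sketch only -- plain Python, not PostgreSQL code.  Every
    # shard gets a primary and a copy on distinct nodes, and the copy
    # offset rotates so that no node is a wholesale mirror of another.

    def place_shards(nodes, num_shards):
        n = len(nodes)
        placement = {}                   # shard -> (primary_node, copy_node)
        for shard in range(num_shards):
            primary = shard % n
            # rotate the replica offset so one node's shards are copied to
            # different peers, not always the same neighbour
            offset = 1 + (shard // n) % (n - 1)
            copy = (primary + offset) % n
            placement[shard] = (nodes[primary], nodes[copy])
        return placement

    if __name__ == "__main__":
        layout = place_shards(["node1", "node2", "node3", "node4"],
                              num_shards=8)
        for shard, (primary, copy) in layout.items():
            print(f"shard {shard}: primary={primary}  copy={copy}")
        # If node1 dies, the shards it hosted fail over to node2 and node3,
        # not wholesale onto a single standby.

The point is simply that when a node dies, its shards fail over to
several different peers, which exact-mirror syncrep can't give you.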
> IMHO the design has to address the multi-DC setups somehow. I think
> that many of the customers who are so concerned about scaling to many
> shards are also concerned about availability in case of DC outages,
> no?

Certainly.  But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.

> I don't follow. With sync rep we do know whether the copy is OK or
> not, because the node either confirms writes or not. The failover
> certainly is more complicated and is not immediate (to the extent of
> keeping a copy at the sharding level), but it's a question of
> trade-offs.
>
> It's true we don't have an auto-failover solution at the moment, but
> as I said - I can easily imagine most people using just sharding,
> while some deployments use syncrep with manual failover.

As long as direct shard copying is available, I'm happy; I have no
complaints about additional mechanisms.  I'm bringing this up because
the FDW proposal made at pgCon did not include *any* mechanism for
HA/redundancy, just some handwaving about replication and/or BDR.  This
was one of the critical design failures of PostgresXC.  A multinode
system without automated node failover and replacement is a
low-availability system.

>> If we write to multiple copies as a part of the sharding feature,
>> then that can be parallelized, so that we are waiting only as long
>> as the slowest write (or in failure cases, as long as the shard
>> timeout).  Further, we can check for shard-copy health and update
>> shard availability data with each user request, so that the ability
>> to see stale/bad data is minimized.
>
> Again, this assumes infinite network bandwidth.

In what way is the total network bandwidth used in the system any
different for shard copying than for sync replication?
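For what it's worth, here is the rough shape of the parallelized
copy-write I'm describing, again as an illustrative Python sketch:
write_to(), SHARD_TIMEOUT, and the node handling are all hypothetical
stand-ins, not any existing PostgreSQL or FDW API.

    # Illustrative sketch only: fan out one shard write to every copy in
    # parallel and wait for the slowest confirmation (or the shard
    # timeout).  write_to() and SHARD_TIMEOUT are hypothetical stand-ins.

    from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                    TimeoutError)

    SHARD_TIMEOUT = 5.0        # per-shard write deadline, in seconds

    def write_to(node, data):
        """Placeholder: send 'data' to one shard copy and wait for its ack."""
        pass

    def write_shard(copies, data):
        # Each copy still receives the data exactly once -- the same volume
        # sync rep would ship to a standby -- but the sends happen in
        # parallel, so the wait is bounded by the slowest copy rather than
        # a primary-then-standby round trip.
        healthy = []
        with ThreadPoolExecutor(max_workers=len(copies)) as pool:
            futures = {pool.submit(write_to, node, data): node
                       for node in copies}
            try:
                for fut in as_completed(futures, timeout=SHARD_TIMEOUT):
                    if fut.exception() is None:
                        healthy.append(futures[fut])
            except TimeoutError:
                pass   # copies that missed the deadline are left out
        # Copies not in 'healthy' would be marked stale in the shard
        # availability data so later reads avoid them until they catch up.
        return healthy

Either way each copy receives the data exactly once, which is why I
don't see the bandwidth argument.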
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com