Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Horizontal scalability/sharding
Date
Msg-id 55E618D0.1090203@2ndquadrant.com
Whole thread Raw
In response to Re: Horizontal scalability/sharding  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
Hi,

On 09/01/2015 09:19 PM, Josh Berkus wrote:
> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>>> We want multiple copies of shards created by the sharding system
>>> itself. Having a separate, and completely orthogonal, redundancy
>>> system to the sharding system is overly burdensome on the DBA and
>>> makes low-data-loss HA impossible.
>>
>> IMHO it'd be quite unfortunate if the design would make it
>> impossible to combine those two features (e.g. creating standbys
>> for shards and failing over to them).
>>
>> It's true that solving HA at the sharding level (by keeping
>> multiple copies of each shard) may be simpler than combining
>> sharding and standbys, but I don't see why it makes low-data-loss
>> HA impossible.
>
> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency
> for all writes.

I haven't said that. I said that we should allow that topology, not that 
it should be the only method of redundancy.

>
> In the case of async rep, every time we fail over a node, the entire
> cluster would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
>  network lag time in order to do a single write ... making
> write-scalability quite difficult.

Which assumes that the latency increase due to sync rep is a problem 
for the use case. That may be true for many workloads, but it certainly 
is not a problem for many BI/DWH use cases that perform mostly large 
batch loads. In those cases network bandwidth may be the more important 
resource.

For example, assume there are just two shards in two separate data 
centers, connected by a link with limited bandwidth. Now, let's assume 
you always keep a local replica for failover. So you have A1+A2 in DC1, 
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also 
have to write data to B2 and wait for it. So either you send the data to 
each node separately (consuming 2x the bandwidth), or send it to B1 and 
let it propagate to B2 e.g. through sync rep.
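To put rough numbers on the bandwidth argument (purely illustrative, in Python; the function names and the 10 GB batch size are my own, not from the thread):

```python
# Back-of-the-envelope model: bytes crossing the DC1 -> DC2 WAN link
# when a client in DC1 writes `payload` bytes to shard B.

def wan_bytes_multiplexed(payload):
    # The sharding layer sends the write to B1 and B2 separately,
    # so both copies cross the inter-DC link.
    return 2 * payload

def wan_bytes_chained_syncrep(payload):
    # The write crosses the WAN once to B1; B1 then forwards it to
    # B2 over the cheap intra-DC link via synchronous replication.
    return payload

batch = 10 * 1024**3  # a 10 GB batch load
print(wan_bytes_multiplexed(batch) // 1024**3)     # 20 GB over the WAN
print(wan_bytes_chained_syncrep(batch) // 1024**3) # 10 GB over the WAN
```

So for the same durability guarantee, the chained sync rep topology halves the traffic on the constrained inter-DC link.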

So while you may be right about single-DC deployments, the situation in 
multi-DC deployments is quite different - not only is the network 
bandwidth limited, but latencies within a DC may be a fraction of the 
latencies between the locations (to the extent that the increase due to 
sync rep may be just noise). So the local replication may actually be 
way faster.
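The latency side of the argument, with assumed (not measured) round-trip times, just to show the orders of magnitude involved:

```python
# Illustrative RTTs - the actual numbers depend on the deployment.
INTER_DC_RTT_MS = 40.0  # assumed WAN round trip between DC1 and DC2
INTRA_DC_RTT_MS = 0.3   # assumed LAN round trip within one DC

def write_latency_local_standby():
    # Client in DC1 writes to B1 in DC2 (one WAN round trip), and B1
    # confirms sync rep to its local standby B2 (one LAN round trip).
    return INTER_DC_RTT_MS + INTRA_DC_RTT_MS

def write_latency_remote_standby():
    # Standby placed in the other DC: the write pays two WAN round
    # trips - the "double network lag" case.
    return 2 * INTER_DC_RTT_MS

print(write_latency_local_standby())   # ~40.3 ms
print(write_latency_remote_standby())  # ~80.0 ms
```

With these assumed numbers, the local sync rep standby adds well under one percent to the write latency the client already pays for the WAN hop.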

I can imagine forwarding the data between B1 and B2 even with a purely 
sharding solution, but at that point you have effectively re-implemented 
sync rep.

IMHO the design has to address the multi-DC setups somehow. I think that 
many of the customers who are so concerned about scaling to many shards 
are also concerned about availability in case of DC outages, no?

We should also consider support for custom topologies (not just a full 
mesh, or whatever we choose as the default/initial topology), which is 
somehow related.

>
> Futher, if using replication the sharding system would have no way
> to (a) find out immediately if a copy was bad and (b) fail over
> quickly to a copy of the shard if the first requested copy was not
> responding. With async replication, we also can't use multiple copies
> of the same shard as a way to balance read workloads.

I don't follow. With sync rep we do know whether the copy is OK or not, 
because the node either confirms writes or not. The failover certainly 
is more complicated and not as immediate (compared to keeping a copy at 
the sharding level), but it's a question of trade-offs.

It's true we don't have an auto-failover solution at the moment, but as 
I said - I can easily imagine most people using just sharding, while 
some deployments use sync rep with manual failover.

>
> If we write to multiple copies as a part of the sharding feature,
> then that can be parallelized, so that we are waiting only as long as
> the slowest write (or in failure cases, as long as the shard
> timeout). Further, we can check for shard-copy health and update
> shard availability data with each user request, so that the ability
> to see stale/bad data is minimized.

Again, this assumes infinite network bandwidth.
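For reference, the multiplexed-write idea quoted above could be sketched roughly like this (a toy illustration, not pg_shard's actual code; `write_copy` stands in for a real network write):

```python
import concurrent.futures
import time

def write_copy(copy_name, delay_s):
    # Stand-in for a network write to one shard copy; the delay
    # simulates network latency to that copy.
    time.sleep(delay_s)
    return copy_name

def write_all_copies(copies, timeout_s=1.0):
    # Issue the write to every copy in parallel, so the caller waits
    # only as long as the slowest copy (or the timeout), not the sum.
    confirmed = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(write_copy, name, d) for name, d in copies]
        try:
            for fut in concurrent.futures.as_completed(futures,
                                                       timeout=timeout_s):
                confirmed.append(fut.result())
        except concurrent.futures.TimeoutError:
            pass  # copies that missed the timeout count as failed
    return confirmed

# Both copies confirm; total wait is ~the slower write, not the sum.
print(sorted(write_all_copies([("B1", 0.01), ("B2", 0.05)])))
```

Note that every confirmed write still crosses the network once per copy, which is exactly where the bandwidth cost in the multi-DC case comes from.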

>
> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit. But I really think
> that solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of
> the sharding system if you're aiming for write scalability.

I haven't mentioned BDR at all, and given the async nature I don't have 
a clear idea of how it fits into the sharding world at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


