Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: Horizontal scalability/sharding
Date
Msg-id CALLjQTR1D=0kxBsa1Ba-LDDJdkhxpw61YVE3_Dt6xTBN2ag3Qw@mail.gmail.com
In response to Re: Horizontal scalability/sharding  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
On 2015-09-02 00:09, Josh Berkus wrote:
> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>> For example assume that there are just two shards in two separate data
>> centers, connected by a link with limited bandwidth. Now, let's assume
>> you always keep a local replica for failover. So you have A1+A2 in DC1,
>> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
>> have to write data to B2 and wait for it. So either you send the data to
>> each node separately (consuming 2x the bandwidth), or send it to B1 and
>> let it propagate to B2 e.g. through sync rep.
>>
>> So while you may be right in single-DC deployments, with multi-DC
>> deployments the situation is quite different - not only that the network
>> bandwidth is not unlimited, but because latencies within DC may be a
>> fraction of latencies between the locations (to the extent that the
>> increase due to syncrep may be just noise). So the local replication may
>> be actually way faster.
>
> I'm not seeing how the above is better using syncrep than using shard
> copying?

Shard copying usually assumes that the origin node does the copy, so the
data has to cross the slow connection twice. With replication you can
replicate locally over the fast connection.
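
To put rough numbers on that (a minimal sketch in Python; the write size
is made up, and the A1+A2/B1+B2 layout is just Tomas's hypothetical
scenario):

# Hypothetical model: write originates in DC1, the shard lives on
# B1+B2 in DC2, and the DCs are linked by a slow connection.
WRITE_SIZE_MB = 10

def origin_copies_both(write_mb):
    # Origin sends the data to B1 and B2 separately: the payload
    # crosses the slow inter-DC link twice.
    return {"cross_dc_mb": 2 * write_mb, "local_dc2_mb": 0}

def syncrep_forwards(write_mb):
    # Origin sends to B1 once; B1 replicates to B2 over the fast
    # local network (e.g. via sync rep).
    return {"cross_dc_mb": write_mb, "local_dc2_mb": write_mb}

print(origin_copies_both(WRITE_SIZE_MB))  # cross-DC traffic: 20 MB
print(syncrep_forwards(WRITE_SIZE_MB))    # cross-DC traffic: 10 MB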

>
>> I can imagine forwarding the data between B1 and B2 even with a purely
>> sharding solution, but at that point you effectively re-implemented
>> syncrep.
>
> Not really, the mechanism is different and the behavior is different.
> One critical deficiency in using binary syncrep is that you can't do
> round-robin redundancy at all; every redundant node has to be an exact
> mirror of another node.  In a good HA distributed system, you want
> multiple shards per node, and you want each shard to be replicated to a
> different node, so that in the event of node failure you're not dumping
> the full load on one other server.
>

This assumes binary replication, but we can reasonably use logical
replication, which can quite easily filter what gets replicated where.
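
For illustration, here is a toy round-robin placement along the lines Josh
describes (my own sketch in Python, not existing code): several shards per
node, with each shard's replica staggered onto a different node, so a
failed node's load spreads across the survivors instead of landing on one
mirror:

# Toy placement: NODES and SHARDS are made-up parameters.
NODES = 4
SHARDS = 8

placement = {}  # shard -> (primary_node, replica_node)
for shard in range(SHARDS):
    primary = shard % NODES
    # Stagger replicas so two shards sharing a primary do not
    # also share a replica node.
    offset = 1 + (shard // NODES) % (NODES - 1)
    placement[shard] = (primary, (primary + offset) % NODES)

# If node 0 dies, its shards fail over to *different* nodes:
print([r for (p, r) in placement.values() if p == 0])  # [1, 2]

Logical replication makes exactly this kind of per-shard filtering
straightforward, since each node can subscribe only to the shards it
replicates.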

>> IMHO the design has to address the multi-DC setups somehow. I think that
>> many of the customers who are so concerned about scaling to many shards
>> are also concerned about availability in case of DC outages, no?
>
> Certainly.  But users located in a single DC shouldn't pay the same
> overhead as users who are geographically spread.
>

Agreed, so we should support both approaches, but I don't think both need
to be there in version 0.1. It's just important not to paint ourselves into
a corner with design decisions that would make one of them impossible.


>>> If we write to multiple copies as a part of the sharding feature,
>>> then that can be parallelized, so that we are waiting only as long as
>>> the slowest write (or in failure cases, as long as the shard
>>> timeout). Further, we can check for shard-copy health and update
>>> shard availability data with each user request, so that the ability
>>> to see stale/bad data is minimized.
>>
>> Again, this assumes infinite network bandwidth.
>
> In what way is the total network bandwidth used in the system different
> for shard copying than for sync replication?
>

Again, when shards are distributed over multiple DCs (or even multiple
racks), the bandwidth and latency of the local copy will be much better
than those of the remote copy, so local replication can have a much lower
impact on cluster performance than a remote shard copy.
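
A rough latency illustration (hypothetical numbers; say ~0.2 ms within a
DC, ~40 ms between DCs): with parallel writes you wait for the slowest
copy, so both schemes are bounded by the inter-DC round trip, but the
syncrep variant sends the data across the slow link only once:

# Hypothetical round-trip times, purely for illustration.
LOCAL_RTT_MS = 0.2
CROSS_DC_RTT_MS = 40.0

# Origin-driven shard copy: parallel writes to B1 and B2 both cross
# the slow link; we wait for the slowest of the two.
shard_copy_wait = max(CROSS_DC_RTT_MS, CROSS_DC_RTT_MS)

# Write to B1 once; B1 sync-replicates to B2 locally. The extra
# local hop is noise next to the inter-DC latency.
syncrep_wait = CROSS_DC_RTT_MS + LOCAL_RTT_MS

print(shard_copy_wait, syncrep_wait)  # 40.0 vs 40.2 ms -- roughly a
# wash on latency, but syncrep halves the traffic on the slow link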

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


