Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Horizontal scalability/sharding
Msg-id 55E62205.7000808@agliodbs.com
In response to Horizontal scalability/sharding  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Horizontal scalability/sharding  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
> Hi,
> 
> On 09/01/2015 09:19 PM, Josh Berkus wrote:
>> Other way around, that is, having replication standbys as the only
>> method of redundancy requires either high data loss or high latency
>> for all writes.
> 
> I didn't say that. I said that we should allow that topology, not that
> it should be the only method of redundancy.

Ah, OK, I didn't understand you.  Of course I'm in favor of supporting
both methods of redundancy if we can.

>> In the case of sync rep, we are required to wait for at least double
>>  network lag time in order to do a single write ... making
>> write-scalability quite difficult.
> 
> That assumes that latency (or rather the increase due to syncrep) is a
> problem for the use case. That may be true for many use cases, but it
> certainly is not a problem for many BI/DWH use cases performing mostly
> large batch loads. In those cases the network bandwidth may be quite an
> important resource.

I'll argue that BI/DW is the least interesting use case for mainstream
PostgreSQL because there are production-quality forks which do this
(mostly proprietary, but we can work on that).  We really need a solution
which works for OLTP.

> For example, assume that there are just two shards in two separate data
> centers, connected by a link with limited bandwidth. Now, let's assume
> you always keep a local replica for failover. So you have A1+A2 in DC1,
> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
> have to write data to B2 and wait for it. So either you send the data to
> each node separately (consuming 2x the bandwidth), or send it to B1 and
> let it propagate to B2 e.g. through sync rep.
> 
> So while you may be right in single-DC deployments, with multi-DC
> deployments the situation is quite different - not only is the network
> bandwidth not unlimited, but latencies within a DC may be a fraction of
> the latencies between the locations (to the extent that the increase due
> to syncrep may be just noise). So the local replication may actually be
> way faster.

I'm not seeing how the above is any better with syncrep than with shard
copying.
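
Just to make sure I'm reading the example the same way, the
back-of-the-envelope arithmetic looks roughly like this (Python, purely
illustrative; all of the numbers are made-up assumptions, not
measurements from any real deployment):

# Made-up numbers, for illustration only.
INTRA_DC_RTT = 0.5    # ms round trip within one data center (assumed)
INTER_DC_RTT = 40.0   # ms round trip between DC1 and DC2 (assumed)
PAYLOAD_MB   = 10.0   # size of one batch write (assumed)

# Shard copying: the coordinator in DC1 writes to B1 and B2 directly.
# The writes go out in parallel, but both cross the inter-DC link.
shard_copy_latency = INTER_DC_RTT            # slower of two parallel WAN writes
shard_copy_wan_mb  = 2 * PAYLOAD_MB          # payload crosses the WAN twice

# Sync rep: the coordinator writes to B1, which waits on its local standby B2.
# Only one copy crosses the WAN, but the WAN hop and the local hop serialize.
sync_rep_latency = INTER_DC_RTT + INTRA_DC_RTT
sync_rep_wan_mb  = PAYLOAD_MB                # payload crosses the WAN once

print(f"shard copying: ~{shard_copy_latency} ms/write, ~{shard_copy_wan_mb} MB over the WAN")
print(f"sync rep:      ~{sync_rep_latency} ms/write, ~{sync_rep_wan_mb} MB over the WAN")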

> I can imagine forwarding the data between B1 and B2 even with a purely
> sharding solution, but at that point you effectively re-implemented
> syncrep.

Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node.  In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.
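
Purely as an illustration of what I mean by spreading the redundancy
around, here's a made-up placement function (not any existing API):

# Sketch only: place each shard's replica so that two shards sharing a
# primary node do NOT share a replica node.  On a node failure, the failed
# node's traffic then spreads across several survivors instead of landing
# on a single mirror.

def place_shards(num_shards, nodes):
    """Return {shard_id: (primary_node, replica_node)}.

    Assumes num_shards <= len(nodes) * (len(nodes) - 1).
    """
    n = len(nodes)
    placement = {}
    for shard in range(num_shards):
        primary = shard % n
        offset = 1 + (shard // n) % (n - 1)   # never 0, so replica != primary
        replica = (primary + offset) % n
        placement[shard] = (nodes[primary], nodes[replica])
    return placement

for shard, (p, r) in place_shards(8, ["node1", "node2", "node3", "node4"]).items():
    print(f"shard {shard}: primary={p} replica={r}")

With eight shards on four nodes, node1's two primaries end up replicated
on node2 and node3, so losing node1 doesn't double the load on any single
surviving node.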

> IMHO the design has to address the multi-DC setups somehow. I think that
> many of the customers who are so concerned about scaling to many shards
> are also concerned about availability in case of DC outages, no?

Certainly.  But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.

> I don't follow. With sync rep we do know whether the copy is OK or not,
> because the node either confirms writes or not. The failover certainly
> is more complicated and is not immediate (to the extent of keeping a
> copy at the sharding level), but it's a question of trade-offs.
> 
> It's true we don't have an auto-failover solution at the moment, but as
> I said, I can easily imagine most people using just sharding, while some
> deployments use syncrep with manual failover.

As long as direct shard copying is available, I'm happy.  I have no
complaints about additional mechanisms.

I'm bringing this up because the FDW proposal made at PGCon did not
include *any* mechanism for HA/redundancy, just some handwaving about
replication and/or BDR. This was one of the critical design failures of
Postgres-XC.  A multi-node system without automated node failover and
replacement is a low-availability system.

>> If we write to multiple copies as a part of the sharding feature,
>> then that can be parallelized, so that we are waiting only as long as
>> the slowest write (or in failure cases, as long as the shard
>> timeout). Further, we can check for shard-copy health and update
>> shard availability data with each user request, so that the ability
>> to see stale/bad data is minimized.
> 
> Again, this assumes infinite network bandwidth.

In what way is the total network bandwidth used in the system different
for shard copying than for sync replication?
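
To be concrete about the mechanism I described above (write all copies in
parallel, wait only for the slowest write or the shard timeout, and record
copy health as we go), here's a rough sketch; write_to_copy() is a
hypothetical stand-in for whatever ships the data to one copy, and nothing
here is an existing PostgreSQL API:

import concurrent.futures

SHARD_TIMEOUT = 2.0  # seconds; made-up number

def write_to_copy(copy_addr, payload):
    """Hypothetical hook: send the payload to one copy; raise on failure."""
    ...

def write_shard(copies, payload, health):
    """Write payload to every copy of a shard in parallel.

    'health' is a dict we update so later requests can route around copies
    that failed or timed out (i.e. possibly-stale copies).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(copies))
    futures = {pool.submit(write_to_copy, c, payload): c for c in copies}
    done, not_done = concurrent.futures.wait(futures, timeout=SHARD_TIMEOUT)
    for fut in done:
        health[futures[fut]] = fut.exception() is None  # failed write: unhealthy
    for fut in not_done:
        health[futures[fut]] = False                     # hit the shard timeout
    pool.shutdown(wait=False, cancel_futures=True)       # don't block on stragglers
    if not any(health[c] for c in copies):
        raise RuntimeError("no copy of the shard acknowledged the write")

The point is that with N copies the client waits for the slowest copy (or
the timeout), not the sum of the copy latencies, and the health map gives
later reads a way to avoid a copy that might be stale.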

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


