Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Horizontal scalability/sharding
Date
Msg-id 55E7388A.6030300@agliodbs.com
Whole thread Raw
In response to Horizontal scalability/sharding  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Horizontal scalability/sharding  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?  2. a. how do we allow an out-of-sync node to
"catchup"?
 
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)
Next
From: Merlin Moncure
Date:
Subject: Re: Horizontal scalability/sharding