Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From: Ashutosh Bapat
Subject: Re: Horizontal scalability/sharding
Date:
Msg-id: CAFjFpRfz8TE3xZ9KRDu7v3d59h_tG+zZEX0Zd_gcAe_Eie6kow@mail.gmail.com
In response to: Re: Horizontal scalability/sharding (Josh Berkus <josh@agliodbs.com>)
List: pgsql-hackers


On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>> We want multiple copies of shards created by the sharding system itself.
>>   Having a separate, and completely orthogonal, redundancy system to the
>> sharding system is overly burdensome on the DBA and makes low-data-loss
>> HA impossible.
>
> IMHO it'd be quite unfortunate if the design would make it impossible to
> combine those two features (e.g. creating standbys for shards and
> failing over to them).
>
> It's true that solving HA at the sharding level (by keeping multiple
> copies of each shard) may be simpler than combining sharding and
> standbys, but I don't see why it makes low-data-loss HA impossible.

It's the other way around: having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.

In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.
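
For illustration, here's a minimal sketch (Python + psycopg2, with
made-up standby DSNs) of the recovery step that forces the rollback:
after losing a primary, the coordinator has to find the newest replay
point that *every* shard standby has reached, and anything committed
after that point on any shard is discarded:

import psycopg2

# Hypothetical standbys, one per shard; the DSNs are made up.
STANDBY_DSNS = [
    "host=shard1-standby dbname=app",
    "host=shard2-standby dbname=app",
]

def lsn_to_int(lsn):
    # LSNs look like '0/3000060'; compare them numerically.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def common_replay_point():
    # The newest LSN every standby has replayed; commits past this
    # point on any shard must be thrown away -- the "high data loss".
    lsns = []
    for dsn in STANDBY_DSNS:
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_last_xlog_replay_lsn()::text")
                lsns.append(cur.fetchone()[0])
    return min(lsns, key=lsn_to_int)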

In the case of sync rep, we are required to wait for at least double
the network lag time in order to do a single write ... making
write-scalability quite difficult.
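
To put rough numbers on that (the latencies are made up): a commit
cannot return before the WAL makes a round trip to the synchronous
standby, which caps per-session write throughput:

one_way_lag_ms = 5.0                  # hypothetical network latency
commit_floor_ms = 2 * one_way_lag_ms  # ship WAL out, wait for the ack
max_serial_writes = 1000.0 / commit_floor_ms
print(f"commit floor: {commit_floor_ms:.0f} ms, "
      f"<= {max_serial_writes:.0f} serial writes/sec per session")
# With 5 ms each way, a single session tops out near 100 writes/sec.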

Further, if using replication, the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.
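
For contrast, a sketch of what (a) and (b) look like when the sharding
layer owns its copies (Python, hypothetical DSNs and timeouts, not any
existing system's API):

import psycopg2

# Hypothetical map: shard id -> DSNs of that shard's copies.
SHARD_COPIES = {
    0: ["host=shard0-a dbname=app", "host=shard0-b dbname=app"],
}

def read_from_shard(shard_id, sql):
    # Try each copy in turn; a short connect timeout detects a bad
    # copy immediately, and the next copy serves the read.
    last_error = None
    for dsn in SHARD_COPIES[shard_id]:
        try:
            with psycopg2.connect(dsn, connect_timeout=1) as conn:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    return cur.fetchall()
        except psycopg2.Error as e:
            last_error = e          # mark this copy bad, move on
    raise RuntimeError("all copies of shard %s failed"
                       % shard_id) from last_error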

If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the chance of seeing
stale/bad data is minimized.
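
A sketch of that write path (Python; connection handling simplified,
all names made up): fan the same write out to every copy in parallel,
wait only for the slowest copy or the shard timeout, and demote copies
that fail so later reads skip them:

from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                TimeoutError as FuturesTimeout)
import psycopg2

SHARD_TIMEOUT_S = 2.0   # hypothetical per-shard write timeout

def write_one(dsn, sql):
    conn = psycopg2.connect(dsn)
    try:
        with conn:                       # commit on success
            with conn.cursor() as cur:
                cur.execute(sql)
    finally:
        conn.close()

def multiplexed_write(copy_dsns, sql, availability):
    # Parallel fan-out: total wait is max(copy latencies), capped by
    # SHARD_TIMEOUT_S, rather than the sum of the latencies.
    with ThreadPoolExecutor(max_workers=len(copy_dsns)) as pool:
        futures = {pool.submit(write_one, dsn, sql): dsn
                   for dsn in copy_dsns}
        try:
            for fut in as_completed(futures, timeout=SHARD_TIMEOUT_S):
                ok = fut.exception() is None
                availability[futures[fut]] = ok   # update health map
        except FuturesTimeout:
            for fut, dsn in futures.items():
                if not fut.done():
                    availability[dsn] = False     # treat as a bad copy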

XC (and I guess XL and pgPool II as well) did this by firing the same
DML statement at all the copies, after resolving any volatile
references (e.g. now()) in the DML, so that all the copies get the
same values. That method, however, needs some row identifier which can
identify the same row on all the replicas. Usually the primary key is
used as the row identifier, but not all use cases which require
replicated shards have a primary key on their sharded tables.
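
Roughly, the two steps look like this (a simplified illustration with
a made-up table, not XC's actual implementation): resolve the volatile
expression once on the coordinator, then send the same resolved
statement to every copy, keyed by the primary key:

import psycopg2

def replicate_update(coordinator_conn, copy_conns, pk_value):
    # Step 1: resolve volatile references once, on the coordinator,
    # so every copy stores the same value for now().
    with coordinator_conn.cursor() as cur:
        cur.execute("SELECT now()")
        ts = cur.fetchone()[0]
    # Step 2: fire the same resolved DML at every copy; the primary
    # key (order_id here, a made-up column) identifies the same
    # logical row on each replica.
    sql = "UPDATE orders SET updated_at = %s WHERE order_id = %s"
    for conn in copy_conns:
        with conn.cursor() as cur:
            cur.execute(sql, (ts, pk_value))
        conn.commit()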
 

There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit.  But I really think that
solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
