Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Horizontal scalability/sharding
Date
Msg-id CA+TgmoZuCo5qwFPtZhTFLyjU0LZ7zscMqpMkRvaWpCpWo=_Q9Q@mail.gmail.com
Whole thread Raw
In response to Re: Horizontal scalability/sharding  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Horizontal scalability/sharding
Re: Horizontal scalability/sharding
List pgsql-hackers
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>    2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking anew copy.  Nodes never need to catch up
because there are no changes happening.  To make bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: pgbench stats per script & other stuff
Next
From: Robert Haas
Date:
Subject: Re: RFC: replace pg_stat_activity.waiting with something more descriptive