Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Horizontal scalability/sharding
Date
Msg-id 55E4B615.3090503@agliodbs.com
In response to Horizontal scalability/sharding  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Horizontal scalability/sharding  ("Joshua D. Drake" <jd@commandprompt.com>)
Re: Horizontal scalability/sharding  (Robert Haas <robertmhaas@gmail.com>)
Re: Horizontal scalability/sharding  (Bruce Momjian <bruce@momjian.us>)
Re: Horizontal scalability/sharding  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
All, Bruce:

First, let me put out there that I think a horizontal scaling project
which has buy-in from the community and which we're actually working on
is infinitely better than one we're not working on, or one which is an
underresourced fork.
So we're in agreement on that.  However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people.  The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.

Particularly, I'm concerned that we already have two projects in process
aimed at horizontal scalability, and it seems like we could bring either
(or both) projects to production quality MUCH faster than we could make
an FDW-based solution work.  These are:

* pg_shard
* BDR

It seems worthwhile, just as a thought experiment, to ask whether we can
get where we want faster using those, or by combining those with new FDW
features.

It's also important to recognize that there are three major use-cases
for write-scalable clustering:

* OLTP: small-medium cluster, absolute ACID consistency, bottlenecked on small writes per second
* DW: small-large cluster, ACID optional, bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional, bottlenecked on # of connections

We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which can
be adapted to all of them.

I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design.  For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.

On some other questions from Mason:

> Do we want multiple copies of shards, like the pg_shard approach? Or
> keep things simpler and leave it up to the DBA to add standbys? 

We want multiple copies of shards created by the sharding system itself.
Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.
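
To make "copies created by the sharding system itself" concrete, here is
a minimal sketch in Python. Everything in it is an assumption made for
illustration: the function name place_shards, the node names, and the
round-robin placement rule are hypothetical and are not taken from
pg_shard, BDR, or any FDW proposal.

```python
def place_shards(n_shards, nodes, copies=2):
    """Assign each shard to `copies` distinct nodes, round-robin.

    Hypothetical helper: the placement rule is an assumption chosen
    for illustration only.
    """
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for shard in range(n_shards):
        # Copy k of a shard lands k nodes after the first one, wrapping
        # around, so the copies of one shard never share a node.
        placement[shard] = [nodes[(shard + k) % len(nodes)] for k in range(copies)]
    return placement


placement = place_shards(n_shards=4, nodes=["node_a", "node_b", "node_c"], copies=2)
```

The point of the sketch is only that the layer which decides where a
shard lives also knows where its copies live, so the DBA would not have
to keep a separate standby topology in sync with the shard map.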

> Do we want to leverage table inheritance? If so, we may want to spend
> time improving performance for when the number of shards becomes large
> with what currently exists. If using table inheritance, we could add the
> ability to specify what node (er, foreign server) the subtable lives on.
> We could create top level sharding expressions that allow these to be
> implicitly created.

IMHO, given that we're looking at replacing inheritance because of its
many documented limitations, building sharding on top of inheritance
seems unwise.  For example, many sharding systems are hash-based; how
would an inheritance system transparently use hash keys?
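
To illustrate the question, here is a small sketch in Python contrasting
the two routing styles. The function names and the choice of SHA-256 are
assumptions for the example, not anything PostgreSQL does. Range routing
is a plain comparison on the key, which is the shape a per-child
constraint can express directly; hash routing has to compute a function
of the key before it can pick a shard at all.

```python
import hashlib


def shard_for_key(key, n_shards):
    """Map a key to a shard number in [0, n_shards) with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Treat the first 8 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % n_shards


def shard_for_range(key, upper_bounds):
    """Range routing: the first shard whose exclusive upper bound exceeds key."""
    for shard, bound in enumerate(upper_bounds):
        if key < bound:
            return shard
    raise ValueError("key is above every range bound")
```

With upper_bounds = [10, 20], a key of 15 routes to shard 1 by simple
comparison, whereas the hash version gives no ordering to compare
against; the system has to know the hash function and the shard count.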

> Should we allow arbitrary expressions for shards, not just range, list
> and hash?

That seems like a 2.0 feature.  It also doesn't seem necessary to
support it for the moderately skilled user; that is, requiring a special
C sharding function for this seems fine to me.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


