Re: Horizontal scalability/sharding - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Horizontal scalability/sharding
Date
Msg-id 55E5DB2E.8040704@agliodbs.com
Whole thread Raw
In response to Horizontal scalability/sharding  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Horizontal scalability/sharding  (Robert Haas <robertmhaas@gmail.com>)
Re: Horizontal scalability/sharding  ("Joshua D. Drake" <jd@commandprompt.com>)
List pgsql-hackers
On 09/01/2015 02:39 AM, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
>> I'm also going to pontificate that, for a future solution, we should not
>> focus on write *IO*, but rather on CPU and RAM. The reason for this
>> thinking is that, with the latest improvements in hardware and 9.5
>> improvements, it's increasingly rare for machines to be bottlenecked on
>> writes to the transaction log (or the heap). This has some implications
>> for system design.  For example, solutions which require all connections
>> to go through a single master node do not scale sufficiently to be worth
>> bothering with.
> 
> Well, I highlighted write IO for sharding because sharding is the only
> solution that allows write scaling.  If we want to scale CPU, we are
> better off using server parallelism, and to scale CPU and RAM, a
> multi-master/BDR solution seems best.  (Multi-master doesn't do write
> scaling because you eventually have to write all the data to each node.)

You're assuming that our primary bottleneck for writes is IO.  It's not
at present for most users, and it certainly won't be in the future.  You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.

Currently, CPU resources and locking are the primary bottlenecks on
writing for the vast majority of the hundreds of servers I tune every
year.  This even includes AWS, with EBS's horrible latency; even in that
environment, most users can outstrip PostgreSQL's ability to handle
requests by getting 20K PRIOPs.

Our real future bottlenecks are:

* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches


The only place where IO becomes the bottleneck is for the
batch-processing, high-throughput DW case ... and I would argue that
existing forks already handle that case.

Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes.  Any solution which does not is merely an academic curiosity.

> For these reasons, I think sharding has a limited use, and hence, I
> don't think the community will be willing to add a lot of code just to
> enable auto-sharding.  I think it has to be done in a way that adding
> sharding also gives other benefits, like better FDWs and cross-node ACID
> control.
> 
> In summary, I don't think adding a ton of code just to do sharding will
> be acceptable.  A corollary of that, is that if FDWs are unable to
> provide useful sharding, I don't see an acceptable way of adding
> built-in sharding to Postgres.

So, while I am fully in agreement with you that having side benefits to
our sharding tools, I think you're missing the big picture entirely.  In
a few years, clustered/sharded PostgreSQL will be the default
installation, or we'll be a legacy database.  Single-node and
single-master databases are rapidly becoming history.

From my perspective, we don't need an awkward, limited, bolt-on solution
for write-scaling.  We need something which will become core to how
PostgreSQL works.  I just don't see us getting there with the described
FDW approach, which is why I keep raising issues with it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: On-demand running query plans using auto_explain and signals
Next
From: Robert Haas
Date:
Subject: Re: perlcritic