Re: How clustering for scale out works in PostgreSQL - Mailing list pgsql-performance

From Craig Ringer
Subject Re: How clustering for scale out works in PostgreSQL
Date
Msg-id 5224423E.7040707@2ndquadrant.com
Whole thread Raw
In response to How clustering for scale out works in PostgreSQL  (bsreejithin <bsreejithin@gmail.com>)
List pgsql-performance
bsreejithin wrote:
>
> I am *expecting 1000+ hits to my PostgreSQL DB* and I doubt my standalone DB
> will be able to handle it.
>
> So I want to *scale out by adding more servers to share the load*. For this,
> I want to do clustering.
>
>  DB server was 4 Core 12GB RAM.

You're jumping way ahead here. You have a medium sized server that
should effortlessly handle most loads if its I/O subsystem is up to it.

It's a good idea to plan for what you'll do as load grows, but it's not
necessary to jump straight to engineering some "web scale" monstrosity
if you don't have to.

> The performance test that was conducted was for 1 Hour.

> There are 6 transactions. 2 DB inserts and 4 SELECTs.
> Every 2 minutes there will be 4 SELECTs. And every 3 minutes there will be 2
> DB inserts.

It's not possible to give useful specific advice without knowing what
the "selects" and "updates" you're dealing with are. After all, a
single-tuple update of a non-indexed field with no trigger side-effects
will be way sub-millisecond. On the other hand, a big update over a
multi-table join that causes updates on several multi-column indexes /
GIN indexes / etc, a cascade update, etc, might take hours.

You need to work out what the actual load is. Determine whether you're
bottlenecked on disk reads, disk writes, disk flushes (fsync), CPU, etc.

Ask some basic tuning questions. Does your DB fit in RAM? Do at least
the major indexes and "hot" smaller tables fit in RAM? Is
effective_cache_size set to tell the query planner that.

Look at the query plans. Is there anything grossly unreasonable? Do you
need to adjust any tuning params (random_page_cost, etc)? Is
effective_cache_size set appropriately for the server?  Figure out
whether there are any indexes that're worth creating that won't make the
write load too much worse.

Find the point where throughput stops scaling up with load on the
server. Put a connection pooler in place and limit concurrent working
connections to PostgreSQL to about that level; overall performance will
be greatly improved by not trying to do too much all at once.

> I am *curious to know how clustering works in PostgreSQL.* (I don't want to
> know how to setup cluster - as of now. Just want to know how clustering
> works).

The "clustering" available in PostgreSQL is a variety of forms of
replication.

It is important to understand that attempting to scale out to
multi-server setups requires significant changes to many applications.
There is no transparent multi-master clustering for PostgreSQL.

If you're on a single server, you can rely on the strict rules
PostgreSQL follows for traffic isolation. It will ensure that two
updates can't conflict with row-level locking. In SERIALIZABLE isolation
it'll protect against a variety of concurrency problems.

Most of that goes away when you go multi-server. If you're using a
single master and multiple read-replicas you have to deal with lags,
where the replicas haven't yet seen / replayed transactions performed on
the master. So you might UPDATE a row in one transaction, only to find
that when you SELECT it the update isn't there ... then it suddenly
appears when you SELECT again. Additionally, long-running queries on the
read-only replicas can be aborted to allow the replica to continue
replaying changes from the master.

You can work around that one with synchronous replication, but then you
create another set of performance challenges on the master.

There are also a variety of logical / row-level replication options.
They have their own trade-offs in terms of impact on master performance,
transaction consistency, etc.

It only gets more "fun" when you want multiple masters, where you can
write to more than one server.  Don't go there unless you have to.

> When I look at some of the content available while googling, I am getting
> more and more confused, as I find that in most of the sites, clustering is
> used interchangeably with replication.

Well, a cluster of replicas is still a cluster.

If you mean "transparent multi-master clustering", well that's another
thing entirely.

I strongly recommend you go back to basics. Evaluate the capacity of the
server you've got, update PostgreSQL, characterize the load, do some
basic tuning, benchmark based on a simulation of your load, get a
connection pooler in place, do some basic query pattern and plan
analysis, etc.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-performance by date:

Previous
From: Craig Ringer
Date:
Subject: Re: How clustering for scale out works in PostgreSQL
Next
From: Łukasz Walkowski
Date:
Subject: Re: Varchar vs foreign key vs enumerator - table and index size