Re: Multi master disjoint cluster - Mailing list pgsql-general

From Vijaykumar Jain
Subject Re: Multi master disjoint cluster
Date
Msg-id CAM+6J95ZzSrdqLTi6fRquOP8gkn_R4YaqZPCoFp+itDhoyV77w@mail.gmail.com
In response to Multi master disjoint cluster  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
List pgsql-general
Actually, please ignore my email.

Re-reading my mail, it looks like I did not research this thoroughly and just asked without actually implementing both options, and without a clear goal on what I can compromise on while keeping zero read downtime.
I'll write better next time.

On Wed, Oct 26, 2022, 10:04 PM Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:
Hi all,

tl;dr
I have a simple question:
Given a choice, if I can write the same data to two databases in parallel, should I opt for a primary/replica setup or a multi-writer/multi-master setup? The setup has the ability to use Kafka consumer groups (like two replication slots, each with its own LSN offsets) to write to both database node pairs in parallel via the application layer.
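Just to make that dual-write path concrete, here is a minimal sketch of what I mean (not our actual code; the topic, table, host names and the kafka-python/psycopg2 choice are all assumptions). Each target database gets its own consumer group, so each tracks its own offsets and receives every event independently:

```python
# Sketch of the app-layer dual-write idea: one consumer group per target
# database, each with its own committed offsets, feeding the same stream.
# Assumes events(id) has a unique constraint so replays are harmless.
import json
import psycopg2
from kafka import KafkaConsumer

def run_writer(group_id: str, dsn: str) -> None:
    consumer = KafkaConsumer(
        "events",                            # made-up topic name
        group_id=group_id,                   # one group per target database
        bootstrap_servers=["kafka:9092"],
        enable_auto_commit=False,
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        for msg in consumer:
            cur.execute(
                "INSERT INTO events (id, payload) VALUES (%s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (msg.value["id"], json.dumps(msg.value)),
            )
            conn.commit()
            consumer.commit()                # commit offset only after the row is durable

# run_writer("writer-node-a", "dbname=app host=node-a")
# run_writer("writer-node-b", "dbname=app host=node-b")
```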

The data churn is really high; a lot of WAL is generated, around 500 GB/hr.

If I go with primary/replica (logical replication is not feasible):

I need to ensure both are on the same major version. Upgrades are tricky (we don't have QA), so our only option is a schema dump to check upgrade compatibility. As for the data, we trust PostgreSQL for that :). (I wish we had ZFS everywhere, but no.)
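For what it's worth, the schema-dump check I mean looks roughly like this (a sketch only; the host names are made up, and it assumes pg_dump/psql from the newer release are on the PATH):

```python
# Rough sketch of the "schema dump" compatibility check: dump schema-only DDL
# from the current primary and replay it into a scratch database running the
# newer major version; errors flag objects that will not survive the upgrade.
import subprocess

# schema-only dump from the live primary (no data)
dump = subprocess.run(
    ["pg_dump", "--schema-only", "--no-owner", "-h", "primary.example", "-d", "app"],
    check=True, capture_output=True, text=True,
).stdout

# replay against a throwaway database on the newer version
restore = subprocess.run(
    ["psql", "-h", "scratch-new-version.example", "-d", "app_upgrade_test",
     "-v", "ON_ERROR_STOP=1"],
    input=dump, capture_output=True, text=True,
)
print("schema restores cleanly" if restore.returncode == 0 else restore.stderr)
```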

Any exclusive table-locking operations (although later versions have far fewer blocking operations) can impact queries on the replica as well (delay settings aside).

Corruption can cause downtime (we have tons of them, and RAID to protect against them). If the replica is having issues, we can zero the pages on the replica and do some operations once we isolate the problem pages, else resync the replica from the primary. But if the primary is having issues, we copy data from the replica to disk and copy it into the primary after truncating, etc. Some downtime, but not a lot. (I am no expert at data recovery) and mostly rely on amcheck, dd, and RAID checks.
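The amcheck pass I lean on is roughly this (a sketch; the host name is made up, and it assumes the amcheck extension was already created on the primary so it exists on the replica too). Any corruption surfaces as an error naming the broken index:

```python
# Run bt_index_check over every btree index on the suspect node. It takes only
# light locks, so it is also safe to run on a hot standby.
import psycopg2

with psycopg2.connect("dbname=app host=replica.example") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT c.relname, bt_index_check(i.indexrelid)
            FROM pg_index i
            JOIN pg_class c ON c.oid = i.indexrelid
            JOIN pg_am am ON am.oid = c.relam
            WHERE am.amname = 'btree'
        """)
        # an exception above would point at the index that needs repair
        print(f"checked {cur.rowcount} btree indexes, no corruption reported")
```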

We don't use PITR (too many WALs × 58) or delayed replication, as we can't afford more servers.

DDL deploys are guaranteed by replication, so there is no need to try 2PC-like stuff at the app layer. (Although apps use deploy tools to ensure the DDLs are eventually consistent and idempotent.)
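By idempotent DDL I mean something like the sketch below (table and index names are made up): every statement is safe to re-run, so applying the same deploy twice, or to both write nodes, is harmless:

```python
# Minimal sketch of an idempotent DDL deploy applied to both write nodes.
import psycopg2

DDL = [
    "CREATE TABLE IF NOT EXISTS events (id bigint PRIMARY KEY, payload jsonb)",
    "CREATE INDEX IF NOT EXISTS events_payload_idx ON events USING gin (payload)",
    "ALTER TABLE events ADD COLUMN IF NOT EXISTS created_at timestamptz DEFAULT now()",
]

def deploy(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for stmt in DDL:
            cur.execute(stmt)   # IF NOT EXISTS makes a second run a no-op

for dsn in ("dbname=app host=node-a", "dbname=app host=node-b"):
    deploy(dsn)
```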

Basically, primary/replica relieves the app: whatever is on the primary is eventually also on the replica, so there is a single source of truth.

But with multiple writers, any app mishandling (like a bug in exception handling, etc.) can result in divergence, and it is no longer a mirrored setup.
We need checks/reconciliation to ensure both write nodes in a pair have almost identical data at the end of the day, so we can trust this setup independent of any app mistakes.
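The reconciliation I have in mind is roughly this (a sketch, not production code; the table list and hosts are made up, and a real check would need to bound by time window): compare a cheap per-table fingerprint on both write nodes and flag anything that diverged:

```python
# End-of-day reconciliation sketch: row count plus an order-independent
# checksum per table, computed on both write nodes and compared in the app.
import psycopg2

TABLES = ["events", "orders"]          # made-up table names

def table_fingerprint(cur, table: str):
    # count + sum of per-row text hashes; cheap enough to run nightly
    cur.execute(
        f"SELECT count(*), coalesce(sum(hashtext(t::text)), 0) FROM {table} t"
    )
    return cur.fetchone()

def fingerprints(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        return {t: table_fingerprint(cur, t) for t in TABLES}

a = fingerprints("dbname=app host=node-a")
b = fingerprints("dbname=app host=node-b")
for t in TABLES:
    if a[t] != b[t]:
        print(f"{t}: node-a={a[t]} node-b={b[t]} -- diverged, needs repair")
```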

But if the app layer gets robust, we have almost no downtime for reads and writes; we can have both nodes on different versions (without logical replication), can query both nodes in real time, have no real replication lag issues, conflicts, etc., can upgrade blue/green style, and can canary-test some changes on one node if needed.

Am I making sense at all? Or am I sounding confused, as if I don't know the difference between primary/replica vs. multi-writer? This is not a BDR-like thing; the nodes don't really need each other unless we are doing some recovery.

My point is, we have 58 such primary/replica shards (each 10 TB+), with consistent hashing at the app layer (no FDW), and there is no scope for read downtime, so any issue like post-upgrade performance degradation (if any) gives me chills. And we have no QA to test with real data.
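By "consistent hashing at the app layer" I mean something like this minimal ring (not our actual code; shard names and vnode count are illustrative), so a key always maps to the same shard pair:

```python
# Minimal consistent-hash ring over the 58 shards.
import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes: int = 64):
        self._ring = sorted(
            (self._hash(f"{shard}-{i}"), shard)
            for shard in shards
            for i in range(vnodes)          # virtual nodes smooth the distribution
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"shard-{n:02d}" for n in range(58)])
print(ring.shard_for("customer-12345"))     # same key -> same shard every time
```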

There are too many dimensions to shard on, and aggregations need to run across the shards (yes, there is no scope for data isolation).
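Those cross-shard aggregations are plain scatter-gather at the app layer, roughly like this sketch (hypothetical query, table, and DSNs):

```python
# Scatter the same query to every shard and merge the partial results in the app.
import psycopg2

SHARD_DSNS = [f"dbname=app host=shard-{n:02d}" for n in range(58)]

def total_events_per_day(day: str) -> int:
    total = 0
    for dsn in SHARD_DSNS:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM events WHERE created_at::date = %s", (day,)
            )
            total += cur.fetchone()[0]      # merge partial counts in the app
    return total

print(total_events_per_day("2022-10-26"))
```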
