Home > mailing lists

proposal: multiple read-write masters in a cluster with wal-streaming synchronization - Mailing list pgsql-hackers

From	Mark Dilger
Subject	proposal: multiple read-write masters in a cluster with wal-streaming synchronization
Date	December 31, 2013 23:02:07
Msg-id	1388519950.72341.YahooMailNeo@web125406.mail.ne1.yahoo.com Whole thread Raw
Responses	Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization
List	pgsql-hackers

Tree view

This is not entirely "pie in the sky", but feel free to tell me why this is crazy. I

have had this idea for several years, but have not seen anyone else suggest it,

nor any arguments why it would not work.

If we had 64-bit Oids, we could reserve the top 16 bits (for instance) to indicate a

server ID. Each server in a cluster could be configured with a unique serverID in

[0..65535]. Every table, index, etc that is created on a server could be assigned

an Oid congruent to this serverID modulus 65536. Each server would still have a

total of 2^48 Oids that it could assign before reaching Oid exhaustion, which is

64k times more Oids than the current design. (The 16bit / 48bit split is arbitrary,

and could be 8bit / 56bit or whatever.)

Any INSERT, UPDATE, DELETE, INDEX, REINDEX, DROP that was run on a server

could be rejected if the object has the serverId of a different server in the cluster.

The serverId could be kept at all times in memory, and Oids for database objects

are always looked up when handling these SQL operations anyway, so there

should be effectively zero overhead for rejecting invalid operations that attempt

to change on server X a table/index/whatever which belongs to server Y.

I don't see how to run such a setup in serializable mode, but with read committed
it should be ok for server X to change its own tables based on what it perceives
to be the current state of data on server X and Y (reading local copies of tables
from Y) at the same time that server Y changes its own tables based on what it
perceives to be the current state of data on server Y and X (reading local copies
of tables from X). The inconsistencies possible from these actions seem to me
the same as the inconsistencies possible when the two operations are happening
on the same server simultaneously in read committed mode.

WAL records from each server should only ever change objects belonging to that

server, so with some effort it might be possible for every server in the cluster to

replay the WAL records from every other server without collisions. (This is the point

where my idea might be crazy -- I'm not sure how difficult/impossible it would

be to merge WAL information from multiple sources).

The total amount of WAL data that needs to be replayed on any given server would

be proportion to the total amount of data changes cluster-wide, which is no different

than the current state of affairs; it just happens to all come from a single master in

the current design.

The performance problems that at first seem inevitable in this design owing to the

need for each server to wait for wal replay from other servers to avoid data inconsistencies

would only actually happen if updates were frequently based on reads from other

servers. If the vast majority of data changes were not based on the contents of

remotely owned tables/indexes/sequences and such, then changes to local data

could proceed without stalling.

Therefore, this approach would probably be most appropriate for highly partitioned

data, with each partition being owned by one server in the cluster, and modifications

based on data from more than one partition being rare. (SELECTs based on data

from multiple partitions should be just fine.)

If you think about it, that's just an extension of the current architecture. Currently,

all data is partitioned into a single partition belonging to the master, with zero partitions

on the slaves, and all updates are performed based on a single partition, that being

the one on the master. It just isn't obvious that this is the design, because it is a

degenerate case of the more general case that I am describing.

Application software that wants to run INSERTS, UPDATE, etc. would have to connect

to the appropriate server for the table in question, but that might not be too bad

if the data is partitioned anyway -- just connect to the server with the partition you

want to change.

Thanks in advance for having even read this far....

Mark Dilger

pgsql-hackers by date:

From: Pavel Stehule
Date: 31 December 2013, 22:23:02
Subject: Re: proposal: persistent plpgsql plugin info - field plugin_info for plpgsql_function structure

From: Alvaro Herrera
Date: 31 December 2013, 23:32:27
Subject: Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

proposal: multiple read-write masters in a cluster with wal-streaming synchronization - Mailing list pgsql-hackers

Previous

Next