Thread: Transaction-controlled robustness for replication
One of the cool features of 8.3 was the ability to control at the transaction level whether we use synchronous or asynchronous commit. We're planning to add integrated replication features to 8.4, and I think it will be possible to extend the concept of asynchronous commit to a more general form of transaction-level robustness. Note that it has already been established that it's possible to mix asynchronous and synchronous transactions on the same system, so this is just a straight extension of that concept.

Asynchronous commit controls whether we go to disk at time of commit, or whether we defer this slightly. We have the same options with replication: do we replicate at time of commit, or do we defer this slightly for performance reasons. DRBD and other replication systems show us that there is actually another difference when talking about synchronous replication: do we go to disk on the standby before acknowledging the primary?

We can generalise this as three closed questions, answered either Yes (Synchronous) or No (Asynchronous):

* Does WAL get forced to disk on primary at commit time?
* Does WAL get forced across link to standby at commit time?
* Does WAL get forced to disk on standby at commit time?

In code, these are simple if tests: Do we wait, or not?

We could represent this with 3 parameters:

synchronous_commit = on | off
synchronous_standby_transfer = on | off
synchronous_standby_wal_fsync = on | off

If we are able to define these robustness characteristics for each transaction *separately* then it will represent an industry first: no other database has that capability implemented or planned on published roadmaps, nor has it been discussed in research to my knowledge.

Changing the parameter setting at transaction level would be expensive if we had to set three parameters.
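Purely as an illustration of the "simple if tests" point (this is not PostgreSQL source; all names are invented), the three closed questions reduce to three independent waits at commit time:

```python
# Hypothetical sketch: the three yes/no questions as simple if-tests at
# commit time. Parameter names follow the proposal in the mail; nothing
# here is real PostgreSQL code.

def commit_waits(sync_commit, sync_standby_transfer, sync_standby_wal_fsync):
    """Return the events a backend would wait for at COMMIT."""
    waits = []
    if sync_commit:                  # WAL forced to disk on primary?
        waits.append("primary_fsync")
    if sync_standby_transfer:        # WAL forced across the link to standby?
        waits.append("standby_transfer")
    if sync_standby_wal_fsync:       # WAL forced to disk on standby?
        waits.append("standby_fsync")
    return waits
```

With all three off, commit waits for nothing (fully asynchronous); with all three on, it waits for the full chain.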
Or we could use just two parameters:

synchronous_commit = on | off
synchronous_replication = 'AA', 'SA' or 'SS'
with A = Asynchronous, S = Synchronous

which corresponds with DRBD's algorithms like this:

DRBD A = AA
DRBD B = SA
DRBD C = SS

Or we could use just a single parameter:

synchronous_commit = 'AAA', 'SAA', 'SSA', 'SSS', or on | off when no log-based replication is defined

Having the ability to set these at the transaction level would be very cool. Having it set via a *single* parameter would make it much more viable to switch between AAA for bulk, low-importance data and SSS for very low volume, critical data, or somewhere in between, on the same server at the same time.

So the proposal in summary is:

* allow various modes of synchronous replication for perf/robustness
* allow modes to be specified per-transaction
* allow modes to be specified as a single parameter

I think Itagaki may have described similar concepts at PGCon 2008, but this thread has been started to make sure that meme definitely has been released into the wild, and to discuss how we might specify it.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
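A small sketch (illustrative only, not an implementation) of how the single-parameter encoding lines up with the DRBD protocols named above, and of the constraint that only "monotone" combinations make sense — once a later stage is synchronous, every earlier stage must be too:

```python
# Illustrative mapping of the proposed two-letter setting onto DRBD's
# protocols, plus a decoder for the single-parameter form. The mode names
# are from the proposal; the code itself is hypothetical.

DRBD_EQUIVALENT = {"AA": "DRBD A", "SA": "DRBD B", "SS": "DRBD C"}

VALID_MODES = ("AAA", "SAA", "SSA", "SSS")  # the four listed in the thread

def decode(mode):
    """Split e.g. 'SSA' into (primary_fsync, standby_transfer, standby_fsync)."""
    if mode not in VALID_MODES:
        raise ValueError("an 'S' may not follow an 'A': %r" % mode)
    return tuple(ch == "S" for ch in mode)
```

So `decode("SSA")` yields the three yes/no answers directly, and an illogical value such as 'SAS' is rejected up front.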
Hi,

very nice proposal and thoughts. Allow me to compare against Postgres-R again.

Simon Riggs wrote:
> Asynchronous commit controls whether we go to disk at time of commit, or
> whether we defer this slightly. We have the same options with
> replication: do we replicate at time of commit, or do we defer this
> slightly for performance reasons. DRBD and other replication systems
> show us that there is actually another difference when talking about
> synchronous replication: do we go to disk on the standby before
> acknowledging the primary?

Yeah, I was thinking in the same direction for Postgres-R. There already exist three replication levels: sync, eager and lazy. With more than just a primary and a standby server in mind, one can also argue about how many remote nodes need to have written the changes to disc before commit is confirmed in 'sync' mode. At least a majority is required; probably more nodes are wanted.

The eager mode is what the original Postgres-R approach is all about, and is pretty much the only one I've implemented at the moment. It only requires confirmation from the GCS, which means at least a majority of the nodes have received the change set (and will be able to apply it). (This leads to a corner case for a full cluster outage, see [1].)

In async mode, commit is confirmed before sending the change set to other nodes.

> If we are able to define these robustness characteristics for each
> transaction *separately* then it will represent an industry first:

Yeah, that would be pretty cool.

> no other database has that capability implemented or planned on published
> roadmaps, nor has it been discussed in research to my knowledge.

Well, a partial implementation in Postgres-R, if that counts... ;-)

Regards

Markus

[1]: One of the few threads on the Postgres-R-general mailing list:
http://pgfoundry.org/pipermail/postgres-r-general/2006-August/000002.html
On 7/22/08, Simon Riggs <simon@2ndquadrant.com> wrote:
> We could represent this with 3 parameters:
> synchronous_commit = on | off
> synchronous_standby_transfer = on | off
> synchronous_standby_wal_fsync = on | off
>
> If we are able to define these robustness characteristics for each
> transaction *separately* then it will represent an industry first: no
> other database has that capability implemented or planned on published
> roadmaps, nor has it been discussed in research to my knowledge.
>
> Changing the parameter setting at transaction-level would be expensive
> if we had to set three parameters.

How about extending BEGIN with additional keywords?

-- 
marko
On Wed, 2008-07-23 at 01:39 +0300, Marko Kreen wrote:
> On 7/22/08, Simon Riggs <simon@2ndquadrant.com> wrote:
> > We could represent this with 3 parameters:
> > synchronous_commit = on | off
> > synchronous_standby_transfer = on | off
> > synchronous_standby_wal_fsync = on | off
> >
> > If we are able to define these robustness characteristics for each
> > transaction *separately* then it will represent an industry first: no
> > other database has that capability implemented or planned on published
> > roadmaps, nor has it been discussed in research to my knowledge.
> >
> > Changing the parameter setting at transaction-level would be expensive
> > if we had to set three parameters.
>
> How about extending BEGIN with additional keywords?

SQL Standard, plus many interfaces hide BEGIN from you.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> Asynchronous commit controls whether we go to disk at time of commit, or
> whether we defer this slightly. We have the same options with
> replication: do we replicate at time of commit, or do we defer this
> slightly for performance reasons. DRBD and other replication systems
> show us that there is actually another difference when talking about
> synchronous replication: do we go to disk on the standby before
> acknowledging the primary?
>
> We can generalise this as three closed questions, answered either Yes
> (Synchronous) or No (Asynchronous)
>
> * Does WAL get forced to disk on primary at commit time?
> * Does WAL get forced across link to standby at commit time?
> * Does WAL get forced to disk on standby at commit time?

* Does WAL get applied [and synced] to disk on standby at commit time?

This is important if you want to use the standby as a read-only. I am slightly confused about what the fsync setting does to all this, hence the brackets.

I think that questions 2 and 3 are trivially bundled together. Once the user can specify 2, implementing 3 should be trivial, and vice versa. I am not even convinced that these need to be two different parameters. Also please note that an answer of "yes" to 3 means that 2 must also be answered "yes".

> We could represent this with 3 parameters:
> synchronous_commit = on | off
> synchronous_standby_transfer = on | off
> synchronous_standby_wal_fsync = on | off

synchronous_standby_apply = on | off   # just to propose a name

> Changing the parameter setting at transaction-level would be expensive
> if we had to set three parameters.

What exactly does "expensive" mean? All three parameters can probably be set in one TCP packet from client to server.
> Or we could use just a single parameter
> synchronous_commit = 'AAA', 'SAA', 'SSA', 'SSS' or on | off when no
> log-based replication is defined
>
> Having the ability to set these at the transaction-level would be very
> cool. Having it set via a *single* parameter would make it much more
> viable to switch between AAA for bulk, low importance data and SSS for
> very low volume, critical data, or somewhere in between on the same
> server, at the same time.

The problem with a single parameter is that everything becomes position-dependent, and if a new parameter is ever introduced, it's not easy to upgrade old application code.

> So proposal in summary is
> * allow various modes of synchronous replication for perf/robustness
> * allow modes to be specified per-transaction
> * allow modes to be specified as a single parameter

How about creating named modes? This would give the user the ability to define more fine-grained control, especially in larger clusters of fail-over/read-only servers, without totally clogging the parameter space and application code. Whether this should be done SQL-style or in some config file is not so clear to me, although I'd prefer SQL-style like

CREATE SYNCHRONIZING MODE immediate_readonly AS
    LOCAL        SYNCHRONOUS APPLY
    192.168.0.10 SYNCHRONOUS APPLY   -- read-only slave
    192.168.0.11 SYNCHRONOUS APPLY   -- read-only slave
    192.168.0.20 SYNCHRONOUS SHIP    -- backup-server
    192.168.0.21 SYNCHRONOUS SHIP    -- backup-server
    192.168.0.30 SYNCHRONOUS FSYNC   -- backup-server with fast disks
;

and then something like

synchronize_mode = immediate_readonly;

Yeah, I know, give patches not pipe-dreams :)

Regards,
Jens-Wolfhard Schicke
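A rough, purely hypothetical model of the named-mode idea (the mode name, addresses and guarantee names are taken from the example above; none of this is implemented anywhere): a mode maps each node to the strongest guarantee that must hold before COMMIT returns, with APPLY implying FSYNC implying SHIP.

```python
# Hypothetical model of "named synchronizing modes". SHIP = WAL received by
# the node, FSYNC = WAL on the node's disk, APPLY = changes applied/visible.

MODES = {
    "immediate_readonly": {
        "local":        "APPLY",
        "192.168.0.10": "APPLY",   # read-only slave
        "192.168.0.11": "APPLY",   # read-only slave
        "192.168.0.20": "SHIP",    # backup server
        "192.168.0.21": "SHIP",    # backup server
        "192.168.0.30": "FSYNC",   # backup server with fast disks
    },
}

STRENGTH = {"SHIP": 1, "FSYNC": 2, "APPLY": 3}

def must_wait_for(mode_name, node, event):
    """Does a commit under this mode wait for `event` on `node`?"""
    required = MODES[mode_name].get(node)
    return required is not None and STRENGTH[event] <= STRENGTH[required]
```

So a commit in `immediate_readonly` waits for the backup servers only to receive the WAL, but waits for the read-only slaves to actually apply it.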
Hi,

Jens-Wolfhard Schicke wrote:
>> * Does WAL get forced to disk on primary at commit time?
>> * Does WAL get forced across link to standby at commit time?
>> * Does WAL get forced to disk on standby at commit time?
> * Does WAL get applied [and synced] to disk on standby at commit time?

I think that's what Simon means by his question no 3. It wouldn't make much sense to me otherwise. I'm assuming the standby node has its own physical format, so the changes from the remote WAL need to be transformed into a local WAL, which then needs to be written to disc. For Postgres, this pretty much means applying the changes and committing them. You never need to store the remote WAL on physical storage; what would that be good for?

> I think that questions 2 and 3 are trivially bundled together. Once the
> user can specify 2, implementing 3 should be trivial and vice versa.

That might well be, yes. The code to collect changes from a transaction and then apply them remotely is pretty much the same, no matter when it is being executed. But it certainly makes a difference in the balance between performance and availability, which is a decision the user should be able to make for his specific application (or even better, per transaction, as proposed here and in Postgres-R).

> I am not even convinced that these need to be two different parameters.

Consider a standby heavily loaded (i/o) with some OLAP queries. Why should the master wait until the standby has written anything to disk for him?

> Also please note that an answer of "yes" to 3 means that 2 must also
> be answered "yes".

Agreed. There's no 'AS' mode possible, only 'SS', 'SA' and 'AA'.

> How about creating named modes? This would give the user the ability to
> define more fine-grained control especially in larger clusters of
> fail-over/read-only servers without totally clogging the parameter space
> and application code.
> Whether this should be done SQL-style or in some config file is not so
> clear to me, although I'd prefer SQL-style like
>
> CREATE SYNCHRONIZING MODE immediate_readonly AS
>     LOCAL        SYNCHRONOUS APPLY
>     192.168.0.10 SYNCHRONOUS APPLY   -- read-only slave
>     192.168.0.11 SYNCHRONOUS APPLY   -- read-only slave
>     192.168.0.20 SYNCHRONOUS SHIP    -- backup-server
>     192.168.0.21 SYNCHRONOUS SHIP    -- backup-server
>     192.168.0.30 SYNCHRONOUS FSYNC   -- backup-server with fast disks
> ;

Hm, that's an interesting idea, especially considering the number of options that arise with more than two or three nodes, where you maybe also want to specify how many nodes must have written the changes to disk before confirming the commit.

In Postgres-R, I've added a TRANSACTION REPLICATION LEVEL, which can be either SYNC, EAGER or LAZY. Maybe that's not quite sufficient. On the other hand, I don't think any other option here makes any sense. (Above, you yourself doubt that sync is different enough from eager.)

Regards

Markus
On Wed, 2008-07-23 at 10:49 +1000, Jens-Wolfhard Schicke wrote:
> Simon Riggs wrote:
> > Asynchronous commit controls whether we go to disk at time of commit, or
> > whether we defer this slightly. We have the same options with
> > replication: do we replicate at time of commit, or do we defer this
> > slightly for performance reasons. DRBD and other replication systems
> > show us that there is actually another difference when talking about
> > synchronous replication: do we go to disk on the standby before
> > acknowledging the primary?
> >
> > We can generalise this as three closed questions, answered either Yes
> > (Synchronous) or No (Asynchronous)
> >
> > * Does WAL get forced to disk on primary at commit time?
> > * Does WAL get forced across link to standby at commit time?
> > * Does WAL get forced to disk on standby at commit time?
> * Does WAL get applied [and synced] to disk on standby at commit time?
> This is important if you want to use the standby as a read-only.

That's an assumption - I'm not sure it's a requirement in all cases. If a standby query needed to see particular data then the *query* would wait until the correct data has been applied. I certainly wouldn't want to penalise writing transactions on the primary because there *might* be a statement on the standby that wishes to see an updated view.

> I am slightly confused about what the fsync setting does to all this,
> hence the brackets.

There is no sync() during WAL apply when each individual transaction hits commit. This is because there is "no WAL", i.e. changes come from the WAL to the database, so we have no need of a second WAL to protect the changes being made.

> I think that questions 2 and 3 are trivially bundled together. Once the
> user can specify 2, implementing 3 should be trivial and vice versa.
> I am not even convinced that these need to be two different parameters.
> Also please note that an answer of "yes" to 3 means that 2 must also
> be answered "yes".

Yes, they are trivially bundled together, but there is benefit in keeping them separate. The difference between 2 and 3 is about performance and levels of robustness. Waiting for transfer across the link to the standby (only) is much faster than waiting for transfer *and* waiting for fsync - probably twice as fast in a tightly coupled cluster. I.e. option 3 will make your transactions somewhat more robust, but with twice the response time and half the throughput.

> > We could represent this with 3 parameters:
> > synchronous_commit = on | off
> > synchronous_standby_transfer = on | off
> > synchronous_standby_wal_fsync = on | off
> synchronous_standby_apply = on | off   # just to propose a name
>
> > Changing the parameter setting at transaction-level would be expensive
> > if we had to set three parameters.
> What exactly does "expensive" mean? All three parameters can probably be
> set in one TCP packet from client to server.

Expensive as in we need to parse and handle each statement separately. If we have a single parameter then the overhead is much lower.

> > Or we could use just a single parameter
> > synchronous_commit = 'AAA', 'SAA', 'SSA', 'SSS' or on | off when no
> > log-based replication is defined
> >
> > Having the ability to set these at the transaction-level would be very
> > cool. Having it set via a *single* parameter would make it much more
> > viable to switch between AAA for bulk, low importance data and SSS for
> > very low volume, critical data, or somewhere in between on the same
> > server, at the same time.
> The problem with a single parameter is that everything becomes position
> dependent and if a new parameter is ever introduced, it's not easy to
> upgrade old application code.

True, but what new parameter do you imagine?
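To make the "twice the response time" argument concrete, here is a toy latency model. The numbers are invented purely for illustration (they assume one fsync and one network round trip cost about the same in a tightly coupled cluster); only the relative shape of the argument is from the thread:

```python
# Toy model of commit latency per replication mode, assuming (hypothetically)
# 5 ms for an fsync and 5 ms for a network round trip. Real numbers differ.

COSTS_MS = {
    "primary_fsync":    5.0,
    "standby_transfer": 5.0,
    "standby_fsync":    5.0,
}

STEPS = {
    "AAA": [],
    "SAA": ["primary_fsync"],
    "SSA": ["primary_fsync", "standby_transfer"],
    "SSS": ["primary_fsync", "standby_transfer", "standby_fsync"],
}

def commit_latency_ms(mode):
    return sum(COSTS_MS[step] for step in STEPS[mode])
```

Under these assumptions the replication wait added by SSS (transfer plus standby fsync) is twice that added by SSA (transfer only), which is the trade-off being described.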
> > So proposal in summary is
> > * allow various modes of synchronous replication for perf/robustness
> > * allow modes to be specified per-transaction
> > * allow modes to be specified as a single parameter
> How about creating named modes?

Good idea.

> This would give the user the ability to define more fine-grained control,
> especially in larger clusters of fail-over/read-only servers, without
> totally clogging the parameter space and application code. Whether this
> should be done SQL-style or in some config file is not so clear to me,
> although I'd prefer SQL-style like
>
> CREATE SYNCHRONIZING MODE immediate_readonly AS
>     LOCAL        SYNCHRONOUS APPLY
>     192.168.0.10 SYNCHRONOUS APPLY   -- read-only slave
>     192.168.0.11 SYNCHRONOUS APPLY   -- read-only slave
>     192.168.0.20 SYNCHRONOUS SHIP    -- backup-server
>     192.168.0.21 SYNCHRONOUS SHIP    -- backup-server
>     192.168.0.30 SYNCHRONOUS FSYNC   -- backup-server with fast disks
> ;

That's not how we define parameter values, so no.

> and then something like
>
> synchronize_mode = immediate_readonly;
>
> Yeah, I know, give patches not pipe-dreams :)

Ah yes. Of course.

The only sensible options are these four:

AAA
SAA
SSA
SSS

plus the existing on & off. So we give them 4 names and set it using a single parameter value.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

Simon Riggs wrote:
> There is no sync() during WAL apply when each individual transaction
> hits commit. This is because there is "no WAL", i.e. changes come from
> the WAL to the database, so we have no need of a second WAL to protect
> the changes being made.

Aha, that makes much more sense to me now. So you are not just using the WAL to transfer changes; the remote WAL replaces the one on the standby completely. Do you have replication to multiple slaves on the radar?

What I still don't understand is why you are speaking about "logical" replication. It rather sounds like an ordinary log shipping approach, where the complete WAL is sent over the wire. Nothing wrong with that; it certainly fits many needs and I welcome such a solution for Postgres. In another thread, you are saying that application of the WAL on the standby is "logical". This sounds like one could choose to skip certain parts of the WAL on the standby, but still the complete WAL would have to be sent. So this would only save a bit of i/o on the standby, right?

> Expensive as in we need to parse and handle each statement separately.
> If we have a single parameter then the overhead is much lower.

Is that really much of a concern when otherwise caring about network and i/o latency? And what about sane default settings per session and database, so you won't even need to explicitly set them for the majority of transactions?

Regards

Markus
On Sat, 2008-07-26 at 10:17 +0200, Markus Wanner wrote:
> > Expensive as in we need to parse and handle each statement separately.
> > If we have a single parameter then the overhead is much lower.
>
> Is that really much of a concern when otherwise caring about network
> and i/o latency?

I believe so. Jens-Wolfhard has provided the solution, it seems.

> And what about sane default settings per session and
> database, so you won't even need to explicitly set them for the
> majority of transactions?

Session pools.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sat, 2008-07-26 at 10:17 +0200, Markus Wanner wrote:
> What I still don't understand is, why you are speaking about "logical"
> replication. It rather sounds like an ordinary log shipping approach,
> where the complete WAL is sent over the wire. Nothing wrong with that,
> it certainly fits many needs and I welcome such a solution for Postgres.

Yes, first stage is log shipping. Second stage is either physical or logical apply.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> We can generalise this as three closed questions, answered either Yes
> (Synchronous) or No (Asynchronous)
>
> * Does WAL get forced to disk on primary at commit time?
> * Does WAL get forced across link to standby at commit time?
> * Does WAL get forced to disk on standby at commit time?
>
> In code, these are simple if tests: Do we wait, or not?
>
> We could represent this with 3 parameters:
> synchronous_commit = on | off
> synchronous_standby_transfer = on | off
> synchronous_standby_wal_fsync = on | off
> ...
> Or we could use just two parameters:
> synchronous_commit = on | off
> synchronous_replication = 'AA', 'SA' or 'SS'
> with A = Asynchronous, S = Synchronous
> which corresponds with DRBD's algorithms like this
> DRBD A = AA
> DRBD B = SA
> DRBD C = SS
>
> Or we could use just a single parameter
> synchronous_commit = 'AAA', 'SAA', 'SSA', 'SSS' or on | off when no
> log-based replication is defined

First, I love this kind of control.

Second, I think our current controls for the main/master are sufficient, so we don't need to add additional variables for this; we can focus just on whether we wait for the slave.

Third, we can control whether the slave WAL fsyncs by setting things on the slave, so I don't think we need a master setting for that.

Fourth, I don't think the boolean idea will work, because there are boolean combinations that are illogical, like having synchronous_standby_transfer off and synchronous_standby_wal_fsync on. I think you need to make it an enumerated type like log_min_messages; something like:

wal_transfer_wait

with values of:

	nothing:  have network traffic send WAL as needed
	netflush: wait for flush of WAL network packets to slave
	process:  wait for slave to process WAL traffic and optionally fsync

The 'process' option either waits for fsync on the slave or not, depending on how the slave is configured, meaning you could use synchronous_commit off to not wait for the fsync to disk.
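The enumerated-type idea can be sketched as an ordered value where each level implies the weaker ones below it, which is exactly what rules out the illogical boolean combinations. This is an illustrative sketch only, not a GUC implementation; the value names are from the mail:

```python
from enum import IntEnum

# Hypothetical model of the proposed ordered setting. Like log_min_messages,
# it is a single scale: you cannot wait for slave processing without also
# having waited for the network flush.

class WalTransferWait(IntEnum):
    NOTHING = 0    # have network traffic send WAL as needed
    NETFLUSH = 1   # wait for flush of WAL network packets to slave
    PROCESS = 2    # wait for slave to process WAL (and optionally fsync)

def waits_for_network(setting):
    return setting >= WalTransferWait.NETFLUSH

def waits_for_slave_processing(setting):
    return setting >= WalTransferWait.PROCESS
```

By construction, any setting that waits for slave processing also waits for the network flush, so the "transfer off, fsync on" combination simply cannot be expressed.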
I think we can add a table in the documentation to show how to set things up in postgresql.conf for an async master and async slave, and other combinations.

-- 
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs wrote: > > On Sat, 2008-07-26 at 10:17 +0200, Markus Wanner wrote: > > What I still don't understand is, why you are speaking about > > "logical" > > replication. It rather sounds like an ordinary log shipping approach, > > where the complete WAL is sent over the wire. Nothing wrong with > > that, > > it certainly fits many needs and I welcome such a solution for > > Postgres. > > Yes, first stage is log shipping. Second stage is either physical or > logical apply. What is the attraction of logical application of the WAL logs? Transmitting to a server with different architecture? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-08-12 at 11:52 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Sat, 2008-07-26 at 10:17 +0200, Markus Wanner wrote:
> > > What I still don't understand is, why you are speaking about "logical"
> > > replication. It rather sounds like an ordinary log shipping approach,
> > > where the complete WAL is sent over the wire. Nothing wrong with that,
> > > it certainly fits many needs and I welcome such a solution for Postgres.
> >
> > Yes, first stage is log shipping. Second stage is either physical or
> > logical apply.
>
> What is the attraction of logical application of the WAL logs?
> Transmitting to a server with different architecture?

Yes,

* different release
* different encoding
* different CPU architecture
* (with the correct transform) a different DBMS

So logical apply can provide a route for data transfer between applications, not just replication for DR or HA.

Physical apply works and will be more performant, but it will always be more restrictive. So there are arguments for doing it both ways. I believe that we can and should offer both options to provide customer choice. Ideally, it would be nice to be able to switch between the two without significant reconfiguration, but that's definitely not for this release. (Pragmatically, implementation will be limited by my funding.)

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-08-12 at 11:51 -0400, Bruce Momjian wrote:
> I think you need to make it an enumerated type like log_min_messages;
> something like:
>
> wal_transfer_wait

Yeh, that way sounds best, and I like the name.

> with values of:
>
>	nothing:  have network traffic send WAL as needed
>	netflush: wait for flush of WAL network packets to slave
>	process:  wait for slave to process WAL traffic and
>		  optionally fsync

Suggest

	async
	syncnet
	syncdisk

> The 'process' option either waits for fsync on the slave or not
> depending on how the slave is configured, meaning you could use
> synchronous_commit off to not wait for the fsync to disk.

Hmmm, not sure that flicking a switch on the standby should cause a loss of performance on the master. That will be an accident waiting to happen. Best to make all things that affect the performance/robustness of the master be configuration options on the master side.

> I think we can add a table in the documentation to show how to set things
> up in postgresql.conf for async master and async slave, and other
> combinations.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> > What is the attraction of logical application of the WAL logs?
> > Transmitting to a server with different architecture?
>
> Yes,
>
> * different release
> * different encoding
> * different CPU architecture
> * (with the correct transform) a different DBMS
>
> So logical apply can provide a route for data transfer between
> applications, not just replication for DR or HA.
>
> Physical apply works and will be more performant, but it will always be
> more restrictive. So there are arguments for doing it both ways.
>
> I believe that we can and should offer both options to provide customer
> choice. Ideally, it would be nice to be able to switch between the two
> without significant reconfiguration, but that's definitely not for this
> release.

I assume you are focusing on physical first, because that is easiest.

-- 
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs wrote:
> On Tue, 2008-08-12 at 11:51 -0400, Bruce Momjian wrote:
> > I think you need to make it an enumerated type like log_min_messages;
> > something like:
> >
> > wal_transfer_wait
>
> Yeh, that way sounds best, and I like the name.
>
> > with values of:
> >
> >	nothing:  have network traffic send WAL as needed
> >	netflush: wait for flush of WAL network packets to slave
> >	process:  wait for slave to process WAL traffic and
> >		  optionally fsync
>
> Suggest
>	async
>	syncnet
>	syncdisk

I think the first two are fine, but 'syncdisk' might be wrong if the slave has 'synchronous_commit = off'. Any ideas?

> > The 'process' option either waits for fsync on the slave or not
> > depending on how the slave is configured, meaning you could use
> > synchronous_commit off to not wait for the fsync to disk.
>
> Hmmm, not sure that flicking a switch on the standby should cause a loss
> of performance on the master. That will be an accident waiting to
> happen. Best to make all things that affect the performance/robustness
> of the master be configuration options on the master side.

Well, we are not removing synchronous_commit, so a slave certainly could be configured to be async, and I don't see that making the master set this makes any sense, unless you want the master to somehow override it. So, I think there are three options: async, syncnet, and perhaps 'commit', meaning the slave committed the transactions. That can be a sync slave or an async slave, based on the configuration of the slave.

-- 
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-08-12 at 12:54 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Tue, 2008-08-12 at 11:51 -0400, Bruce Momjian wrote:
> > > I think you need to make it an enumerated type like log_min_messages;
> > > something like:
> > >
> > > wal_transfer_wait
> >
> > Yeh, that way sounds best, and I like the name.
> >
> > > with values of:
> > >
> > >	nothing:  have network traffic send WAL as needed
> > >	netflush: wait for flush of WAL network packets to slave
> > >	process:  wait for slave to process WAL traffic and
> > >		  optionally fsync
> >
> > Suggest
> >	async
> >	syncnet
> >	syncdisk
>
> I think the first two are fine, but 'syncdisk' might be wrong if the slave
> has 'synchronous_commit = off'. Any ideas?

Yes, synchronous_commit can be set in the postgresql.conf, but its great advantage is that it is a userset parameter.

The main point of the post is that the parameter would be transaction controlled, so it *must* be set in the transaction and thus *must* be set on the master. Otherwise the capability is not available in the way I am describing.

synchronous_commit applies to transaction commits. The code path would be completely different here, so having the parameter passed as an info byte from the master will not cause code structure problems or performance problems.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> > > > with values of:
> > > >
> > > >	nothing:  have network traffic send WAL as needed
> > > >	netflush: wait for flush of WAL network packets to slave
> > > >	process:  wait for slave to process WAL traffic and
> > > >		  optionally fsync
> > >
> > > Suggest
> > >	async
> > >	syncnet
> > >	syncdisk
> >
> > I think the first two are fine, but 'syncdisk' might be wrong if the
> > slave has 'synchronous_commit = off'. Any ideas?
>
> Yes, synchronous_commit can be set in the postgresql.conf, but its great
> advantage is that it is a userset parameter.
>
> The main point of the post is that the parameter would be transaction
> controlled, so it *must* be set in the transaction and thus *must* be set
> on the master. Otherwise the capability is not available in the way I am
> describing.

Oh, so synchronous_commit would not control WAL sync on the slave? What about our fsync parameter? Because the slave is read-only, I saw no disadvantage in setting synchronous_commit to off in postgresql.conf on the slave.

> synchronous_commit applies to transaction commits. The code path would
> be completely different here, so having the parameter passed as an info
> byte from the master will not cause code structure problems or
> performance problems.

OK, I was just trying to simplify it. The big problem with an async slave is that not only would you have lost data in a failover, but the database might be inconsistent, as with fsync = off, which is something I think we want to try to avoid; that is why I was suggesting synchronous_commit = off.

Or were you thinking of always doing fsync on the slave, no matter what? I am worried the slave might not be able to keep up (being single-threaded) and therefore we should allow a way to async commit on the slave. Certainly if the master is async sending the data, there is no need to do a synchronous_commit on the slave.
--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2008-08-12 at 11:52 -0400, Bruce Momjian wrote:
> > What is the attraction of logical application of the WAL logs?
> > Transmitting to a server with different architecture?

> Yes,
> * different release
> * different encoding
> * different CPU architecture
> * (with the correct transform) a different DBMS

The notion that the WAL logs will ever be portable across such
differences is so ... so ... well, it's barely worth laughing at.

			regards, tom lane
On Tue, 2008-08-12 at 15:40 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2008-08-12 at 11:52 -0400, Bruce Momjian wrote:
> > > What is the attraction of logical application of the WAL logs?
> > > Transmitting to a server with different architecture?
> >
> > Yes,
> > * different release
> > * different encoding
> > * different CPU architecture
> > * (with the correct transform) a different DBMS
>
> The notion that the WAL logs will ever be portable across such
> differences is so ... so ... well, it's barely worth laughing at.

I expect to begin discussion of how that might be achieved in a few days.
I understand the starting place for your thinking, but am not deterred
because I see some solutions. I feel certain you will point out some
difficulties, but I think it is worth doing. Please let's start those
discussions from a neutral point. If you find a real showstopper, then so
be it.

Sorry for the delay. Bruce was asking about why I might want that. If we
can at least agree there is a use case then it helps.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi Tom,

Could you expand on why logical application of WAL records is impractical
in these cases? This is what Oracle does. Moreover, once you are into SQL
a lot of other use cases immediately become practical, such as large
scale master/slave set-ups for read scaling.

Thanks, Robert

On 8/12/08 12:40 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2008-08-12 at 11:52 -0400, Bruce Momjian wrote:
> > > What is the attraction of logical application of the WAL logs?
> > > Transmitting to a server with different architecture?
> >
> > Yes,
> > * different release
> > * different encoding
> > * different CPU architecture
> > * (with the correct transform) a different DBMS
>
> The notion that the WAL logs will ever be portable across such
> differences is so ... so ... well, it's barely worth laughing at.
>
> 			regards, tom lane

--
Robert Hodges, CTO, Continuent, Inc.
Email: robert.hodges@continuent.com
Mobile: +1-510-501-3728  Skype: hodgesrm
Hi,

Robert Hodges wrote:
> Could you expand on why logical application of WAL records is
> impractical in these cases? This is what Oracle does. Moreover once you
> are into SQL a lot of other use cases immediately become practical,
> such as large scale master/slave set-ups for read scaling.

I cannot speak for Tom, but what strikes me as a strange approach here is
using the WAL for "logical application" of changes. That's because the
WAL is quite far away from SQL, and thus from a "logical representation"
of the data. It's rather pretty physical, meaning it's bound to a certain
Postgres release and CPU architecture.

A more "logical" exchange format certainly poses fewer problems across
releases, encodings and CPU architectures. Or even across RDBMSen. But
hey, let's see what Simon comes up with...

Regards

Markus Wanner
Markus Wanner <markus@bluegap.ch> writes:
> Robert Hodges wrote:
> > Could you expand on why logical application of WAL records is
> > impractical in these cases? This is what Oracle does. Moreover once
> > you are into SQL a lot of other use cases immediately become
> > practical, such as large scale master/slave set-ups for read scaling.

> I cannot speak for Tom, but what strikes me as a strange approach here
> is using the WAL for "logical application" of changes. That's because
> the WAL is quite far away from SQL, and thus from a "logical
> representation" of the data. It's rather pretty physical, meaning it's
> bound to a certain Postgres release and CPU architecture.

Right. To take just one example: the value of MAXALIGN affects not only
how many tuples you can put on a heap page (thus changing TIDs of tuples,
which fundamentally breaks most of the current types of WAL records) but
how many tuples you can put on an index page (and thus index page split
decisions, and thereby pretty much every single fact about the contents
of upper btree levels). We need not go into architecture dependencies
that are stronger than that one, though there are many.

As for version independence, who thinks they can WAL-replicate changes of
the system catalogs into a different version with significantly different
system catalogs? You couldn't even begin to make this work with anything
approaching the current level of semantic detail of WAL entries.

What I think Simon was actually driving at was query-shipping, which is
not my idea of "WAL" at all. It has some usefulness, but also a bunch of
downsides of its very own, mostly centered around reproducibility. With
the current WAL design I have some faith that the slaves reproduce the
contents of the master. With any "logical replication" design that
becomes a faith of the religious kind, because it sure isn't provable.

			regards, tom lane
Tom Lane wrote:
> What I think Simon was actually driving at was query-shipping, which is
> not my idea of "WAL" at all. It has some usefulness, but also a bunch
> of downsides of its very own, mostly centered around reproducibility.
> With the current WAL design I have some faith that the slaves reproduce
> the contents of the master. With any "logical replication" design that
> becomes a faith of the religious kind, because it sure isn't provable.

Actually I think the idea here is to take certain WAL records, translate
them into "portable" constructs, ship them, and let the slave handle the
remaining tasks that need to be done with it. For example you would only
ship heap insert, not index insert; the slave needs to take this insert
and derive the appropriate index operations that the slave needs.

What's needed here is a process that is able to read (or receive from the
backend calling XLogInsert) WAL records and at the same time have catalog
access for getting type I/O data, converting relation OIDs into table
names if needed, etc.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
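The translation step Alvaro describes, where a catalog-aware process turns a heap-insert WAL record into a portable record and the slave maintains its own indexes, might look roughly like this (every structure here is hypothetical, for illustration only):

```python
# Illustrative sketch: a catalog-aware reader translates a heap-insert
# WAL record into a portable "logical record"; the slave performs an
# ordinary insert, so index entries are derived locally and index WAL
# never needs to be shipped.  All names are invented for this sketch.

def to_logical(wal_record, relnames, decoders):
    """Translate a heap-insert WAL record using catalog lookups."""
    oid = wal_record["reloid"]
    return {
        "op": "insert",
        "table": relnames[oid],                     # OID -> table name
        "row": decoders[oid](wal_record["tuple"]),  # type output funcs
    }

def apply_logical(rec, slave_tables, slave_indexes):
    """The slave inserts the row and maintains its indexes itself."""
    slave_tables[rec["table"]].append(rec["row"])
    for col, index in slave_indexes.get(rec["table"], []):
        index[rec["row"][col]] = rec["row"]

# Example: ship one insert derived from a master-side WAL record.
relnames = {16384: "accounts"}
decoders = {16384: lambda raw: {"id": raw[0], "balance": raw[1]}}
rec = to_logical({"reloid": 16384, "tuple": (1, 100)}, relnames, decoders)

tables = {"accounts": []}
indexes = {"accounts": [("id", {})]}
apply_logical(rec, tables, indexes)
```

Note that this sketch sidesteps Tom's objection only partially: the slave still evaluates index expressions itself, which is exactly the reproducibility concern he raises in the following message.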
Hi Tom,

Part of this is semantics; I like Simon's logical vs. physical
terminology because it distinguishes neatly between replication that
copies implementation down to OIDs etc. and replication that copies data
content including schema changes but not implementation. It seems a noble
goal to get both to work well, as they are quite complementary.

There are various ways to get information to recapitulate SQL, but
piggy-backing off WAL record generation has a lot of advantages. You at
least have the data structures and don't have to reverse-engineer log
information on disk. Of the multiple ways to build capable logical
replication solutions, this seems to involve the least effort.

My company is currently heads down building a solution for Oracle based
on reading REDO log files. It requires a master of Oracle dark arts to
decode them and is also purely asynchronous. PostgreSQL will eventually
be far better as these discussions boil down into designs.

Thanks, Robert

On 8/12/08 8:51 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Markus Wanner <markus@bluegap.ch> writes:
> > I cannot speak for Tom, but what strikes me as a strange approach
> > here is using the WAL for "logical application" of changes. That's
> > because the WAL is quite far away from SQL, and thus from a "logical
> > representation" of the data. It's rather pretty physical, meaning
> > it's bound to a certain Postgres release and CPU architecture.
>
> Right. To take just one example: the value of MAXALIGN affects not only
> how many tuples you can put on a heap page (thus changing TIDs of
> tuples, which fundamentally breaks most of the current types of WAL
> records) but how many tuples you can put on an index page (and thus
> index page split decisions, and thereby pretty much every single fact
> about the contents of upper btree levels). We need not go into
> architecture dependencies that are stronger than that one, though there
> are many.
>
> As for version independence, who thinks they can WAL-replicate changes
> of the system catalogs into a different version with significantly
> different system catalogs? You couldn't even begin to make this work
> with anything approaching the current level of semantic detail of WAL
> entries.
>
> What I think Simon was actually driving at was query-shipping, which is
> not my idea of "WAL" at all. It has some usefulness, but also a bunch
> of downsides of its very own, mostly centered around reproducibility.
> With the current WAL design I have some faith that the slaves reproduce
> the contents of the master. With any "logical replication" design that
> becomes a faith of the religious kind, because it sure isn't provable.
>
> 			regards, tom lane

--
Robert Hodges, CTO, Continuent, Inc.
Email: robert.hodges@continuent.com
Mobile: +1-510-501-3728  Skype: hodgesrm
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
> > What I think Simon was actually driving at was query-shipping, which
> > is not my idea of "WAL" at all. It has some usefulness, but also a
> > bunch of downsides of its very own, mostly centered around
> > reproducibility.

> Actually I think the idea here is to take certain WAL records,
> translate them into "portable" constructs, ship them, and let the slave
> handle the remaining tasks that need to be done with it. For example
> you would only ship heap insert, not index insert; the slave needs to
> take this insert and derive the appropriate index operations that the
> slave needs.

Oooh, so we'll run user-defined index functions during WAL replay.
Yessir, *that* will be reliable and reproducible.

In any case, you didn't answer the point about heap TIDs not matching
across architectures. That seems at minimum to require that UPDATE and
DELETE identify target tuples by primary key instead of TID. Which
requires for starters that all your tables *have* a primary key, and for
seconds that the replay environment be capable of identifying the pkey
and being able to do lookup operations using it.

Whether or not you'd like to mandate that every table have a pkey, that
certainly takes out any notion that a "WAL" of this species can be used
for crash recovery, because there is 0 chance that it will work reliably
in a crash-corrupted database.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Actually I think the idea here is to take certain WAL records,
> > translate them into "portable" constructs, ship them, and let the
> > slave handle the remaining tasks that need to be done with it. For
> > example you would only ship heap insert, not index insert; the slave
> > needs to take this insert and derive the appropriate index operations
> > that the slave needs.
>
> Oooh, so we'll run user-defined index functions during WAL replay.
> Yessir, *that* will be reliable and reproducible.

Hmm, I don't know what Simon was thinking, but I think it would be
acceptable to both parties to have WAL stay as it currently is (and used
for crash recovery), and only the logical record (derived from this WAL
record) be sent to the slave.

The reason for not having the "logical record" be shipped straight away
at the time of the operation is that this system would need to resume
sending logical records if the shipping system crashed (i.e. some
committed transactions are in WAL but have not been shipped yet.)

> In any case, you didn't answer the point about heap TIDs not matching
> across architectures. That seems at minimum to require that UPDATE
> and DELETE identify target tuples by primary key instead of TID.

Yep -- the PK would be required. Alternatively, one could accept tables
that have no PKs but for which no UPDATEs or DELETEs are allowed on the
shipping system. (This could be useful for log-type tables that you want
to replicate.)

> Which requires for starters that all your tables *have* a primary
> key, and for seconds that the replay environment be capable of
> identifying the pkey and being able to do lookup operations using it.

Possibly the requirement would be that the replay system would have the
same PK as the shipping system.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
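The PK-based replay that Tom and Alvaro converge on, identifying target tuples by primary key rather than TID, amounts to logic along these lines (hypothetical structures, for illustration only):

```python
# Sketch of PK-based replay: UPDATE and DELETE records carry the primary
# key instead of a TID, so the slave looks rows up by key.  The table is
# modeled as a dict keyed by primary key.  Not an actual implementation.

def replay(change, table):
    """Apply one logical change to a table keyed by primary key."""
    if change["op"] == "insert":
        table[change["pk"]] = change["new"]
    elif change["op"] == "update":
        table[change["pk"]].update(change["new"])
    elif change["op"] == "delete":
        del table[change["pk"]]

table = {1: {"id": 1, "val": "a"}, 2: {"id": 2, "val": "b"}}
replay({"op": "update", "pk": 1, "new": {"val": "a2"}}, table)
replay({"op": "delete", "pk": 2}, table)
```

This also makes concrete why tables without a PK can only be replicated if they are insert-only on the shipping side: without a key, the `update` and `delete` branches have no way to find their target row.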
Hi,

Robert Hodges wrote:
> I like Simon's logical vs. physical terminology

So far, it seems to mainly have caused confusion (physical replication,
but logical application? logical replication using WAL shipping?). At
least I still prefer the more meaningful and descriptive terms, like
"log shipping", "statement based replication" or "row based replication".

But maybe what Simon is about to propose just doesn't fit into any of
those categories. I have a similar problem with Postgres-R, which is
somewhere in between synchronous and asynchronous.

Regards

Markus Wanner
Hi,

Alvaro Herrera wrote:
> Actually I think the idea here is to take certain WAL records,
> translate them into "portable" constructs, ship them,

At which point it clearly shouldn't be called a WAL shipping method. What
would it have to do with the WAL at all, then? Why translate from WAL
records at all? Better to use the real tuples right away. (Almost
needless to say here, but obviously Postgres-R does it that way.)

So far, Simon really seems to mean WAL shipping: "it allows WAL to be
used as the replication transport", see [1].

Regards

Markus Wanner

[1]: mail to -hackers from Simon, Subject: "Plans for 8.4":
http://archives.postgresql.org/pgsql-hackers/2008-07/msg01010.php
On Tue, 2008-08-12 at 13:33 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > > > Suggest
> > > >   async
> > > >   syncnet
> > > >   syncdisk
> > >
> > > I think the first two are fine, but 'syncdisk' might be wrong if
> > > the slave has 'synchronous_commit = off'. Any ideas?
> >
> > Yes, synchronous_commit can be set in the postgresql.conf, but its
> > great advantage is it is a userset parameter.
> >
> > The main point of the post is that the parameter would be transaction
> > controlled, so *must* be set in the transaction and thus *must* be
> > set on the master. Otherwise the capability is not available in the
> > way I am describing.
>
> Oh, so synchronous_commit would not control WAL sync on the slave? What
> about our fsync parameter? Because the slave is read-only, I saw no
> disadvantage of setting synchronous_commit to off in postgresql.conf on
> the slave.

The setting of synchronous_commit will be important if the standby
becomes the primary. I can see many cases where we might want "syncnet"
mode (i.e. no fsync of WAL data to disk on standby) and yet want
synchronous_commit = on when it becomes primary.

So if we were to use the same parameters it would be confusing.

> > synchronous_commit applies to transaction commits. The code path
> > would be completely different here, so having parameter passed as an
> > info byte from master will not cause code structure problems or
> > performance problems.
>
> OK, I was just trying to simplify it.

I understood why you've had those thoughts and commend the lateral
thinking. I just don't think that on this occasion we've discovered any
better ways of doing it.

> The big problem with an async slave is that not only would you have
> lost data in a failover, but the database might be inconsistent, like
> fsync = off, which is something I think we want to try to avoid, which
> is why I was suggesting synchronous_commit = off.
>
> Or were you thinking of always doing fsync on the slave, no matter
> what. I am worried the slave might not be able to keep up (being
> single-threaded) and therefore we should allow a way to async commit on
> the slave.

Bit confused here. I've not said I want always async, neither have I said
I want always sync.

The main thing is we agree there will be 3 settings, including two
variants of synchronous replication: one fairly safe and one ultra safe.

For the ultra safe mode we really need to see how synch replication will
work before we comment on where we might introduce fsyncs. I'm presuming
that incoming WAL will be written to WAL files (and optionally fsynced).
You might be talking about applying WAL records to the database and then
fsyncing them, but we do need to allow for crash recovery of the standby
server, so the data must be synced to WAL files before it is synced to
the database.

> Certainly if the master is async sending the data, there is no need to
> do a synchronous_commit on the slave.

Agreed

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
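The ordering Simon insists on for a crash-safe standby, WAL durably written before it is applied to the database, can be shown in miniature (hypothetical code, purely illustrative):

```python
# Miniature of the standby ordering described above: an incoming WAL
# record is written (and optionally fsynced) to a WAL file *before* it
# is applied to the database, so the standby can crash-recover from its
# own WAL.  The event log just records the order of operations.

def standby_receive(record, events, fsync_wal=True):
    events.append(("wal_write", record))     # 1. durably store WAL first
    if fsync_wal:
        events.append(("wal_fsync", record)) # ultra-safe mode only
    events.append(("db_apply", record))      # 2. only then apply

events = []
standby_receive("commit-1", events)
```

Whether the `wal_fsync` step happens per record is exactly the open question in this exchange: Bruce is talking about fsyncing the WAL as each entry arrives from the master, while the fairly-safe ("syncnet") mode would skip it.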
Hi,

Robert Hodges wrote:
> Part of this is semantics; I like Simon's logical vs. physical
> terminology because it distinguishes neatly between replication that
> copies implementation down to OIDs etc. and replication that copies
> data content including schema changes but not implementation.

So far, these terms have mostly caused confusion for me: "logical
replication using WAL shipping", "physical replication, but logical
application"...

As Simon didn't explain in more detail what he has in mind, we all have
our own and quite different interpretations. These terms obviously
haven't helped to clarify the issue until now.

> It seems a noble goal to get both to work well, as they are quite
> complementary.

Agreed.

> There are various ways to get information to recapitulate SQL, but
> piggy-backing off WAL record generation has a lot of advantages. You at
> least have the data structures and don't have to reverse-engineer log
> information on disk. Of the multiple ways to build capable logical
> replication solutions, this seems to involve the least effort.

We even have the real tuple, which is about the closest you can get to a
"logical representation". Using that clearly requires less effort than
converting a WAL record back to a logical tuple. For example, it allows
the optimization of sending only differences from the old tuple for
UPDATEs, instead of always sending full tuples - see Postgres-R for a
partially working implementation.

> My company is currently heads down building a solution for Oracle based
> on reading REDO log files. It requires a master of Oracle dark arts to
> decode them and is also purely asynchronous.

That sounds pretty challenging. Good luck!

Regards

Markus Wanner
On Wed, 2008-08-13 at 11:27 +0200, Markus Wanner wrote:
> Robert Hodges wrote:
> > Part of this is semantics; I like Simon's logical vs. physical
> > terminology because it distinguishes neatly between replication that
> > copies implementation down to OIDs etc. and replication that copies
> > data content including schema changes but not implementation.
>
> So far, these terms have mostly caused confusion for me: "logical
> replication using WAL shipping", "physical replication, but logical
> application"...
>
> As Simon didn't explain in more detail what he has in mind, we all
> have our own and quite different interpretations. These terms obviously
> haven't helped to clarify the issue until now.

Classification of Replication Techniques
----------------------------------------

We can consider that there are two stages to replication:

* transfer or shipping of data to second node(s)
* apply mechanism

The apply mechanism can be either logical, where we execute SQL, or
physical, where we bypass the SQL layer and do this at a lower level. It
is possible to have:

1. logical shipping, logical apply
2. physical shipping, logical apply
3. physical shipping, physical apply

Mammoth and Postgres-R are both type 1 replication systems, since they
stream data to the second node in a form that makes SQL reconstruction
easier. Slony is also type 1, using triggers. So we have another
classification:

1. Logical data shipping
   a) Trigger based
   b) Additional internals-based approaches

"Warm Standby" log shipping is type 3. We ship the WAL and apply it
directly. Easy and good.

Type 2 is where you ship the WAL (efficient), then use it to reconstruct
SQL (flexible) and then apply that to other nodes. It is somewhat harder
than type 1, but requires less infrastructure (IMHO). It definitely
requires less data shipping from the primary node, so is very possibly
more efficient. [Tom is absolutely right to say this is "impossible".
Currently, it is, but that's why we develop.]
Type 1a can't be synchronous by definition, but 1b and other approaches
can be. Otherwise the above definitions are not relevant to sync/async
capability.

Note that DRBD is also a type 3 system, but uses filesystem-level
physical data. Since this is outside the control of Postgres, I discount
this approach because it cannot deliver transaction-controlled synch
replication.

Previously, most RDBMS vendors supported type 1a systems. They have now
moved to type 2 and 3 systems. Both DB2 and Oracle support a type 2 *and*
a type 3 replication system. The reasons they do this are valid for us
also, so I suggest that we do the same. So for me, it is not about
whether we do type 2 or type 3; I think we should do both.

Doing both may not be possible in this release: I have one potential
sponsor for type 2, and have made a public call for work on query access
to type 3 systems. I recognise that some people that like type 3 do not
like type 2, and possibly vice versa. If (not when) I do work on type 2
systems it will be only to provide that as a "transport" option for other
technologies. Not as a full-blown replication system.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
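Simon's taxonomy condenses into a small lookup, pairing each type with the example systems named in this thread (the labels are editorial shorthand, not project terminology):

```python
# The three replication types above as (shipping, apply) pairs, with the
# example systems mentioned in this thread as comments.

REPLICATION_TYPES = {
    1: ("logical shipping", "logical apply"),    # Slony, Mammoth, Postgres-R
    2: ("physical shipping", "logical apply"),   # reconstruct SQL from WAL
    3: ("physical shipping", "physical apply"),  # warm standby, DRBD
}
```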
Hi,

Simon Riggs wrote:
> Classification of Replication Techniques

Thanks for your classifications. It helps a great deal to clarify.

> Type 2 is where you ship the WAL (efficient) then use it to reconstruct
> SQL (flexible) and then apply that to other nodes. It is somewhat
> harder than type 1, but requires less infrastructure (IMHO). Definitely
> requires less data shipping from Primary node, so very possibly more
> efficient.

What leads you to that conclusion? AFAICT a logical format, specifically
designed for replication, is quite certainly more compact than the WAL
(assuming that's what you mean by "less data"). The only efficiency gain
I can see compared to type 1 is that most of the processing work is
offloaded from the master to the slave(s). For setups with multiple
slaves, that's a bad trade-off, IMO.

> Previously, most RDBMS vendors supported type 1a) systems. They have
> now moved to type 2 and 3 systems. Both DB2 and Oracle support a type 2
> *and* a type 3 replication system. The reasons they do this are valid
> for us also, so I suggest that we do the same. So for me, it is not
> about whether we do type 2 or type 3, I think we should do both.

I currently don't think type 2 is doable with any reasonable effort, but
hey, I'm always open for surprises. :-)

Which of IBM's and Oracle's products are you referring to?

Regards

Markus Wanner
On Wed, 2008-08-13 at 15:38 +0200, Markus Wanner wrote:
> Simon Riggs wrote:
> > Classification of Replication Techniques
>
> Thanks for your classifications. It helps a great deal to clarify.
>
> > Type 2 is where you ship the WAL (efficient) then use it to
> > reconstruct SQL (flexible) and then apply that to other nodes. It is
> > somewhat harder than type 1, but requires less infrastructure (IMHO).
> > Definitely requires less data shipping from Primary node, so very
> > possibly more efficient.
>
> What leads you to that conclusion? AFAICT a logical format,
> specifically designed for replication is quite certainly more compact
> than the WAL (assuming that's what you mean by "less data").

Possibly, but since we are generating and writing WAL anyway that's not a
completely fair comparison.

> Which of IBM's and Oracle's products are you referring to?

IBM DB2 HADR, Q Replication.
Oracle Streams 10g+, Data Guard Logical and Physical Standby.

All of which I've personally used, except for Oracle Streams 10g, which I
investigated thoroughly for a client about 4 years ago.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Wed, 2008-08-13 at 15:38 +0200, Markus Wanner wrote:
> > Which of IBM's and Oracle's products are you referring to?
>
> IBM DB2 HADR, QReplication.
> Oracle Streams 10g+, Data Guard Logical and Physical Standby
>
> All of which I've personally used, except for Oracle Streams 10g, which
> I investigated thoroughly for a client about 4 years ago.

I think doing the WAL streaming and allowing a read-only slave is enough
work to keep Simon busy for quite some time. I don't understand why the
logical issue is being discussed at this stage --- let's get the other
stuff done first.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs wrote:
> > > The main point of the post is that the parameter would be
> > > transaction controlled, so *must* be set in the transaction and
> > > thus *must* be set on the master. Otherwise the capability is not
> > > available in the way I am describing.
> >
> > Oh, so synchronous_commit would not control WAL sync on the slave?
> > What about our fsync parameter? Because the slave is read-only, I saw
> > no disadvantage of setting synchronous_commit to off in
> > postgresql.conf on the slave.
>
> The setting of synchronous_commit will be important if the standby
> becomes the primary. I can see many cases where we might want "syncnet"
> mode (i.e. no fsync of WAL data to disk on standby) and yet want
> synchronous_commit = on when it becomes primary.
>
> So if we were to use same parameters it would be confusing.

I disagree. If they make it the master they change the setting. Also, if
we choose an async slave I would expect the consistency of
synchronous_commit rather than fsync = off's database inconsistency.

> > The big problem with an async slave is that not only would you have
> > lost data in a failover, but the database might be inconsistent, like
> > fsync = off, which is something I think we want to try to avoid,
> > which is why I was suggesting synchronous_commit = off.
> >
> > Or were you thinking of always doing fsync on the slave, no matter
> > what. I am worried the slave might not be able to keep up (being
> > single-threaded) and therefore we should allow a way to async commit
> > on the slave.
>
> Bit confused here. I've not said I want always async, neither have I
> said I want always sync.
>
> The main thing is we agree there will be 3 settings, including two
> variants of synchronous replication, one fairly safe and one ultra
> safe.
>
> For the ultra safe mode we really need to see how synch replication
> will work before we comment on where we might introduce fsyncs. I'm
> presuming that incoming WAL will be written to WAL files (and
> optionally fsynced). You might be talking about applying WAL records to
> the database and then fsyncing them, but we do need to allow for crash
> recovery of the standby server, so the data must be synced to WAL files
> before it is synced to the database.

I was talking about fsync'ing the WAL when each entry arrives from the
master.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Wed, 2008-08-13 at 11:17 -0400, Bruce Momjian wrote:
> I think doing the WAL streaming and allowing a read-only slave is
> enough work to keep Simon busy for quite some time. I don't understand
> why the logical issue is being discussed at this stage --- let's get
> the other stuff done first.

I'm not working on WAL streaming. My plans assume someone else is, and
I'll be as worried as you if the current answer is "nobody".

I expect to be involved in reviewing it fairly closely though...

Other than that, my time is limited only by how much funding I have. Or
put another way, the constraining factor is money, not available time.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-08-13 at 11:27 -0400, Bruce Momjian wrote:
> I disagree. If they make it the master they change the setting.

It's not acceptable to force people to edit a configuration file when
failover occurs. Some people wish to automate this, and fumbling
parameter values at this important time is definitely inappropriate. We
gain nothing by asking people to do things that way.

Plus, as I have said: if it is controlled on the standby then it will not
be transaction-controlled, and transaction control is a useful thing.

I asked myself "where would I like it to be set?" The answer was "on the
master". If you think differently, please say why. Yes, we can set it on
the standby, but I see no reason to do so.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Wed, 2008-08-13 at 11:17 -0400, Bruce Momjian wrote:
> > I think doing the WAL streaming and allowing a read-only slave is
> > enough work to keep Simon busy for quite some time. I don't
> > understand why the logical issue is being discussed at this stage ---
> > let's get the other stuff done first.
>
> I'm not working on WAL streaming. My plans assume someone else is and
> I'll be as worried as you if the current answer is "nobody".

Oh, OK, good. I was unclear on that. NTT is working with EnterpriseDB on
the WAL steaming code to be released to the community.

> I expect to be involved in reviewing it fairly closely though...

Yes, of course.

> Other than that, my time is limited only by how much funding I have. Or
> put another way, the constraining factor is money, not available time.

That is good, I think. ;-)

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs wrote:
> On Wed, 2008-08-13 at 11:27 -0400, Bruce Momjian wrote:
> > I disagree. If they make it the master they change the setting.
>
> It's not acceptable to force people to edit a configuration file when
> failover occurs. Some people wish to automate this and fumbling
> parameter values at this important time is definitely inappropriate. We
> gain nothing by asking people do do things that way.
>
> Plus, as I have said: if it is controlled on the Standby then it will
> not be transaction-controlled and this will be a useful thing.
>
> I asked myself "where would I like to be set?" The answer was "on the
> master". If you think differently, please say why. Yes, we can set it
> on the standby, but I see no reason to do so.

Ah, I can see an advantage for the failover case to have the slave WAL
sync controlled from the master. My original goal was just to reduce GUC
option bloat. Let's get farther down the road on this and see how it
looks.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Wed, 2008-08-13 at 12:55 -0400, Bruce Momjian wrote:
> NTT is working with EnterpriseDB on the WAL steaming code to be
> released to the community.

Hopefully the code isn't steaming... :-)

and that we will be able to see detailed designs and code soon. Might end
up as a big pileup otherwise.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support