Re: Replication identifiers, take 4 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Replication identifiers, take 4
Date
Msg-id 20150216002155.GI15326@awork2.anarazel.de
Whole thread Raw
In response to Replication identifiers, take 3  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Replication identifiers, take 4  (Andres Freund <andres@2ndquadrant.com>)
Re: Replication identifiers, take 4  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Replication identifiers, take 4  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Hi,

Here's my next attept attempt at producing something we can agree
upon.

The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.

Other changes:

* Locking preventing several backends to replay changes at the same time. This is actually overly restrictive in some
cases,but I think good enough for now.
 
* Logical decoding grew a filter_by_origin callback that allows to ignore changes that were replayed on a remote
system.Such filters are executed before much is done with records, potentially saving a fair bit of costs.
 
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication identifier).

I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:

Think about a logical replication solution trying to replay changes. The
postmaster in which the data is replayed into crashes every now and
then. Replication identifiers allow you to do something like:

do_replication()
{   source = ConnectToSourceSystem('mysource');   target = ConnectToSourceSystem('target');
   # mark we're replayin   target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);
#get how far we've replayed last time round   remote_lsn = target.exec($$SELECT remote_lsn FROM
pg_get_replication_identifier_progressWHERE external_id = 'myrep_mysource');
 
   # and now replay changes   copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);
   while (record = copystream.get_record())   {      if (record.type = 'begin')      {
target.exec('BEGIN');            # setup the position of this individual xact             target.exec('SELECT
pg_replication_identifier_setup_tx_origin($1,$2);',                         record.origin_lsn,
record.origin_commit_timestamp);     }      else if (record.type = 'change')             target.exec(record.change_sql)
    else if (record.type = 'commit')             target.exec('COMMIT');   }
 
}

A non pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes the
replication identifier logic will recover how far we replayed during
crash recovery. If the source system crashes/disconnects we'll have the
current value in memory. Note that this works perfectly well if the
target system (and obviously the source system, but that's obvious) use
synchronous_commit = off - we'll not miss any changes.

Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.

The SQL interface consists out of:
# manage existance of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pgreplication_identifier_progress;

# replicatation identifiers
SELECT * FROM pg_replication_identifier;


Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a
SQL reimplementation of replication identifiers and that has proven that
for busier workloads doing a table update to store the replication
progress indeed has a noticeable overhead. Especially if there's some
longer running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the  padding is fine. Now that I have proven that
it'spretty simple to  store additional information the argument that it might be needed for  something else doesn't
reallyhold anymore. But I can live with the  other solution as well - 3 bytes additional overhead ain't so bad.
 

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do  we want to store the origin only on relevant
records?That'd be  XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE //
XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED.I'm thinking of something  like XLogLogOriginIfAvailable() before the
emittinglog  XLogInsert()s.
 

3) There should be a lwlock for the individual replication identifier  progress slots.

4) Right now identifier progress is stored during checkpoints in special  files - maybe it'd be better to store them
insidethe checkpoint  record somehow. We read that even after a clean shutdown, so that  should be fine.
 

5) I'm think there are issues with a streaming replication standby if  many identifiers are created/dropped. Those
shouldn'tbe too hard to  fix.
 

6) Obviously the hack in bootstrap.c to get riname marked NOT NULL isn't  acceptable. Either I need to implement
boostrapsupport for marking  varlenas NOT NULL as discussed nearby or replace the syscache lookup  with a index
lookup.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Andrew Gierth
Date:
Subject: Re: Really bad blowups with hash outer join and nulls
Next
From: Andres Freund
Date:
Subject: Re: Replication identifiers, take 4