Thread: Logical replication and multimaster

Logical replication and multimaster

From
Konstantin Knizhnik
Date:
Hello all,

We have implemented an ACID multimaster based on logical replication and 
our DTM (distributed transaction manager) plugin.
The good news is that it works and no inconsistency is detected.
But unfortunately it is very, very slow...

On standalone PostgreSQL I am able to achieve about 30000 TPS with 10 
clients performing simple debit-credit transactions.
And with a multimaster consisting of three nodes spawned on the same 
system I got about 100 (one hundred) TPS.
There are two main reasons for such awful performance:

1. Logical replication serializes all transactions: there is a single 
connection between the wal-sender and the receiver BGW.
2. 2PC synchronizes transaction commit at all nodes.

Neither of these two reasons is a show stopper by itself.
If we remove the DTM and do asynchronous logical replication, then the 
performance of the multimaster increases to 6000 TPS
(please notice that in this test all multimaster nodes are spawned on the 
same system, sharing its resources,
so 6k is not a bad result compared with 30k on a standalone system).
And according to 2ndQuadrant results, BDR performance is very close to 
hot standby.

On the other hand, our previous experiments with the DTM showed only about 
a 2x slowdown compared with vanilla PostgreSQL.
But the result of combining the DTM and logical replication is frustrating.

I wonder whether this is a principal limitation of the logical replication 
approach, which is efficient only for asynchronous replication, or whether 
it can be somehow tuned/extended to efficiently support synchronous 
replication?

We have also considered alternative approaches:
1. Statement based replication.
2. Trigger-based replication.
3. Replication using custom nodes.

In the case of statement based replication it is hard to guarantee identity 
of data at different nodes.
Approaches 2 and 3 are much harder to implement and require 
"reinventing" a substantial part of logical replication.
They also require some kind of connection pool which can be used to send 
replicated transactions to the peer nodes (to avoid serialization of 
parallel transactions, as in the case of logical replication).

But it looks like there is not much sense in having multiple network 
connections between one pair of nodes.
It seems better to have one connection between nodes, but provide 
parallel execution of received transactions at the destination side. But 
that also seems nontrivial. We now have in PostgreSQL some 
infrastructure for background workers, but there is still no abstraction 
of a worker pool and job queue which could provide a simple way to organize 
parallel execution of some jobs. I wonder if somebody is working on this 
now, or should we try to propose our solution?

Best regards,
Konstantin





Re: Logical replication and multimaster

From
Robert Haas
Date:
On Mon, Nov 30, 2015 at 11:20 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> We have implemented ACID multimaster based on logical replication and our
> DTM (distributed transaction manager) plugin.
> Good news is that it works and no inconsistency is detected.
> But unfortunately it is very very slow...
>
> At standalone PostgreSQL I am able to achieve about 30000 TPS with 10
> clients performing simple debit-credit transactions.
> And with multimaster consisting of three nodes spawned at the same system I
> got about 100 (one hundred) TPS.
> There are two main reasons of such awful performance:
>
> 1. Logical replication serializes all transactions:  there is single
> connection between wal-sender and receiver BGW.
> 2. 2PC synchronizes transaction commit at all nodes.
>
> None of these two reasons are show stoppers themselves.
> If we remove DTM and do asynchronous logical replication then performance of
> multimaster is increased to 6000 TPS
> (please notice that in this test all multimaster node are spawned at the
> same system, sharing its resources,
> so 6k is not bad result comparing with 30k at standalone system).
> And according to 2ndquadrant results, BDR performance is very close to hot
> standby.

Logical decoding only begins decoding a transaction once the
transaction is complete.  So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:

1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.

Step 3 introduces latency proportional to the amount of work the
transaction did, which could be a lot.   If you were doing synchronous
physical replication, the replay of the COMMIT record would only need
to wait for the replay of the commit record itself.  But with
synchronous logical replication, you've got to wait for the replay of
the entire transaction.  That's a major bummer, especially if replay
is single-threaded and there are a large number of backends generating
transactions.  Of course, the 2PC dance itself can also add latency -
that's most likely to be the issue if the transactions are each very
short.

What I'd suggest is trying to measure where the latency is coming
from.  You should be able to measure how much time each transaction
spends (a) executing, (b) preparing itself, (c) waiting for the replay
thread to begin replaying it, (d) waiting for the replay thread to
finish replaying it, and (e) committing.  Separating (c) and (d) might
be a little bit tricky, but I bet it's worth putting some effort in,
because the answer is probably important to understanding what sort of
change will help here.  If (c) is the problem, you might be able to
get around it by having multiple processes, though that only helps if
applying is slower than decoding.  But if (d) is the problem, then the
only solution is probably to begin applying the transaction
speculatively before it's prepared/committed.  I think.
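
A minimal sketch of such per-phase timing, using PostgreSQL's instr_time
macros (where exactly to put the probes in the multimaster code is of course
an assumption, as is the helper's name):

#include "postgres.h"
#include "portability/instr_time.h"

/* Measure and log the wall-clock time of one phase of a transaction. */
static void
time_phase(const char *phase, void (*run_phase) (void))
{
    instr_time  start,
                duration;

    INSTR_TIME_SET_CURRENT(start);
    run_phase();                /* e.g. the prepare step, or the wait for replay */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    elog(LOG, "phase %s: %.3f ms", phase, INSTR_TIME_GET_MILLISEC(duration));
}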

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication and multimaster

From
Konstantin Knizhnik
Date:
<div class="moz-cite-prefix">Thank you for reply.<br /><br /> On 12/02/2015 08:30 PM, Robert Haas wrote:<br
/></div><blockquotecite="mid:CA+TgmoY1o3G0B-21zv2Pw5iEkpR8=J42GdsUOs4m0inKka3FEA@mail.gmail.com" type="cite"><br /><pre
wrap="">
Logical decoding only begins decoding a transaction once the
transaction is complete.  So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:

1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.
</pre></blockquote><br /> Logical decoding is started after execution of XLogFlush method.<br /> So atually transaction
isnot yet completed at this moment:<br /> - it is not marked as committed in clog<br /> - It is marked as in-progress
inprocarray<br /> - locks are not released<br /><br /> We are not using PostgreSQL two-phase commit here.<br /> Instead
ofour DTM catches control in TransactionIdCommitTree and sends request to arbiter which in turn wait status of
committingtransactions on replicas.<br /> The problem is that transactions are delivered to replica through single
channel:logical replication slot.<br /> And while such transaction is waiting acknowledgement from arbiter, it is
blockingreplication channel preventing other (parallel transactions)  from been replicated and applied.<br /><br /> I
haveimplemented pool of background workers. May be it will be useful not only for me.<br /> It consists of one
produces-multipleconsumers queue implemented using buffer in shared memory, spinlock and two semaphores.<br /> API is
verysimple:<br /><br /> typedef void(*BgwPoolExecutor)(int id, void* work, size_t size);<br /> typedef
BgwPool*(*BgwPoolConstructor)(void);<br/><br /> extern void BgwPoolStart(int nWorkers, BgwPoolConstructor
constructor);<br/> extern void BgwPoolInit(BgwPool* pool, BgwPoolExecutor executor, char const* dbname, size_t
queueSize);<br/> extern void BgwPoolExecute(BgwPool* pool, void* work, size_t size);<br /><br /> You just place in this
queuesome bulk of bytes (work, size), it is placed in queue and then first available worker will dequeue it and
execute.<br/><br /> Using this pool and larger number of accounts (reducing possibility of conflict), I get better
results.<br/> So now receiver of logical replication is not executing transactions directly, instead of it receiver is
placingthem in queue and them are executed concurrent by pool of background workers.<br /><br /> At cluster with three
nodesresults of out debit-credit benchmark are the following:<br /><br /><table border="1" cellpadding="2"
cellspacing="2"height="112" width="366"><tbody><tr><td valign="top"><br /></td><td valign="top">TPS<br
/></td></tr><tr><tdvalign="top">Multimaster (ACID transactions)<br /></td><td align="right" valign="top">12500<br
/></td></tr><tr><tdvalign="top">Multimaster (async replication)<br /></td><td align="right" valign="top">34800<br
/></td></tr><tr><tdvalign="top">Standalone PostgreSQL<br /></td><td align="right" valign="top">44000<br
/></td></tr></tbody></table><br/><br /> We tested two modes: when client randomly distribute queries between cluster
nodesand when client is working only with one master nodes and other are just used as replicas. Performance is slightly
betterin the second case, but the difference is not very large (about 11000 TPS in first case).<br /><br /> Number of
workersin pool has signficant imact on performance: with 8 workers we get about 7800 TPS and with 16 workers -
12500.<br/> Also performance greatly depends on number of accounts (and so probability of lock conflicts). In case of
100accounts speed is less than 1000 TPS.<br /><br /><br /><blockquote
cite="mid:CA+TgmoY1o3G0B-21zv2Pw5iEkpR8=J42GdsUOs4m0inKka3FEA@mail.gmail.com"type="cite"><pre wrap="">
 

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 1 December 2015 at 00:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

We have implemented ACID multimaster based on logical replication and our DTM (distributed transaction manager) plugin.

What are you using for an output plugin and for replay?

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstream out too, with me helping where I can.

I'd hate to be wasting time and effort working in parallel on overlapping functionality. I did a LOT of work to make pglogical_output extensible and reusable for different needs, with hooks used heavily instead of making things specific to the pglogical downstream. A protocol documented in detail. A json output mode as an option. Parameters for clients to negotiate options. etc.

Would a different name for the upstream output plugin help?
 
And according to 2ndquadrant results, BDR performance is very close to hot standby.

Yes... but it's asynchronous multi-master. Very different to what you're doing.
 
I wonder if it is principle limitation of logical replication approach which is efficient only for asynchronous replication or it can be somehow tuned/extended to efficiently support synchronous replication?

I'm certain there are improvements to be made for synchronous replication.

We have also considered alternative approaches:
1. Statement based replication.

Just don't go there. Really.
 
It seems to be better to have one connection between nodes, but provide parallel execution of received transactions at destination side.

I agree. This is something I'd like to be able to do through logical decoding. As far as I can tell there's no fundamental barrier to doing so, though there are a few limitations when streaming logical xacts:

- We can't avoid sending transactions that get rolled back

- We can't send the commit timestamp, commit LSN, etc at BEGIN time, so last-update-wins
  conflict resolution can't be done based on commit timestamp

- When streaming, the xid must be in each message, not just in begin/commit.

- The apply process can't use the SPI to apply changes directly since we can't multiplex transactions. It'll need to use
  shmem to communicate with a pool of workers, dispatching messages to workers as they arrive. Or it can multiplex
  a set of libpq connections in async mode, which I suspect may prove to be better.

I've made provision for streaming support in the pglogical_output extension. It'll need core changes to allow logical decoding to stream changes though.

Separately, I'd also like to look at decoding and sending sequence advances, which are something that happens outside transaction boundaries.

 
We have now in PostgreSQL some infrastructure for background works, but there is still no abstraction of workers pool and job queue which can provide simple way to organize parallel execution of some jobs. I wonder if somebody is working now on it or we should try to propose our solution?

I think a worker pool would be quite useful to have.

For BDR and for pglogical we had to build an infrastructure on top of static and dynamic bgworkers. A static worker launches a dynamic bgworker for each database. The dynamic bgworker for the database looks at extension-provided user catalogs to determine whether it should launch more dynamic bgworkers for each connection to a peer node.

Because the bgworker argument is a single by-value Datum the argument passed is an index into a static shmem array of structs. The struct is populated with the target database oid (or name, for 9.4, due to bgworker API limitations) and other info needed to start the worker.

Because registered static and dynamic bgworkers get restarted by the postmaster after a crash/restart cycle, and the restarted static worker will register new dynamic workers after restart, we have to jump through some annoying hoops to avoid duplicate bgworkers. A generation counter is stored in postmaster memory and incremented on crash recovery then copied to shmem. The high bits of the Datum argument to the workers embed the generation counter. They compare their argument's counter to the one in shmem and exit if the counter differs, so the relaunched old generation of workers exits after a crash/restart cycle. See the thread on BGW_NO_RESTART_ON_CRASH for details.
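
A sketch of that packing scheme (my illustration, not the BDR code; all names here are hypothetical):

#include "postgres.h"
#include "storage/ipc.h"

#define WORKER_GENERATION_SHIFT 16

static uint16 current_generation;   /* lives in postmaster memory, bumped on crash recovery */

/* Build the Datum argument: generation in the high bits, shmem slot index in the low bits. */
static Datum
make_worker_arg(uint16 slot_index)
{
    return UInt32GetDatum(((uint32) current_generation << WORKER_GENERATION_SHIFT) | slot_index);
}

/* In the worker: exit quietly if we belong to a stale generation. */
static void
check_generation(Datum main_arg, uint16 generation_in_shmem)
{
    uint32 arg = DatumGetUInt32(main_arg);

    if ((arg >> WORKER_GENERATION_SHIFT) != generation_in_shmem)
        proc_exit(0);
}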

In pglogical we're instead using BGW_NEVER_RESTART workers and doing restarts ourselves when needed, ignoring the postmaster's ability to restart bgworkers when the worker crashes.

It's likely that most projects using bgworkers for this sort of thing will need similar functionality, so generalizing it into a worker pool API makes a lot of sense. In the process we could really use API to examine currently registered and running bgworkers. Interested in collaborating on that?

Another thing I've wanted as part of this work is a way to get a one-time authentication cookie from the server that can be passed as a libpq connection option to get a connection without having to know a password or otherwise mess with pg_hba.conf. Basically a way to say "I'm a bgworker running with superuser rights within Pg, and I want to make a libpq connection to this database. I'm inherently trusted, so don't mess with pg_hba.conf and passwords, just let me in".

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:


On 3 December 2015 at 04:18, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
The problem is that transactions are delivered to the replica through a single channel: the logical replication slot.
And while such a transaction is waiting for an acknowledgement from the arbiter, it is blocking the replication channel, preventing other (parallel) transactions from being replicated and applied.

Streaming interleaved xacts from the slot as discussed in the prior mail would help there.

You'd continue to apply concurrent work from other xacts, and just handle commit messages as they arrive, sending the confirmations back through the DTM API.
 
I have implemented pool of background workers. May be it will be useful not only for me.

Excellent.

It should be possible to make that a separate extension. You can use C functions from other extensions by exposing a single pg_proc function with 'internal' return type that populates a struct of function pointers for the API. A single DirectFunctionCall lets you get the API struct. That's how pglogical_output handles hooks. The main downside is that you can't do that without a connection to a database with the extension installed so the pg_proc entry is exposed.

So it could make more sense to just keep it as a separate .c / .h file that's copied into trees that use it. Simpler and easier, but uglier.
 
It consists of one produces-multiple consumers queue implemented using buffer in shared memory, spinlock and two semaphores.
[snip] 
You just place in this queue some bulk of bytes (work, size), it is placed in queue and then first available worker will dequeue it and execute.

Very nice.

To handle xact streaming I think you're likely to need a worker dispatch key too, where the dispatch keys are "sticky" to a given worker. So you assign xid 1231 to a worker at BEGIN. Send all work to the pool and everything with xid 1231 goes to that worker. At commit you clear the assignment of xid 1231.

Alternately a variant of the Execute method that lets you dispatch to a specific worker would do the job.
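
A sketch of that sticky-dispatch idea (illustration only: dispatch_to_worker() stands in for whatever per-worker queue the pool would expose, and locking around the assignment table is omitted):

#include "postgres.h"
#include "access/transam.h"

#define N_APPLY_WORKERS 16

/* Hypothetical per-worker queue provided by the pool. */
extern void dispatch_to_worker(int worker, void *msg, size_t len);

static TransactionId sticky_xid[N_APPLY_WORKERS];   /* InvalidTransactionId = free slot */

static int
worker_for_xid(TransactionId xid)
{
    int     i;
    int     free_slot = -1;

    for (i = 0; i < N_APPLY_WORKERS; i++)
    {
        if (TransactionIdEquals(sticky_xid[i], xid))
            return i;                   /* already assigned to this worker */
        if (!TransactionIdIsValid(sticky_xid[i]))
            free_slot = i;
    }
    /* real code would wait here if every worker is busy with another xact */
    if (free_slot < 0)
        elog(ERROR, "no free apply worker for xid %u", xid);
    sticky_xid[free_slot] = xid;
    return free_slot;
}

static void
dispatch_change(TransactionId xid, void *msg, size_t len, bool is_commit)
{
    int     worker = worker_for_xid(xid);

    dispatch_to_worker(worker, msg, len);

    if (is_commit)
        sticky_xid[worker] = InvalidTransactionId;  /* clear the assignment */
}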

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 3 December 2015 at 01:30, Robert Haas <robertmhaas@gmail.com> wrote:
 
1. Do the transaction.
2. PREPARE.
3. Replay the transaction.

As Konstantin noted they aren't using Pg's 2PC. They actually couldn't if they wanted to because logical decoding does not support decoding an xact at PREPARE TRANSACTION time, without COMMIT PREPARED.

I'd love to change that and allow decoding at PREPARE TRANSACTION time - or streaming the xact from the start, as discussed in the prior mail. This would be a huge help for doing consensus operations on an otherwise asynchronous cluster, like making table structure changes. You'd decode the prepared xact, replay it, prepare it on all nodes, then commit prepared when all nodes confirm successful prepare.

IIRC the main issue with this is that the prepared xact continues to hold locks so logical decoding can't acquire the locks it needs to decode the xact.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
konstantin knizhnik
Date:

On Dec 3, 2015, at 4:09 AM, Craig Ringer wrote:

On 1 December 2015 at 00:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

We have implemented ACID multimaster based on logical replication and our DTM (distributed transaction manager) plugin.

What are you using for an output plugin and for replay?

I have implemented an output plugin for the multimaster based on Michael's decoder_raw + receiver_raw.
Right now it decodes WAL into the corresponding SQL insert/update statements.
Certainly this is a very inefficient way, and in the future I will replace it with some binary protocol, as is used for example in BDR
(but the BDR plugin contains a lot of stuff related to detecting and handling conflicts which is not relevant for the multimaster).
But right now the performance of the multimaster is not limited by the logical replication protocol - if I remove the DTM and use asynchronous replication (a lightweight version of BDR :)
then I get 38k TPS instead of 12k.



I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.

I'd hate to be wasting time and effort working in parallel on overlapping functionality. I did a LOT of work to make pglogical_output extensible and reusable for different needs, with hooks used heavily instead of making things specific to the pglogical downstream. A protocol documented in detail. A json output mode as an option. Parameters for clients to negotiate options. etc.

Would a different name for the upstream output plugin help?


And where can I get the pglogical_output plugin? Sorry, but I can't quickly find a reference with Google...
Also I wonder whether this plugin performs DDL replication (most likely not). But then a naive question - why was DDL excluded from the logical replication protocol?
Are there some principal problems with it? In BDR it was handled in an alternative way, using an executor callback. It would be much easier if DDL could be replicated in the same way as normal SQL statements.


 

Re: Logical replication and multimaster

From
konstantin knizhnik
Date:
On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:

> Excellent.
>
> It should be possible to make that a separate extension. You can use C functions from other extensions by exposing a
> single pg_proc function with 'internal' return type that populates a struct of function pointers for the API. A single
> DirectFunctionCall lets you get the API struct. That's how pglogical_output handles hooks. The main downside is that you
> can't do that without a connection to a database with the extension installed so the pg_proc entry is exposed.


Actually, while working on the cluster and columnar storage extensions I collected several questions about the PostgreSQL infrastructure.
I have always found some workarounds, but maybe it is better to ask the community about them :)

1. Why is there no "conditional event" synchronization primitive in PostgreSQL? There is the latch, but it is implemented
using sockets and I am afraid that it is not very fast.
It would be nice to have some fast primitive like pthread condition variables.

2. PostgreSQL semaphores seem not to be intended for use outside the PostgreSQL core (for example in
extensions).
There is no way to request an additional number of semaphores. Right now semaphores are allocated based on the maximal number
of backends and spinlocks.
And a semaphore, as well as an event, is a very popular and convenient synchronization primitive required in many cases.

3. What is the right way to create a background worker requiring access to shared memory, i.e. having a control
structure in main memory?
As far as I understand, background workers have to be registered either in _PG_init or outside the postmaster environment.
If an extension requires access to shared memory, then it should be included in the shared_preload_libraries list and should
be initialized using a shmem_startup hook.
Something like this:

void _PG_init(void)
{
    if (!process_shared_preload_libraries_in_progress)
        return;
    ...
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = My_shmem_startup;
}

My_shmem_startup is needed because in _PG_init it is not possible to allocate shared memory.
So if I need to allocate some control structure for the background workers in shared memory, then I should do it in
My_shmem_startup.
But I can not register background workers in My_shmem_startup! I will get a "must be registered in
shared_preload_libraries" error:

void
RegisterBackgroundWorker(BackgroundWorker *worker)
{
    if (!process_shared_preload_libraries_in_progress)
    {
        if (!IsUnderPostmaster)
            ereport(LOG,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
                            worker->bgw_name)));
        return;
    }
}

So I have to register background workers in _PG_init while the control structure for them is not yet ready.
When I implemented the pool of background workers, I solved this problem by providing a function which returns the address of the
control structure later - when it is actually allocated.
But it seems to be some design flaw in BGW, isn't it?






Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
And where I can get  pglogical_output plugin? Sorry, but I can't quickly find reference with Google...

It's been submitted to this CF.



Any tests and comments would be greatly appreciated.

I have a version compatible with 9.4 and older in a separate tree I want to make public. I'll get back to you on that later today. It's the same code with a few more ifdefs and an uglier structure for the example hooks module (because it can be a separate contrib), so it's not that exciting.

You should be able to just "git remote add" that repo, "git fetch" and "git merge dev/pglogical-output" into your working tree.
 
Also I wonder if this plugin perform DDL replication (most likely not).

No, it doesn't. The way it's done in BDR is too intrusive and has to be reworked before it can be made more generally re-usable.

How I envision DDL replication working for pglogical (or anything else) is to take the DDL hooks added in 9.5 and use them with a separate DDL deparse extension based on Álvaro's deparse work. If you want to replicate DDL you make sure this extension is loaded then use it from your event triggers to capture DDL in a useful form and write it to a queue table where your downstream client can find it and consume it. That way the deparse code doesn't have to be embedded in the Pg backend like it is in BDR, and instead can be a reusable extension.

But then naive question - why DDL was excluded from logical replication protocol?

logical decoding can't handle DDL because all it sees is the effects of that DDL in the xlog as a series of changes to catalog tables, relfilenode changes, etc. It can't turn that back into the original DDL in any kind of reliable way. A downstream can't do very much with "rename relfilenode 1231 to 1241".

There are a few cases we might want to handle through decoding - in particular I'd like to be able to decode changes to rows in shared catalogs like pg_authid, since we can't handle that with DDL deparse. For things like DROP TABLE, CREATE TABLE, etc we really need DDL hooks. At least as I currently understand things.

So we try to capture DDL at a higher level. That's why event triggers were added (http://www.postgresql.org/docs/current/static/event-triggers.html) and why DDL deparse was implemented (https://commitfest-old.postgresql.org/action/patch_view?id=1610).

You can't just capture the raw DDL statement since there are issues with search_path normalization, etc. Similar problems to statement based replication exist. Deparse is required to get the DDL after it's converted to a utility statement so we can obtain it in an unambiguous form.

I'll add some explanation in pglogical_output's DESIGN.md for why DDL is not currently handled.

TRUNCATE _is_ handled, by the way. In pglogical we use regular TRUNCATE triggers (marked tgisinternal) for that. There are some significant complexities around foreign keys, sequence reset, etc, which are not fully handled yet.

Are there some principle problems with it? In BDR it was handled in alternative way, using executor callback. It will be much easier if DDL can be replicated in the same way as normal SQL statements.

It can't. I wish it could.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
"Shulgin, Oleksandr"
Date:
On Thu, Dec 3, 2015 at 8:34 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

Are there some principle problems with it? In BDR it was handled in alternative way, using executor callback. It will be much easier if DDL can be replicated in the same way as normal SQL statements.

It can't. I wish it could.

That reminds me of that DDL deparsing patch I was trying to revive a while ago.  Strangely, I cannot find it in any of the commit fests.  Will add it.

--
Alex

Re: Logical replication and multimaster

From
konstantin knizhnik
Date:

On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:

On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
And where I can get  pglogical_output plugin? Sorry, but I can't quickly find reference with Google...

It's been submitted to this CF.



Any tests and comments would be greatly appreciated.


Thank you.
I wonder if there is an opposite part of the pipe for pglogical_output - an analog of receiver_raw?




Re: Logical replication and multimaster

From
Simon Riggs
Date:
On 3 December 2015 at 12:06, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:

On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
And where I can get  pglogical_output plugin? Sorry, but I can't quickly find reference with Google...

It's been submitted to this CF.



Any tests and comments would be greatly appreciated.


Thank you.
I wonder if there is opposite part of the pipe for pglogical_output - analog of receiver_raw?

Yes, there is. pglogical is currently in test and will be available sometime soon. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
Simon Riggs
Date:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
But looks like there is not so much sense in having multiple network connection between one pair of nodes.
It seems to be better to have one connection between nodes, but provide parallel execution of received transactions at destination side. But it seems to be also nontrivial. We have now in PostgreSQL some infrastructure for background works, but there is still no abstraction of workers pool and job queue which can provide simple way to organize parallel execution of some jobs. I wonder if somebody is working now on it or we should try to propose our solution?

There are definitely two clear places where additional help would be useful and welcome right now.

1. Allowing logical decoding to have a "speculative pre-commit data" option, to allow some data to be made available via the decoding api, allowing data to be transferred prior to commit. This would allow us to reduce the delay that occurs at commit, especially for larger transactions or very low latency requirements for smaller transactions. Some heuristic or user interface would be required to decide whether to and which transactions might make their data available prior to commit. And we would need to send abort messages should the transactions not commit as expected. That would be a patch on logical decoding and is an essentially separate feature to anything currently being developed.
 
2. Some mechanism/theory to decide when/if to allow parallel apply. That could be used for both physical and logical replication. Since the apply side of logical replication is still being worked on there is a code dependency there, so a working solution isn't what is needed yet. But the general principles and any changes to the data content (wal_level) or protocol (pglogical_output) would be useful.

We already have working multi-master that has been contributed to PGDG, so contributing that won't gain us anything. There is a lot of code and pglogical is the most useful piece of code to be carved off and reworked for submission. The bottleneck is review and commit, not initial development - which applies both to this area and most others in PostgreSQL.

Having a single network connection between nodes would increase efficiency but also increase replication latency, so it's not useful in all cases.

I think having some kind of message queue between nodes would also help, since there are many cases for which we want to transfer data, not just a replication data flow. For example, consensus on DDL, or MPP query traffic. But that is open to wider debate.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 3 December 2015 at 15:27, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:

> Excellent.
>
> It should be possible to make that a separate extension. You can use C functions from other extensions by exposing a single pg_proc function with 'internal' return type that populates a struct of function pointers for the API. A single DirectFunctionCall lets you get the API struct. That's how pglogical_output handles hooks. The main downside is that you can't do that without a connection to a database with the extension installed so the pg_proc entry is exposed.


Actually, working under cluster and columnar storage extension I got several questions about PostgreSQL infrastructure.
I always found some workarounds, but may it is better to ask community about it:)

1. Why there is no "conditional event" synchronization primitive in PostgreSQL. There is latch, but it is implemented using sockets and I afraid that it is not very fast.
It will be nice to have some fast primitive like pthread condition variables.

The need for IPC makes things a bit more complex. Most places can get away with using a latch, testing one or more conditions, and resuming waiting.

While what you describe sounds possibly nice, is there any evidence that it's a bottleneck or performance issue? Or is this premature optimisation at work?
 
2. PostgreSQL semaphores seems to be not intended for external use outside PostgreSQL core  (for example in extensions).
There is no way to request additional amount of semaphores. Right now semaphores are allocated based on maximal number of backends and spinlocks.

Same with spinlocks AFAIK.

You can add your own LWLocks though.
 
3. What is the right way of creation of background worker requiring access to shared memory, i.e. having control structure in main memory?

This is documented and well established.
 
As far as I understand background workers have to be registered either PG_init, either outside Postmaster environment.
If extension requires access to shared memory, then it should be registered in shared_preload_libraries list and should be initialized using shmem_startup hook.

Correct. 

You can use dynamic shmem instead, but there are some issues there IIRC. Petr may have more to say there.
 
Take a look at the BDR code for some examples, and there are some in contrib too I think.

My_shmem_startup is needed because in _PG_init it is not possible to allocate shared memory.

Correct, since it's in early postmaster start.
 
So if I need to allocate some control structure for background workers  in shared memory, then I should do it in My_shmem_startup.

Yes.
 
But I can not register background workers in My_shmem_startup!

Correct. Register static bgworkers in _PG_init. Register dynamic bgworkers later, in a normal backend function or a bgworker main loop.
 
So I have to register background workers in PG_init while control structure for them is not yet ready.

Correct.

They aren't *started* until after shmem init, though.
 
When I have implemented pool of background workers, I solved this problem by proving function which return address of control structure later - when it will be actually allocated.

Beware of EXEC_BACKEND. You can't assume you have shared postmaster memory from fork().

I suggest that you allocate a static shmem array. Pass indexes into it as the arguments to the bgworkers.  Have them look up their index in the array to get their struct pointer.
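
A rough sketch of that pattern (illustration only: the slot struct, names and sizes are hypothetical, and real code would also reserve the shmem via a request hook and add locking):

#include "postgres.h"
#include "postmaster/bgworker.h"
#include "storage/shmem.h"

#define MAX_POOL_WORKERS 16

typedef struct WorkerSlot
{
    Oid     dboid;              /* database the worker should connect to */
    /* ... whatever else the worker needs at startup ... */
} WorkerSlot;

static WorkerSlot *worker_slots;    /* points into static shmem */

/* shmem_startup hook: create or attach to the slot array. */
static void
pool_shmem_startup(void)
{
    bool    found;

    worker_slots = ShmemInitStruct("pool worker slots",
                                   sizeof(WorkerSlot) * MAX_POOL_WORKERS,
                                   &found);
}

/* Launch a worker, passing only the slot index as its Datum argument. */
static void
launch_pool_worker(int slot)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_main_arg = Int32GetDatum(slot);
    snprintf(worker.bgw_name, BGW_MAXLEN, "pool worker %d", slot);
    /* bgw_main (or bgw_library_name/bgw_function_name) set as appropriate */

    RegisterDynamicBackgroundWorker(&worker, NULL);
}

/* Worker entry point: safe under EXEC_BACKEND, since only an index is passed. */
void
pool_worker_main(Datum main_arg)
{
    WorkerSlot *my_slot = &worker_slots[DatumGetInt32(main_arg)];

    /* ... connect to my_slot->dboid and start consuming the job queue ... */
}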

Read the BDR code to see how this can work; see bdr_perdb.c, bdr_apply.c, etc's bgworker main loops, bdr_supervisor.c and bdr_perdb.c's code for registering dynamic bgworkers, and the _PG_init function's setup of the static supervisor bgworker.

In your case I think you should probably be using dynamic bgworkers for your pool anyway, so you can grow and shrink them as-needed.

But it seems to be some design flaw in BGW, isn' it?

I don't think so. You're registering the worker, saying "when you're ready please start this". You're not starting it.

You can use dynamic bgworkers too. Same deal, you register them and the postmaster starts them in a little while, but you can register them after _PG_init.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 3 December 2015 at 19:06, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:

On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
And where I can get  pglogical_output plugin? Sorry, but I can't quickly find reference with Google...

It's been submitted to this CF.



Any tests and comments would be greatly appreciated.


Thank you.
I wonder if there is opposite part of the pipe for pglogical_output - analog of receiver_raw?

It's pglogical, and it's in progress, due to be released at the same time as 9.5. We're holding it a little longer to nail down the user interface a bit better, etc, and because sometimes the real world gets in the way.

The catalogs  and UI are very different to BDR, it's much more extensible/modular, it supports much more flexible topologies, etc... but lots of the core concepts are very similar. So if you go take a look at the BDR code that'll give you a pretty solid idea of how a lot of it works, though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl replication, etc).
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 3 December 2015 at 20:39, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
But looks like there is not so much sense in having multiple network connection between one pair of nodes.
It seems to be better to have one connection between nodes, but provide parallel execution of received transactions at destination side. But it seems to be also nontrivial. We have now in PostgreSQL some infrastructure for background works, but there is still no abstraction of workers pool and job queue which can provide simple way to organize parallel execution of some jobs. I wonder if somebody is working now on it or we should try to propose our solution?

There are definitely two clear places where additional help would be useful and welcome right now.

Three IMO, in that a re-usable, generic bgworker pool driven by shmem messaging would be quite handy. We'll want something like that when we have transaction interleaving.

I think Konstantin's design is a bit restrictive at the moment; at the least it needs to address sticky dispatch, and it almost certainly needs to be using dynamic bgworkers (and maybe dynamic shmem too) to be flexible. Some thought will be needed to make sure it doesn't rely on !EXEC_BACKEND stuff like passing pointers to fork()ed data from postmaster memory too. But the general idea sounds really useful, and we'll either need that or to use async libpq for concurrent apply.
 
1. Allowing logical decoding to have a "speculative pre-commit data" option, to allow some data to be made available via the decoding api, allowing data to be transferred prior to commit.

Petr, Andres and I tended to refer to that as interleaved transaction streaming. The idea being to send changes from multiple xacts mixed together in the stream, identified by an xid sent with each message, as we decode them from WAL. Currently we add them to a local reorder buffer and send them only in commit order after commit.

This moves responsibility for xact ordering (and buffering, if necessary) to the downstream. It introduces the possibility that concurrently replayed xacts could deadlock with each other and a few exciting things like that, too, but with the payoff that we can continue to apply small transactions in a timely manner even as we're streaming a big transaction like a COPY.

We could possibly enable interleaving right from the start of the xact, or only once it crosses a certain size threshold. For your purposes Konstantin you'd want to do it right from the start since latency is crucial for you. For pglogical we'd probably want to buffer them a bit and only start streaming if they got big.

This would allow us to reduce the delay that occurs at commit, especially for larger transactions or very low latency requirements for smaller transactions. Some heuristic or user interface would be required to decide whether to and which transactions might make their data available prior to commit.

I imagine we'd have a knob, either global or per-slot, that sets a threshold based on size in bytes of the buffered xact. With 0 allowed as "start immediately".
 
And we would need to send abort messages should the transactions not commit as expected. That would be a patch on logical decoding and is an essentially separate feature to anything currently being developed.

I agree that this is strongly desirable. It'd benefit anyone using logical decoding and would have wide applications.
  
2. Some mechanism/theory to decide when/if to allow parallel apply.

I'm not sure it's as much about allowing it as how to do it.
 
We already have working multi-master that has been contributed to PGDG, so contributing that won't gain us anything.

Namely BDR.
 
There is a lot of code and pglogical is the most useful piece of code to be carved off and reworked for submission.

Starting with the already-published output plugin, with the downstream to come around the release of 9.5.
 
Having a single network connection between nodes would increase efficiency but also increase replication latency, so its not useful in all cases.

If we interleave messages I'm not sure it's too big a problem. Latency would only become an issue there if a big single row (big Datum contents) causes lots of small work to get stuck behind it.

IMO this is a separate issue to be dealt with later.

I think having some kind of message queue between nodes would also help, since there are many cases for which we want to transfer data, not just a replication data flow. For example, consensus on DDL, or MPP query traffic. But that is open to wider debate.

Logical decoding doesn't really define any network protocol at all. It's very flexible, and we can throw almost whatever we want down it. The pglogical_output protocol is extensible enough that we can just add additional messages when we need to, making them opt-in so we don't break clients that don't understand them.

I'm likely to need to do that soon for sequence-advance messages if I can get logical decoding of sequence advance working.

We might want a way to queue those messages at a particular LSN, so we can use them for replay barriers etc and ensure they're crash-safe. Like the generic WAL messages used in BDR and proposed for core. Is that what you're getting at? WAL messages would certainly be nice, but I think we can mostly if not entirely avoid the need for them if we have transaction interleaving and concurrent transaction support.

Somewhat related, I'd quite like to be able to send messages from downstream back to upstream, where they're passed to a hook on the logical decoding plugin. That'd eliminate the need to do a whole bunch of stuff that currently has to be done using direct libpq connections or a second decoding slot in the other direction. Basically send a CopyData packet in the other direction and have its payload passed to a new hook on output plugins.
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Petr Jelinek
Date:
On 2015-12-03 14:32, Craig Ringer wrote:
> On 3 December 2015 at 15:27, konstantin knizhnik
> <k.knizhnik@postgrespro.ru <mailto:k.knizhnik@postgrespro.ru>> wrote:
>
>     3. What is the right way of creation of background worker requiring
>     access to shared memory, i.e. having control structure in main memory?
>
>
> This is documented and well established.
>
>     As far as I understand background workers have to be registered
>     either PG_init, either outside Postmaster environment.
>     If extension requires access to shared memory, then it should be
>     registered in shared_preload_libraries list and should be
>     initialized using shmem_startup hook.
>
>
> Correct.
>
> You can use dynamic shmem instead, but there are some issues there IIRC.
> Petr may have more to say there.
> Take a look at the BDR code for some examples, and there are some in
> contrib too I think.
>

If you have your own flock of dynamic workers that you manage yourself, 
it's probably easier to use dynamic shared memory. You can see some 
examples in the tests and also in the parallel query code for how to do 
it. The only real issue we faced with using dynamic shared memory was 
that we needed to do IPC from normal backends and that gets complicated 
when you don't have the worker info in the normal shmem.


The registration timing and working with normal shmem is actually not a 
problem. Just register shmem start hook in _PG_init and if you are 
registering any bgworkers there as well make sure you set bgw_start_time 
correctly (usually what you want is BgWorkerStart_RecoveryFinished). 
Then you'll have the shmem hook called before the bgworker is actually 
started.
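
A minimal sketch of that ordering (illustrative only; the hook body and worker entry point are elided, and bgw_main is the 9.4/9.5-style way to name the entry point):

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void my_shmem_startup(void);     /* allocates the control structures */
extern void my_worker_main(Datum arg);  /* worker entry point */

void
_PG_init(void)
{
    BackgroundWorker worker;

    if (!process_shared_preload_libraries_in_progress)
        return;

    /* Register the shmem hook first ... */
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = my_shmem_startup;

    /*
     * ... then the static worker.  With BgWorkerStart_RecoveryFinished it
     * will not be started until well after the shmem hook has run.
     */
    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_main = my_worker_main;
    snprintf(worker.bgw_name, BGW_MAXLEN, "pool supervisor");
    RegisterBackgroundWorker(&worker);
}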

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical replication and multimaster

From
Craig Ringer
Date:

On 3 December 2015 at 20:39, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
But looks like there is not so much sense in having multiple network connection between one pair of nodes.
It seems to be better to have one connection between nodes, but provide parallel execution of received transactions at destination side. But it seems to be also nontrivial. We have now in PostgreSQL some infrastructure for background works, but there is still no abstraction of workers pool and job queue which can provide simple way to organize parallel execution of some jobs. I wonder if somebody is working now on it or we should try to propose our solution?

There are definitely two clear places where additional help would be useful and welcome right now.

1. Allowing logical decoding to have a "speculative pre-commit data" option, to allow some data to be made available via the decoding api, allowing data to be transferred prior to commit.

Something relevant I ran into re this:

in reorderbuffer.c, on ReorderBufferCommit:

   * We currently can only decode a transaction's contents in when their commit
   * record is read because that's currently the only place where we know about
   * cache invalidations. Thus, once a toplevel commit is read, we iterate over
   * the top and subtransactions (using a k-way merge) and replay the changes in
   * lsn order.

I haven't dug into the implications particularly as I'm chasing something else, but want to note it on the thread. Here be dragons when it comes to transaction streaming.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Konstantin Knizhnik
Date:
Hi,

I have integrated pglogical_output into the multimaster, using bdr_apply from BDR as a template for the implementation of the receiver part.
The time of inserts is reduced almost 10 times compared with logical replication based on the decoder_raw/receiver_raw plugins, which performs logical replication using SQL statements. But unfortunately the time of updates is almost unchanged.
This is the expected result, because I didn't see any functions related to SQL parsing/preparing in the profile.
Now in both cases the profile is similar:
 
  4.62%  postgres                            [.] HeapTupleSatisfiesMVCC
  2.99%  postgres                            [.] heapgetpage
  2.10%  postgres                            [.] hash_search_with_hash_value
  1.86%  postgres                            [.] ExecProject
  1.80%  postgres                            [.] heap_getnext
  1.79%  postgres                            [.] PgXidInMVCCSnapshot

By the way, you asked about comments concerning pglogical_output. I have one: most of the pglogical protocol functions have a "PGLogicalOutputData *data" parameter. There are a few exceptions:

    write_startup_message_fn, pglogical_write_origin_fn, pglogical_write_rel_fn

PGLogicalOutputData is the only way to pass protocol-specific data, using its "PGLogicalProtoAPI *api" field.
This field is assigned by the pglogical_init_api() function. And I can extend this PGLogicalProtoAPI structure by adding some protocol-specific fields.
For example, this is how it is done now for multimaster:

typedef struct PGLogicalProtoMM
{
    PGLogicalProtoAPI api;
    bool isLocal; /* mark transaction as local */
} PGLogicalProtoMM;

PGLogicalProtoAPI *
pglogical_init_api(PGLogicalProtoType typ)
{
    PGLogicalProtoMM* pmm = palloc0(sizeof(PGLogicalProtoMM));
    PGLogicalProtoAPI* res = &pmm->api;
    pmm->isLocal = false;
    res->write_rel = pglogical_write_rel;
    res->write_begin = pglogical_write_begin;
    res->write_commit = pglogical_write_commit;
    res->write_insert = pglogical_write_insert;
    res->write_update = pglogical_write_update;
    res->write_delete = pglogical_write_delete;
    res->write_startup_message = write_startup_message;
    return res;
}

But I have to add "PGLogicalOutputData *data"  parameter to pglogical_write_rel_fn function.
Di you think that it will be better to pass this parameter to all functions?

May be it is not intended way of passing custom data to this functions...
Certainly it is possible to use static variables for this purpose.
But I think that passing user specific data through PGLogicalOutputData is safer and more flexible solution.
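
For illustration, this is roughly why the data parameter matters: with it, a protocol callback can get back to the custom struct that wraps the API. A minimal sketch; the callback name and signature are hypothetical, only the PGLogicalProtoMM layout and the "PGLogicalProtoAPI *api" field come from the code above:

/*
 * Sketch only: mm_write_rel and its signature are hypothetical.
 * What is grounded in the code above: PGLogicalProtoMM embeds the
 * PGLogicalProtoAPI as its first member, and PGLogicalOutputData
 * carries a "PGLogicalProtoAPI *api" pointer, so the wrapper struct
 * can be recovered with a cast.
 */
static void
mm_write_rel(StringInfo out, PGLogicalOutputData *data, Relation rel)
{
    /* api points at the first member of PGLogicalProtoMM, so this cast
     * is safe as long as pglogical_init_api() allocated the wrapper. */
    PGLogicalProtoMM *pmm = (PGLogicalProtoMM *) data->api;

    if (pmm->isLocal)
        return;                 /* suppress output for locally-marked xacts */

    /* ... otherwise emit the relation metadata as usual ... */
}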



On 12/03/2015 04:53 PM, Craig Ringer wrote:
On 3 December 2015 at 19:06, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:

On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
And where I can get  pglogical_output plugin? Sorry, but I can't quickly find reference with Google...

It's been submitted to this CF.



Any tests and comments would be greatly appreciated.


Thank you.
I wonder if there is opposite part of the pipe for pglogical_output - analog of receiver_raw?

It's pglogical, and it's in progress, due to be released at the same time as 9.5. We're holding it a little longer to nail down the user interface a bit better, etc, and because sometimes the real world gets in the way.

The catalogs  and UI are very different to BDR, it's much more extensible/modular, it supports much more flexible topologies, etc... but lots of the core concepts are very similar. So if you go take a look at the BDR code that'll give you a pretty solid idea of how a lot of it works, though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl replication, etc).
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 7 December 2015 at 01:39, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

I have integrated pglogical_output in multimaster

Excellent.

I just pushed a change to pglogical_output that exposes the row contents (and the rest of the reorder change buffer contents) to hooks that want it, by the way.
 
using bdr_apply from BDR as template for implementation of receiver part.

Yep, that'll tide you over. We're working hard on getting the downstream part ready and you'll find it more flexible.
 
The time of insert is reduced almost 10 times comparing with logical replication based on decoder_raw/receiver_raw plugins which performs logical replication using SQL statements. But unfortunately time of updates is almost not changed.

That's not too surprising, given that you'll have significant overheads for checking if keys are present when doing updates.

This field is assigned by  pglogical_init_api() function. And I can extend this PGLogicalProtoAPI structure by adding some protocol specific fields.

Yep, that's the idea.
 
typedef struct PGLogicalProtoMM
{
    PGLogicalProtoAPI api;
    bool isLocal; /* mark transaction as local */
} PGLogicalProtoMM;

I'm curious about what you're using the 'isLocal' field for.

For MM you should only need to examine the replication origin assigned to the transaction to determine whether you're going to forward it or not.

Were you not able to achieve what you wanted with a hook? If not, then we might need another hook. Could you explain what it's for in more detail?

What I suggest is: have your downstream client install a pglogical_output hook for the transaction filter hook. There, examine the replication origin passed to the hook. If you want to forward locally originated xacts only (such as for mesh multimaster) you can just filter out everything where the origin is not InvalidRepOriginId. There are example hooks in contrib/pglogical_output_plhooks .
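
For concreteness, a minimal sketch of such a filter. The hook and argument struct names here are assumptions, not the actual pglogical_output API (check the plugin's hooks header and the plhooks examples for the real signatures); InvalidRepOriginId and RepOriginId are the real server definitions from replication/origin.h:

#include "postgres.h"
#include "replication/origin.h"

/* Hypothetical argument struct; the real one lives in pglogical_output's hooks header. */
typedef struct TxnFilterArgs
{
    RepOriginId origin_id;      /* replication origin of the decoded transaction */
} TxnFilterArgs;

/*
 * Return true to forward the transaction, false to skip it.
 * For mesh multimaster, forward only locally originated xacts,
 * i.e. those with no replication origin set.
 */
static bool
mm_txn_filter(TxnFilterArgs *args)
{
    return args->origin_id == InvalidRepOriginId;
}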

There'll be a simple MM example using filter hooks in the pglogical downstream btw and we're working hard to get that out.
 
But I have to add "PGLogicalOutputData *data"  parameter to pglogical_write_rel_fn function.
Do you think that it will be better to pass this parameter to all functions?

Yes, I agree that it should be passed to the API for the output protocol. It's pretty harmless. Please feel free to send a pull req.

Note that we haven't made that pluggable from the outside though; there's no way to load a new protocol distributed separately from pglogical_output. The idea is really to make sure that between the binary protocol and json protocol we meet the reasonably expected set of use cases and don't need pluggable protocols. Perhaps that's over-optimistic, but we've already got an output plugin that has plug-in hooks, a plugin for a plugin. Do we need another? Also, if we allow dynamic loading of new protocols then that means we'll have a much harder time changing the protocol implementation API later, so it's not something I'm keen to do. Also, to make it secure to allow users to specify the protocol we'd have to make protocols implement an extension with a C function in pg_proc to return its API struct, like we do for hooks. So there'd be more hoop-jumping required to figure out how to talk to the client.

If possible I'd like to find any oversights and omissions in the current protocol and its APIs to meet future use cases without having to introduce protocol plugins for an output plugin.

May be it is not intended way of passing custom data to this functions...

Yeah, we weren't really thinking of the protocol API as intended to be pluggable and extensible. If you define new protocols you have to change the rest of the output plugin code anyway.

Lets look at what protocol changes are needed to address your use case and see whether it's necessary to take the step of making the protocol fully pluggable or not.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Craig Ringer
Date:

There are definitely two clear places where additional help would be useful and welcome right now.

Here's a shortlist of replication-related open items I put together earlier. These are all independent things it'd be helpful to have when working with logical replication in PostgreSQL. There are a lot of medium-sized pieces of independent work still to be done to make some things work well.

For example, you can't have multimaster logical replication with physical failover nodes for each MM node unless we replicate slot create/drop/advance over physical replication and copy them with pg_basebackup.

Handling big transactions better:

* logical decoding interleaved transaction streaming as discussed earlier

Automatic DDL replication:

* DDL deparse extension (Álvaro started some work on this)
* Better way to link the pg_temp_nnn tables generated during table rewrite to the original table in decoding
* logical decoding support for prepared xacts (before commit prepared). Useful for DDL replication, other consensus operations.

Using physical replication/PITR with logical replication:

* replication slots replicated to physical standbys ("Failover slots")
* pg_basebackup should copy slots
* slots should follow timeline changes

Failover between logical replicas:

* logical decoding support for sequences
* logical decoding support for slot create/drop/advance
* Separate slot WAL retention from confirmed commit point so you can say "I've replayed up to 1/0000AB but you should keep from 1/000001". Needed in async MM to cope with node loss properly. Will write it up separately.

Multimaster:

* Sequence Access Method

Other logical decoding enhancements:

* Support exporting a new snapshot from an existing logical slot. Useful for COPYing new tables not previously replicated when added to a replication set, for resync'ing tables, comparing tables, etc.
* WAL messages. Useful for voting, replay confirmation, etc. Rendered largely unnecessary by xact interleaved streaming.

Misc:

* An API to enumerate currently registered bgworkers
* A way to securely make a libpq connection from a bgworker without messing with passwords etc. Generate one-time cookies, something like that.
* (unimportant but nice API fix): BGW_NO_RESTART_ON_CRASH

  

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Robert Haas
Date:
On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> * An API to enumerate currently registered bgworkers
> * A way to securely make a libpq connection from a bgworker without messing
> with passwords etc. Generate one-time cookies, sometihng like that.
> * (unimportant but nice API fix): BGW_NO_RESTART_ON_CRASH

Why would you have the bgworker connect to the database via TCP
instead of just doing whatever it wants to do directly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication and multimaster

From
Andres Freund
Date:
On 2015-12-03 09:54:23 +0300, konstantin knizhnik wrote:
> But right now performance of Multimaster is not limited by logical replication protocol - if I remove DTM and use
> asynchronous replication (lightweight version of BDR:)
 
> then I get 38k TPS instead of 12k.

My guess is that that's to a large degree because BDR 'batches' WAL
flushes / fsyncs over several connections. As the data is only applied
in one connection, whereas the primary can use multiple backends, it is
important not to constantly flush, as that's synchronous.

What I did for bdr was to register the 'desired' flush position whenever
replaying a commit (c.f. dlist_push_tail(&bdr_lsn_association,
&flushpos->node); in process_remote_commit()) and whenever feedback is
sent figure out how far the WAL actually has been flushed
(c.f. bdr_get_flush_position()).
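
For illustration, a minimal sketch of that bookkeeping. The names FlushPosition, my_lsn_association, remember_flush_position and get_flush_position are made up here; the dlist_* helpers, GetFlushRecPtr() and XLogRecPtr are the real server APIs:

#include "postgres.h"
#include "access/xlog.h"
#include "lib/ilist.h"

typedef struct FlushPosition
{
    dlist_node  node;
    XLogRecPtr  local_end;   /* local WAL end of the applied commit */
    XLogRecPtr  remote_end;  /* upstream commit LSN it corresponds to */
} FlushPosition;

static dlist_head my_lsn_association = DLIST_STATIC_INIT(my_lsn_association);

/* Called when replaying a remote commit: remember the association, do not fsync here. */
static void
remember_flush_position(XLogRecPtr local_end, XLogRecPtr remote_end)
{
    FlushPosition *pos = palloc(sizeof(FlushPosition));

    pos->local_end = local_end;
    pos->remote_end = remote_end;
    dlist_push_tail(&my_lsn_association, &pos->node);
}

/* Called when sending feedback: report the highest remote LSN whose local WAL is flushed. */
static XLogRecPtr
get_flush_position(void)
{
    XLogRecPtr  local_flush = GetFlushRecPtr();
    XLogRecPtr  remote_flush = InvalidXLogRecPtr;
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &my_lsn_association)
    {
        FlushPosition *pos = dlist_container(FlushPosition, node, iter.cur);

        if (pos->local_end > local_flush)
            break;              /* not flushed yet; later entries are newer */

        remote_flush = pos->remote_end;
        dlist_delete(iter.cur);
        pfree(pos);
    }
    return remote_flush;
}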

Now that cannot trivially be done with 2PC, but it doesn't look all that
hard to change the 2PC API to allow at least some batching of the
fsyncs.

Greetings,

Andres Freund



Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 10 December 2015 at 03:19, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
 
> * A way to securely make a libpq connection from a bgworker without messing
> with passwords etc. Generate one-time cookies, sometihng like that.

Why would you have the bgworker connect to the database via TCP
instead of just doing whatever it wants to do directly?


pg_dump and pg_restore, mainly, for copying the initial database state.

PostgreSQL doesn't have SQL-level function equivalents, nor pg_get_tabledef() etc, and there's been strong opposition to adding anything of the sort when it's been raised before. We could read a dump in via pg_restore's text conversion and run the appropriate queries over the SPI, doing the query splitting, COPY parsing and loading, etc ourselves in a bgworker. It'd be ugly and duplicate a lot, but it'd work. However, it wouldn't be possible to do restores in parallel that way, and that's necessary to get good restore performance on big DBs. For that we'd also basically rewrite pg_restore's parallel functionality using a bgworker pool. 

The alternative is a massive rewrite of pg_dump and pg_restore to allow them to be used as libraries, and let them use either libpq or the SPI for queries, presumably via some level of abstraction layer. As well as further abstraction for pipelining parallel work. Not very practical, and IIRC whenever library-ifying pg_dump and pg_restore has been discussed before it's been pretty firmly rejected.

Also, parallelism at apply time. There are two ways to do apply work in parallel - a pool of bgworkers that each use the SPI, or using regular backends managing async libpq connections. At this point I think Konstantin's approach, with a bgworker pool that processes a work queue, is probably better for this, and want to explore making that a re-usable extension for 9.5 and possibly a core part of 9.6 or 9.7.
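
As a rough sketch of what launching such a pool could look like with the existing dynamic bgworker API (RegisterDynamicBackgroundWorker and the BackgroundWorker fields below are the real 9.5-era API; the "multimaster" library name, mm_apply_worker_main entry point and pool size are made-up placeholders, and the job queue itself, e.g. one shm_mq per worker, is left out):

#include "postgres.h"
#include "postmaster/bgworker.h"

#define MM_POOL_SIZE 4

static void
mm_start_apply_pool(void)
{
    int i;

    for (i = 0; i < MM_POOL_SIZE; i++)
    {
        BackgroundWorker worker;
        BackgroundWorkerHandle *handle;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                           BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_name, BGW_MAXLEN, "mm apply worker %d", i);
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "multimaster");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "mm_apply_worker_main");
        worker.bgw_main_arg = Int32GetDatum(i);   /* worker slot number */

        if (!RegisterDynamicBackgroundWorker(&worker, &handle))
            elog(WARNING, "could not start apply worker %d", i);
    }
}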

So it's mainly for pg_restore.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Andres Freund
Date:
On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
> On 10 December 2015 at 03:19, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com>
> > wrote:
> > > * A way to securely make a libpq connection from a bgworker without
> > messing
> > > with passwords etc. Generate one-time cookies, sometihng like that.
> >
> > Why would you have the bgworker connect to the database via TCP
> > instead of just doing whatever it wants to do directly?

> pg_dump and pg_restore, mainly, for copying the initial database state.

Well, you don't want to necessarily directly connect from the bgworker,
but from processes started from a bgworker. I guess that's where a good
bit of Robert's confusion originated.



Re: Logical replication and multimaster

From
Robert Haas
Date:
On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
>> On 10 December 2015 at 03:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com>
>> > wrote:
>> > > * A way to securely make a libpq connection from a bgworker without
>> > messing
>> > > with passwords etc. Generate one-time cookies, sometihng like that.
>> >
>> > Why would you have the bgworker connect to the database via TCP
>> > instead of just doing whatever it wants to do directly?
>
>> pg_dump and pg_restore, mainly, for copying the initial database state.
>
> Well, you don't want to necessarily directly connect from the bgworker,
> but from processes started from a bgworker. I guess that's where a good
> bit of the Robert's confusion originated.

That's part of it, yeah.  I'm a little scared of this design.  I mean,
I understand now why Craig wants to do this (thanks for explaining,
Craig!), but it seems like it's going to have a lot of the same
reliability problems that pg_upgrade does.  I'm not saying there's a
better way to get the functionality, but it's pretty obvious that
depending on tools other than the server itself, and in particular
pg_dump, vastly increases the failure surface area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication and multimaster

From
Petr Jelinek
Date:
On 2015-12-11 19:24, Robert Haas wrote:
> On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
>>> On 10 December 2015 at 03:19, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com>
>>>> wrote:
>>>>> * A way to securely make a libpq connection from a bgworker without
>>>> messing
>>>>> with passwords etc. Generate one-time cookies, sometihng like that.
>>>>
>>>> Why would you have the bgworker connect to the database via TCP
>>>> instead of just doing whatever it wants to do directly?
>>
>>> pg_dump and pg_restore, mainly, for copying the initial database state.
>>
>> Well, you don't want to necessarily directly connect from the bgworker,
>> but from processes started from a bgworker. I guess that's where a good
>> bit of the Robert's confusion originated.
>
> That's part of it, yeah.  I'm a little scared of this design.  I mean,
> I understand now why Craig wants to do this (thanks for explaining,
> Craig!), but it seems like it's going to have a lot of the same
> reliability problems that pg_upgrade does.  I'm not saying there's a
> better way to get the functionality, but it's pretty obvious that
> depending on tools other than the server itself, and in particular
> pg_dump, vastly increases the failure surface area.
>

Well, it's better than trying to write a completely new catalog dump tool 
for this. As Craig said, it would be best if pg_dump's functionality were 
moved to functions in the backend, but that's probably not gonna happen 
tomorrow.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical replication and multimaster

From
Craig Ringer
Date:
On 12 December 2015 at 02:24, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
>> On 10 December 2015 at 03:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer <craig@2ndquadrant.com>
>> > wrote:
>> > > * A way to securely make a libpq connection from a bgworker without
>> > messing
>> > > with passwords etc. Generate one-time cookies, sometihng like that.
>> >
>> > Why would you have the bgworker connect to the database via TCP
>> > instead of just doing whatever it wants to do directly?
>
>> pg_dump and pg_restore, mainly, for copying the initial database state.
>
> Well, you don't want to necessarily directly connect from the bgworker,
> but from processes started from a bgworker. I guess that's where a good
> bit of the Robert's confusion originated.

That's part of it, yeah.  I'm a little scared of this design.  I mean,
I understand now why Craig wants to do this (thanks for explaining,
Craig!), but it seems like it's going to have a lot of the same
reliability problems that pg_upgrade does.

Yes, and more.

Especially when dealing with multiple upstream servers, etc.

It's not very nice. I would very much prefer to have a better way to achieve the initial data sync, but at present I don't think there is any better approach that's even remotely practical.

I'm not saying there's a better way to get the functionality

Yup. That's the problem.
 
but it's pretty obvious that depending on tools other than the server itself, and in particular
pg_dump, vastly increases the failure surface area.

 It's not too bad to find pg_dump, though we landed up not being able to re-use find_other_exec for various reasons I'll have to try to dig out of the cruftier corners of my memory.  It has a fairly sane interface too.

Things get hairy when you want to do things like "give me all the upstream's non-table objects, then give me [this set of table definitions]"... then you go and sync the data from an exported snapshot using COPY, then finish up by restoring the constraints for the set of tables you dumped.

Being able to access pg_dump and pg_restore's dependency resolution logic, object dumping routines, etc from regular SQL and from the SPI would be wonderful.

I believe the main complaints about doing that when it was discussed in the past were issues with downgrading. A pg_get_tabledef(...) in 9.6 might emit keywords etc that a 9.2 server wouldn't understand, and the way we currently solve this is to require that you run 9.2's pg_dump against the 9.6 server to get a 9.2-compatible dump. So if we had pg_get_blahdef functions we'd still need external versions, so why bother having them?

The alternative is to have all the get_blahdef functions accept a param for server version compatibility, which would work but burden future servers with knowledge about older versions' features and corresponding code cruft for some extended period of time.

So it's gone nowhere to date.

For that matter it's not clear that pg_get_blahdef functions would be the right solution, but I can't see directly poking around in the catalogs and basically re-implementing pg_dump being OK either. So what else could we do?


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical replication and multimaster

From
Simon Riggs
Date:
On 6 December 2015 at 17:39, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I have integrated pglogical_output in multimaster, using bdr_apply from BDR as template for implementation of receiver part. 

I didn't see the patch for this anywhere. Where is the code? 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
Andres Freund
Date:
On December 13, 2015 10:19:07 AM CET, Simon Riggs <simon@2ndQuadrant.com> wrote:
>I didn't see the patch for this anywhere. Where is the code?


Where is the code for all of pg logical?

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Logical replication and multimaster

From
Simon Riggs
Date:
On 13 December 2015 at 11:02, Andres Freund <andres@anarazel.de> wrote:
On December 13, 2015 10:19:07 AM CET, Simon Riggs <simon@2ndQuadrant.com> wrote:
>I didn't see the patch for this anywhere. Where is the code?

Where is the code for all of pg logical?

Thanks for asking, perhaps our plans weren't public enough. pglogical has already been announced as open source under the postgres licence, and as a submission to core PostgreSQL, just as BDR was. pglogical is in development/test right now and will be released when it's ready, which hopefully will be "soon", aiming for 9.6.

Thanks also for the opportunity to let me ask what your plans are regarding contributing to core? I couldn't make it to SF recently because of a funding meeting, but I heard something like your company will release something as open source sometime in 2016. Could you clarify what that will be, when it will be, what licence it is under and if it is a contribution to core also? Is that something you're working on also?

I don't know the status of Konstantin's work, so same question for him also.

It will be useful to help work out what parts need work on once pglogical is done.

For me, submission to core means both that it is postgres licenced and that the copyright is novated to PGDG, allowing it to be included within PG core.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
Andres Freund
Date:
On 2015-12-13 11:39:32 +0000, Simon Riggs wrote:
> On 13 December 2015 at 11:02, Andres Freund <andres@anarazel.de> wrote:
> > On December 13, 2015 10:19:07 AM CET, Simon Riggs <simon@2ndQuadrant.com>
> > wrote:
> > >I didn't see the patch for this anywhere. Where is the code?
> >
> > Where is the code for all of pg logical?
> >
> 
> Thanks for asking, perhaps our plans weren't public enough. pglogical has
> already been announced as open source, under the postgres licence and that
> it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> in development/test right now and will be released when its ready, which
> hopefully will be "soon", aiming for 9.6.

Well, at the moment not making it public is obviously blocking other
people, and not doing open design discussions publicly seems to make
it rather unlikely that it'll get accepted close to as-is anyway. It's
constantly referred to in discussions, and it guided the design of the
submitted output plugin.


> Thanks also for the opportunity to let me ask what your plans are regarding
> contributing to core?

Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
Citus' time. I've been busy with something else for the last 2-3 months.


> I couldn't make it to SF recently because of a funding meeting, but I
> heard something like your company will release something as open
> source sometime in 2016. Could you clarify what that will be, when it
> will be, what licence it is under and if it is a contribution to core
> also? Is that something you're working on also?

I don't see how that belongs to this thread, it's unrelated to
replication.

Either way, the license is yet to be determined, and it'll be Q1
2016. Yes, I've worked on open sourcing it.


Greetings,

Andres Freund



Re: Logical replication and multimaster

From
Simon Riggs
Date:
On 13 December 2015 at 11:53, Andres Freund <andres@anarazel.de> wrote:
 
> Thanks for asking, perhaps our plans weren't public enough. pglogical has
> already been announced as open source, under the postgres licence and that
> it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> in development/test right now and will be released when its ready, which
> hopefully will be "soon", aiming for 9.6.

Well, at the moment not making it public is obviously blocking other
people, and

What other people are being blocked? What contribution they are making to PostgreSQL core is being delayed? What is the nature of this delay?
 
not doing open design discussions publically seems to make
it rather unlikely that it'll get accepted close to as-is anyway. It's
constantly referred to in discussions, and it guided the design of the
submitted output plugin.

It's a shame you think that, but posting incoherent proposals just wastes everybody's time.

The focus has been on making the internals more generic, unpicking many of the parts of BDR that were too specialized. The output plugin has not been guided by the needs of pglogical at all; it's been designed to be way more open than BDR was.

The UI is a fairly thin layer on top of that and can be recoded without too much work, but I don't think it's a surprising design. It's been quite difficult to cater for the many complex and sometimes conflicting requirements, and that has only recently come together into a coherent form by my hand. Whether the UI makes sense remains to be seen, but I think building it is an essential part of the evaluation of whether it is actually a good UI.

Submission to core implies that changes are possible and discussion is welcome. I expect that to happen. If there were any truly contentious parts they would have been discussed ahead of time. I see no reason to believe that pglogical would not or could not be accepted into 9.6.
 
> Thanks also for the opportunity to let me ask what your plans are regarding
> contributing to core?

Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
Citus' time. I've been busy with something else for the last 2-3 months.

Good, thanks; you misread that and I wasn't questioning it. pglogical developers have a day job too.
 
> I couldn't make it to SF recently because of a funding meeting, but I
> heard something like your company will release something as open
> source sometime in 2016. Could you clarify what that will be, when it
> will be, what licence it is under and if it is a contribution to core
> also? Is that something you're working on also?

I don't see how that belongs to this thread, it's unrelated to
replication.

I assumed your interest in pglogical meant there was some connection.
 
Either way, the license is yet to be determined, and it'll be Q1
2016. Yes, I've worked on open sourcing it.

If it's under the Postgres licence and submitted to core, as is BDR, you may find many people interested in working on it also.

Initial development of major features is IMHO best done by small groups of dedicated developers. That has got nothing at all to do with what happens to the code in the longer term.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
Konstantin Knizhnik
Date:
On 12/13/2015 12:19 PM, Simon Riggs wrote:
On 6 December 2015 at 17:39, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I have integrated pglogical_output in multimaster, using bdr_apply from BDR as template for implementation of receiver part. 

I didn't see the patch for this anywhere. Where is the code?

I am sorry,  the code is now in our internal gitlab repository.
We have published pg_dtm and pg_tsdtm as separate repositories at github.com/postgrespro.
They include the source of the plugin itself and a patch to the PostgreSQL core.
But we find this very inconvenient, because we also have to extend the DTM API, adding new features such as deadlock detection...
So we are going to publish at github.com/postgrespro our branch of PostgreSQL where pg_dtm, pg_tsdtm and multimaster will be available as extensions in the contrib directory.  It will be available on Monday.


--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical replication and multimaster

From
David Fetter
Date:
On Sat, Dec 12, 2015 at 06:48:58PM +0800, Craig Ringer wrote:
> Being able to access pg_dump and pg_restore's dependency resolution logic,
> object dumping routines, etc from regular SQL and from the SPI would be
> wonderful.

As I understand it, pushing these into a library has been proposed but
not rejected.  That it hasn't happened yet is mostly about the lack of
tuits (the round ones) to rewrite the functionality as libraries and
refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
less about writing the code and more about the enormous amount of
testing any such a refactor would entail.

> I believe the main complaints about doing that when it was discussed in the
> past were issues with downgrading. A pg_get_tabledef(...) in 9.6 might emit
> keywords etc that a 9.2 server wouldn't understand, and the way we
> currently solve this is to require that you run 9.2's pg_dump against the
> 9.6 server to get a 9.2-compatible dump. So if we had pg_get_blahdef
> functions we'd still need external versions, so why bother having them?

You have made a persuasive case that major version downgrading is not
a problem we need to solve on the first go.

> The alternative is to have all the get_blahdef functions accept a param for
> server version compatibility, which would work but burden future servers
> with knowledge about older versions' features and corresponding code cruft
> for some extended period of time.
> 
> So it's gone nowhere to date.

I believe that refactoring much of pg_dump's functionality for the
current version of the server into SQL-accessible functions and making
pg_dump use only those functions is achievable with available
resources.

Such a refactor need not be all-or-nothing.  For example, the
dependency resolution stuff is a first step that appears to be worth
doing by itself even if the effort then pauses, possibly for some
time.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Logical replication and multimaster

From
Konstantin Knizhnik
Date:
I have updated the DTM page at the PostgreSQL wiki, adding information about multimaster.
Also we have created a repository at GitHub with our version of PostgreSQL and DTM extensions:
multimaster, pg_dtm, pg_tsdtm, bdr (sorry for the plagiarism, it is just a toy, lightweight version of multimaster with asynchronous replication, used to compare performance).



On 13.12.2015 15:46, Konstantin Knizhnik wrote:
On 12/13/2015 12:19 PM, Simon Riggs wrote:
On 6 December 2015 at 17:39, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I have integrated pglogical_output in multimaster, using bdr_apply from BDR as template for implementation of receiver part. 

I didn't see the patch for this anywhere. Where is the code?

I am sorry,  the code is now in our internal gitlab repository.
We have published pg_dtm and pg_tsdtm as separate repositories at github.com/postgrespro.
Them include source of plugin itself and patch to PostgreSQL core.
But we find it is very inconvenient, because we also have to extend DTM API, adding new features as deadlock detection...
So we are going to publish at github.com/postgrespro our branch of PostgreSQL where pg_dtm, pg_tsdtm and multimaster will be available as extensions in contrib directory.  It will be available at Monday.


--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical replication and multimaster

From
Jim Nasby
Date:
On 12/13/15 7:37 AM, David Fetter wrote:
> As I understand it, pushing these into a library has been proposed but
> not rejected.  That it hasn't happened yet is mostly about the lack of
> tuits (the round ones) to rewrite the functionality as libraries and
> refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
> less about writing the code and more about the enormous amount of
> testing any such a refactor would entail.

My understanding as well. IIRC Jon Erdman brought this question up a 
couple years ago and the response was "It'd probably be accepted, it's 
just that no one has done the work."

> I believe that refactoring much of pg_dump's functionality for the
> current version of the server into SQL-accessible functions and making
> pg_dump use only those functions is achievable with available
> resources.
>
> Such a refactor need not be all-or-nothing.  For example, the
> dependency resolution stuff is a first step that appears to be worth
> doing by itself even if the effort then pauses, possibly for some
> time.

If someone wanted to spend time on this, I suspect it'd be worth looking 
at how bad some of the backward compatibility issues would be if done in 
the server. Maybe they wouldn't be that bad. I suspect the audience for 
this code would be much larger if it was in the server as opposed to a C 
library.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Logical replication and multimaster

From
Jon Erdman
Date:
On Tue, 15 Dec 2015 21:48:52 -0600
Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

> On 12/13/15 7:37 AM, David Fetter wrote:
> > As I understand it, pushing these into a library has been proposed but
> > not rejected.  That it hasn't happened yet is mostly about the lack of
> > tuits (the round ones) to rewrite the functionality as libraries and
> > refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
> > less about writing the code and more about the enormous amount of
> > testing any such a refactor would entail.
>
> My understanding as well. IIRC Jon Erdman brought this question up a
> couple years ago and the response was "It'd probably be accepted, it's
> just that no one has done the work."

To be clear, if I understand what you are referring to, my name pops up in this discussion because in Amsterdam around the time that FDWs were originally launched, I had a "brilliant" idea that a great candidate for an FDW would be one that could read from a custom dump (-Fc) file and expose it as though it contained regular tables, so you could restore one erroneously deleted row from a 2TB dump without loading the whole thing, or the whole table in question.

On the face of it this seemed relatively simple since a custom dump has a TOC and all the requisite goodies to make this doable, plus the code exists to interpret that (for restoring just one table out of a dump file) and all that was needed was the "glue" to hook it into FDW.

Initially the reaction (from Magnus if I'm not mistaken) was "that's stupid, who would want that", but later Dave Page was wholly on board with it.

At the next pgcon I spoke up on the same subject at the end of a talk about FDWs where Tom was in attendance, and all agreed my idea had merit... however, unexpectedly they (including Tom) agreed that trying to turn that part of our command line functionality into a library (the proper solution) was more effort than it was worth, and that if I wanted to try it I should just cut and paste the relevant code out of pg_dump and into my FDW, rather than trying to refactor and share said code in a .so. [I was *VERY* surprised by this!]

No one said it couldn't be done, but even the "wise men on the mount" conceded that it was such a huge undertaking that it was not worth the effort, and duplicating and subsequently maintaining said duplicated code was the better part of valor.
> > I believe that refactoring much of pg_dump's functionality for the
> > current version of the server into SQL-accessible functions and making
> > pg_dump use only those functions is achievable with available
> > resources.
> >
> > Such a refactor need not be all-or-nothing.  For example, the
> > dependency resolution stuff is a first step that appears to be worth
> > doing by itself even if the effort then pauses, possibly for some
> > time.
>
> If someone wanted to spend time on this, I suspect it'd be worth looking
> at how bad some of the backward compatibility issues would be if done in
> the server. Maybe they wouldn't be that bad. I suspect the audience for
> this code would be much larger if it was in the server as opposed to a C
> library.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Jon Erdman (aka StuckMojo)   PostgreSQL Zealot

Re: [HACKERS] Logical replication and multimaster

From
Jon Erdman
Date:
Well,

Given my earlier message about Apple wanting to pay me for writing 
patches now, maybe I can revisit this idea.

For background: I brought up the idea of an FDW that could read custom 
dump files and expose them as tables so that you could grab just a 
single record (or of course more) from a custom dump file without having 
to load the whole thing up, if you were stuck reaching into a backup to 
get at accidentally deleted tables, rows, etc.. The stopper, which was 
communicated to me by Tom at the following pgcon was that the code for 
parsing custom dumps is duplicated in pg_dump only, and would need to be 
duplicated into the server for the FDW, or broken out into a library.

And for posterity, Dave Page said that was a stupid idea, while Magnus 
said that it sounded useful. And IIRC Bruce and Robert H said it was 
doable, just a good deal of work on the refactor needed.

This convo went down at the Amsterdam conf where I spoke about using 
writeable LVM snapshots to expose each developer a copy of prod to 
noodle on, without having to actually make a full copy for each dev.

Added trivia: I gave the talk with a can of Heineken in my hand at the 
podium, and my lightning talk had the word F(&king Cool in the title ;)

That was also when I bought my plush Slony which was supposedly the very 
last one. (turns out they made more)
--
Jon Erdman (aka StuckMojo)
     PostgreSQL Zealot

On 12/15/15 9:48 PM, Jim Nasby wrote:
> On 12/13/15 7:37 AM, David Fetter wrote:
>> As I understand it, pushing these into a library has been proposed but
>> not rejected.  That it hasn't happened yet is mostly about the lack of
>> tuits (the round ones) to rewrite the functionality as libraries and
>> refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
>> less about writing the code and more about the enormous amount of
>> testing any such a refactor would entail.
> 
> My understanding as well. IIRC Jon Erdman brought this question up a 
> couple years ago and the response was "It'd probably be accepted, it's 
> just that no one has done the work."
> 
>> I believe that refactoring much of pg_dump's functionality for the
>> current version of the server into SQL-accessible functions and making
>> pg_dump use only those functions is achievable with available
>> resources.
>>
>> Such a refactor need not be all-or-nothing.  For example, the
>> dependency resolution stuff is a first step that appears to be worth
>> doing by itself even if the effort then pauses, possibly for some
>> time.
> 
> If someone wanted to spend time on this, I suspect it'd be worth looking 
> at how bad some of the backward compatibility issues would be if done in 
> the server. Maybe they wouldn't be that bad. I suspect the audience for 
> this code would be much larger if it was in the server as opposed to a C 
> library.




Re: [HACKERS] Logical replication and multimaster

From
Jon Erdman
Date:
Oops,

Looking at my old message from 2015 on this subject, it was Magnus who 
pish-poshed it, and Page who liked it. Don't want to throw shade at the 
wrong person ;)
--
Jon Erdman (aka StuckMojo)
     PostgreSQL Zealot

On 10/5/23 12:14 AM, Jon Erdman wrote:
> 
> Well,
> 
> Given my earlier message about Apple wanting to pay me for writing 
> patches now, maybe I can revisit this idea.
> 
> For background: I brought up the idea of an FDW that could read custom 
> dump files and expose them as tables so that you could grab just a 
> single record (or of course more) from a custom dump file without having 
> to load the whole thing up, if you were stuck reaching into a backup to 
> get at accidentally deleted tables, rows, etc.. The stopper, which was 
> communicated to me by Tom at the following pgcon was that the code for 
> parsing custom dumps is duplicated in pg_dump only, and would need to be 
> duplicated into the server for the FDW, or broken out into a library.
> 
> And for posterity, Dave Page said that was a stupid idea, while Magnus 
> said that it sounded useful. And IIRC Bruce and Robert H said it was 
> doable, just a good deal of work on the refactor needed.
> 
> This convo went down at the Amsterdam conf where I spoke about using 
> writeable LVM snapshots to expose each developer a copy of prod to 
> noodle on, without having to actually make a full copy for each dev.
> 
> Added trivia: I gave the talk with a can of Heineken in my hand at the 
> podium, and my lightning talk had the work F(&king Cool in the title ;)
> 
> That was also when I bought my plush Slony which was supposedly the very 
> last one. (turns out they made more)
> -- 
> Jon Erdman (aka StuckMojo)
>      PostgreSQL Zealot
> 
> On 12/15/15 9:48 PM, Jim Nasby wrote:
>> On 12/13/15 7:37 AM, David Fetter wrote:
>>> As I understand it, pushing these into a library has been proposed but
>>> not rejected.  That it hasn't happened yet is mostly about the lack of
>>> tuits (the round ones) to rewrite the functionality as libraries and
>>> refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
>>> less about writing the code and more about the enormous amount of
>>> testing any such a refactor would entail.
>>
>> My understanding as well. IIRC Jon Erdman brought this question up a 
>> couple years ago and the response was "It'd probably be accepted, it's 
>> just that no one has done the work."
>>
>>> I believe that refactoring much of pg_dump's functionality for the
>>> current version of the server into SQL-accessible functions and making
>>> pg_dump use only those functions is achievable with available
>>> resources.
>>>
>>> Such a refactor need not be all-or-nothing.  For example, the
>>> dependency resolution stuff is a first step that appears to be worth
>>> doing by itself even if the effort then pauses, possibly for some
>>> time.
>>
>> If someone wanted to spend time on this, I suspect it'd be worth 
>> looking at how bad some of the backward compatibility issues would be 
>> if done in the server. Maybe they wouldn't be that bad. I suspect the 
>> audience for this code would be much larger if it was in the server as 
>> opposed to a C library.