Thread: Some questions about mammoth replication

Some questions about mammoth replication

From
Hannu Krosing
Date:
> 
> btw, can you publicly discuss how Command Prompt's WAL-based
> replication works?

It's my company, of course I am ;)... but not in this thread. If you
are interested feel free to email me directly or start a new thread.

Good :)

Here come my questions:

From looking at http://www.commandprompt.com/images/MR_components.jpg it
seems that you don't do replication just from WAL logs, but also collect
some extra info inside the PostgreSQL server. Is this so?

If it is, then in what way does it differ from simple trigger-based
change logging?

Do you make use of snapshot data to determine which parts of the WAL
are worth migrating to slaves, or do you just apply everything in the WAL
in separate transactions and abort if you find out that the original
transaction aborted?

Are your slaves a) standby, b) read-only, or c) read-write?

Do you extract/generate full SQL DML queries from the data in the WAL,
or do you apply the changes at some lower level?

For what use cases do you think your WAL-based approach is better than
the Slony/Skytools trigger-based one?

--------------
Hannu





Re: Some questions about mammoth replication

From
Alexey Klyukin
Date:
Hello,

Hannu Krosing wrote:
> 
> Here come my questions:
> 
> From looking at http://www.commandprompt.com/images/MR_components.jpg it
> seems that you don't do replication just from WAL logs, but also collect
> some extra info inside the PostgreSQL server. Is this so?
>
> If it is, then in what way does it differ from simple trigger-based
> change logging?

We have hooks in the executor calling our own collecting functions, so we
don't need the trigger machinery to launch replication.
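
To give an idea of the shape (a hypothetical sketch, not the actual
Replicator code: the ExecutorRun_hook API shown here only appeared in
later stock PostgreSQL releases, and replicate_collect() is an invented
name):

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

extern void replicate_collect(QueryDesc *queryDesc);   /* hypothetical */

static ExecutorRun_hook_type prev_ExecutorRun = NULL;

static void
collector_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
                      uint64 count, bool execute_once)
{
    /* run the query as usual */
    if (prev_ExecutorRun)
        prev_ExecutorRun(queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);

    /* afterwards, hand the modified tuples to the collector */
    if (queryDesc->operation == CMD_INSERT ||
        queryDesc->operation == CMD_UPDATE ||
        queryDesc->operation == CMD_DELETE)
        replicate_collect(queryDesc);
}

void
_PG_init(void)
{
    prev_ExecutorRun = ExecutorRun_hook;
    ExecutorRun_hook = collector_ExecutorRun;
}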

> Do you make use of snapshot data to determine which parts of the WAL
> are worth migrating to slaves, or do you just apply everything in the WAL
> in separate transactions and abort if you find out that the original
> transaction aborted?

We check if a data transaction is recorded in WAL before sending
it to a slave. For an aborted transaction we just discard all the data
collected from that transaction.
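
In backend terms the shipping decision is roughly this (a sketch;
queue_send() and discard_file() are hypothetical helpers, while
TransactionIdDidCommit()/TransactionIdDidAbort() are the standard
backend calls that consult the commit status recorded at commit time):

#include "postgres.h"
#include "access/transam.h"

extern void queue_send(const char *path);      /* hypothetical */
extern void discard_file(const char *path);    /* hypothetical */

static void
maybe_ship(TransactionId xid, const char *txn_file)
{
    if (TransactionIdDidCommit(xid))
        queue_send(txn_file);       /* the commit made it to WAL: ship it */
    else if (TransactionIdDidAbort(xid))
        discard_file(txn_file);     /* aborted: throw the collected data away */
    /* otherwise the transaction is still in progress; check again later */
}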

> 
> Are your slaves a) standby, b) read-only, or c) read-write?

Replicated relations are read-only on slaves.

> 
> Do you extract/generate full SQL DML queries from the data in the WAL,
> or do you apply the changes at some lower level?

We replicate the binary data along with a command type. Only the data
necessary to replay the command on a slave is replicated.
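
Schematically, a replicated record would look something like this (a
purely illustrative layout, not the actual Replicator wire format):

/* Illustrative only, not the real format. */
#include "postgres.h"

typedef enum ReplCmdType
{
    RCMD_INSERT,
    RCMD_UPDATE,
    RCMD_DELETE
} ReplCmdType;

typedef struct ReplRecord
{
    uint32      length;     /* total length of this record */
    ReplCmdType cmd;        /* which command to replay on the slave */
    Oid         relid;      /* target relation */
    /* followed by binary datums: key columns, then changed columns */
} ReplRecord;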

> 
> For what use cases do you think your WAL-based approach is better than
> the Slony/Skytools trigger-based one?

A pure trigger-based approach can only replicate data for the commands
which fire triggers. AFAIK Slony is unable to replicate the TRUNCATE
command (I don't know if Skytools can). Replicator doesn't have this
limitation.

Regards,
-- 
Alexey Klyukin                         http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: Some questions about mammoth replication

From
Andreas Pflug
Date:
Alexey Klyukin wrote:
>
>   
>> For what use cases do you think your WAL-based approach is better than
>> the Slony/Skytools trigger-based one?
>
> A pure trigger-based approach can only replicate data for the commands
> which fire triggers. AFAIK Slony is unable to replicate the TRUNCATE
> command
It could be wrapped with ddl_script, which obviously isn't transparent to
the application, but I guess a table that's truncated regularly won't be
a typical candidate for (async) replication either.

Regards,
Andreas



Re: Some questions about mammoth replication

From
"Marko Kreen"
Date:
On 10/11/07, Alexey Klyukin <alexk@commandprompt.com> wrote:
> Hannu Krosing wrote:
> > For what use cases do you think your WAL-based approach is better than
> > the Slony/Skytools trigger-based one?
>
> A pure trigger-based approach can only replicate data for the commands
> which fire triggers. AFAIK Slony is unable to replicate the TRUNCATE
> command (I don't know if Skytools can). Replicator doesn't have this
> limitation.

No, Skytools is the same as Slony.

Can you also replicate changes to system tables?

-- 
marko


Re: Some questions about mammoth replication

From
Alexey Klyukin
Date:
Marko Kreen wrote:
> On 10/11/07, Alexey Klyukin <alexk@commandprompt.com> wrote:
> > Hannu Krosing wrote:
> > > For what use cases do you think your WAL-based approach is better than
> > > the Slony/Skytools trigger-based one?
> >
> > A pure trigger-based approach can only replicate data for the commands
> > which fire triggers. AFAIK Slony is unable to replicate the TRUNCATE
> > command (I don't know if Skytools can). Replicator doesn't have this
> > limitation.
> 
> No, Skytools is the same as Slony.
> 
> Can you also replicate changes to system tables?

No, we need a table to have a primary key to be able to replicate it.
On the other hand, replicating a system relation can be dangerous,
e.g. what if you replicate the contents of pg_class without the
corresponding relations on the slave? That's why we explicitly forbid
enabling replication for relations from the pg_catalog namespace.
 

Regards,
-- 
Alexey Klyukin                         http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: Some questions about mammoth replication

From
"Joshua D. Drake"
Date:
On Thu, 11 Oct 2007 19:10:18 +0300
Alexey Klyukin <alexk@commandprompt.com> wrote:

> Marko Kreen wrote:
> > On 10/11/07, Alexey Klyukin <alexk@commandprompt.com> wrote:
> > > Hannu Krosing wrote:
> > > > For what use cases do you think your WAL-based approach is
> > > > better than the Slony/Skytools trigger-based one?
> > >
> > > A pure trigger-based approach can only replicate data for the
> > > commands which fire triggers. AFAIK Slony is unable to replicate
> > > the TRUNCATE command (I don't know if Skytools can). Replicator
> > > doesn't have this limitation.
> >
> > No, Skytools is the same as Slony.
> >
> > Can you also replicate changes to system tables?

>
> No, we need a table to have a primary key to be able to replicate it.
> On the other hand, replicating a system relation can be dangerous,
> e.g. what if you replicate the contents of pg_class without the
> corresponding relations on the slave? That's why we explicitly forbid
> enabling replication for relations from the pg_catalog namespace.

We can, however, replicate large objects and specific system commands
like GRANT/REVOKE and CREATE/DROP USER, the latter being done via
triggers.

Joshua D. Drake

>
> Regards,


--
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/        UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: Some questions about mammoth replication

From
Hannu Krosing
Date:
On a fine day, Thu, 2007-10-11 at 18:25, Alexey Klyukin wrote:
> Hello,
> 
> Hannu Krosing wrote:
> > 
> > Here come my questions:
> > 
> > From looking at http://www.commandprompt.com/images/MR_components.jpg it
> > seems that you don't do replication just from WAL logs, but also collect
> > some extra info inside the PostgreSQL server. Is this so?
> >
> > If it is, then in what way does it differ from simple trigger-based
> > change logging?
> 
> We have hooks in the executor calling our own collecting functions, so we
> don't need the trigger machinery to launch replication.

But where do you store the collected info - in your own replication_log
table, or do you reuse data in the WAL, extracting it on the master before
replication to the slave (or on the slave after moving the WAL)?

> > Do you make use of snapshot data to determine which parts of the WAL
> > are worth migrating to slaves, or do you just apply everything in the WAL
> > in separate transactions and abort if you find out that the original
> > transaction aborted?
> 
> We check if a data transaction is recorded in WAL before sending
> it to a slave. For an aborted transaction we just discard all the data
> collected from that transaction.

Do you duplicate PostgreSQL's MVCC code for that, or does this happen
automatically by using MVCC itself for the collected data?

How do you handle really large inserts/updates/deletes, which change,
say, 10M rows in one transaction?

> > Are your slaves a) standby, b) read-only, or c) read-write?
> 
> Replicated relations are read-only on slaves.
> 
> > 
> > Do you extract/generate full SQL DML queries from the data in the WAL,
> > or do you apply the changes at some lower level?
> 
> We replicate the binary data along with a command type. Only the data
> necessary to replay the command on a slave is replicated.

Do you replay it as SQL insert/update/delete commands, or directly on
heap/indexes?

------------
Hannu








Re: Some questions about mammoth replication

From
"Joshua D. Drake"
Date:
On Thu, 11 Oct 2007 21:58:45 +0300
Hannu Krosing <hannu@skype.net> wrote:


> > We have hooks in the executor calling our own collecting functions,
> > so we don't need the trigger machinery to launch replication.
>
> But where do you store the collected info - in your own
> replication_log table,

No, we have our own transaction log outside the database.

> or do you reuse data in the WAL, extracting it on the
> master before replication to the slave (or on the slave after moving the WAL)?
>

We currently make zero use of WAL except to verify that the replication
data can be written.

> > > Do you make use of snapshot data to determine which parts of the
> > > WAL are worth migrating to slaves, or do you just apply
> > > everything in the WAL in separate transactions and abort if you
> > > find out that the original transaction aborted?
> >
> > We check if a data transaction is recorded in WAL before sending
> > it to a slave. For an aborted transaction we just discard all the data
> > collected from that transaction.
>
> Do you duplicate PostgreSQL's MVCC code for that, or does this happen
> automatically by using MVCC itself for the collected data?
>

Gonna have to wait for Alexey for the rest.

Joshua D. Drake





--
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/        UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: Some questions about mammoth replication

From
Alexey Klyukin
Date:
Hannu Krosing wrote:

> > We have hooks in the executor calling our own collecting functions, so we
> > don't need the trigger machinery to launch replication.
> 
> But where do you store the collected info - in your own replication_log
> table, or do you reuse data in the WAL, extracting it on the master before
> replication to the slave (or on the slave after moving the WAL)?

We use neither a log table in the database nor the WAL. The data to
replicate is stored in disk files, one per transaction. As Joshua said,
the WAL is used to ensure that only those transactions that are recorded
as committed in WAL are sent to slaves.
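
To make "one file per transaction" concrete, the collection side is
schematically (hypothetical paths and names, not the real code):

#include "postgres.h"
#include <stdio.h>

static FILE *txn_spool = NULL;

/* Each backend opens a private spool file when its transaction first
 * touches a replicated relation... */
static void
spool_open(TransactionId xid)
{
    char        path[MAXPGPATH];

    snprintf(path, sizeof(path), "replicator_spool/txn-%u", xid);
    txn_spool = fopen(path, "wb");
}

/* ...and appends one binary record per replicated change. */
static void
spool_write(const void *rec, size_t len)
{
    fwrite(rec, 1, len, txn_spool);
}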

> 
> > > Do you make use of snapshot data to determine which parts of the WAL
> > > are worth migrating to slaves, or do you just apply everything in the WAL
> > > in separate transactions and abort if you find out that the original
> > > transaction aborted?
> > 
> > We check if a data transaction is recorded in WAL before sending
> > it to a slave. For an aborted transaction we just discard all the data
> > collected from that transaction.
> 
> Do you duplicate PostgreSQL's MVCC code for that, or does this happen
> automatically by using MVCC itself for the collected data?

Every transaction command that changes data in a replicated relation is
stored on disk. The PostgreSQL MVCC code is used on a slave in the natural
way when the transaction commands are replayed there.

> 
> How do you handle really large inserts/updates/deletes, which change,
> say, 10M rows in one transaction?

We produce really large disk files ;). When a transaction commits, a
special queue lock is acquired and the transaction is enqueued into the
sending queue. Since the locking mode for that lock is exclusive, the
commit of a very large transaction delays commits of other transactions
until the lock is released. We are working on minimizing the time this
lock is held in the new version of Replicator.
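
The problematic commit path is, schematically (invented names, modern
LWLock API):

#include "postgres.h"
#include "storage/lwlock.h"

extern LWLock *ReplQueueLock;                       /* hypothetical */
extern void copy_file_into_queue(const char *f);    /* hypothetical */

static void
enqueue_on_commit(const char *txn_file)
{
    /* The whole transaction file is copied while the exclusive lock is
     * held, so a 10M-row transaction stalls every other committing
     * backend for the duration of the copy. */
    LWLockAcquire(ReplQueueLock, LW_EXCLUSIVE);
    copy_file_into_queue(txn_file);
    LWLockRelease(ReplQueueLock);
}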

> > > Do you extract/generate full SQL DML queries from the data in the WAL,
> > > or do you apply the changes at some lower level?
> > 
> > We replicate the binary data along with a command type. Only the data
> > necessary to replay the command on a slave is replicated.
> 
> Do you replay it as SQL insert/update/delete commands, or directly on
> heap/indexes?

We replay the commands directly using heap/index functions on a slave.
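
For instance, an insert is replayed roughly like this (a simplified
sketch using the 8.x-era backend heap API; error handling and index
maintenance are omitted):

#include "postgres.h"
#include "access/heapam.h"
#include "utils/rel.h"

static void
replay_insert(Oid relid, HeapTuple tup)
{
    Relation    rel = heap_open(relid, RowExclusiveLock);

    /* write the tuple straight to the heap, bypassing SQL */
    simple_heap_insert(rel, tup);
    /* the matching index entries must be inserted as well, e.g. via
     * ExecInsertIndexTuples(); omitted here for brevity */
    heap_close(rel, RowExclusiveLock);
}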

Regards,

-- 
Alexey Klyukin                         http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: Some questions about mammoth replication

From
Hannu Krosing
Date:
On a fine day, Fri, 2007-10-12 at 12:39, Alexey Klyukin wrote:
> Hannu Krosing wrote:
> 
> > > We have hooks in the executor calling our own collecting functions, so we
> > > don't need the trigger machinery to launch replication.
> > 
> > But where do you store the collected info - in your own replication_log
> > table, or do you reuse data in the WAL, extracting it on the master before
> > replication to the slave (or on the slave after moving the WAL)?
> 
> We use neither a log table in the database nor the WAL. The data to
> replicate is stored in disk files, one per transaction.

Clever :)

How well does it scale? That is, at what transaction rate can your
replication keep up with the database?

>  As Joshua said,
> the WAL is used to ensure that only those transactions that are recorded
> as committed in WAL are sent to slaves.

How do you force the correct commit order when applying the transactions?

> > 
> > > > Do you make use of snapshot data to determine which parts of the WAL
> > > > are worth migrating to slaves, or do you just apply everything in the WAL
> > > > in separate transactions and abort if you find out that the original
> > > > transaction aborted?
> > > 
> > > We check if a data transaction is recorded in WAL before sending
> > > it to a slave. For an aborted transaction we just discard all the data
> > > collected from that transaction.
> > 
> > Do you duplicate PostgreSQL's MVCC code for that, or does this happen
> > automatically by using MVCC itself for the collected data?
> 
> Every transaction command that changes data in a replicated relation is
> stored on disk. The PostgreSQL MVCC code is used on a slave in the natural
> way when the transaction commands are replayed there.

Do you replay several transaction files in the same transaction on the
slave?

Can you replay several transaction files in parallel?

> > How do you handle really large inserts/updates/deletes, which change,
> > say, 10M rows in one transaction?
> 
> > We produce really large disk files ;). When a transaction commits, a
> > special queue lock is acquired and the transaction is enqueued into the
> > sending queue. Since the locking mode for that lock is exclusive, the
> > commit of a very large transaction delays commits of other transactions
> > until the lock is released. We are working on minimizing the time this
> > lock is held in the new version of Replicator.

Why does it take longer to queue a large file? Do you copy data from
one file to another?

> > > > Do you extract/generate full SQL DML queries from the data in the WAL,
> > > > or do you apply the changes at some lower level?
> > > 
> > > We replicate the binary data along with a command type. Only the data
> > > necessary to replay the command on a slave is replicated.
> > 
> > Do you replay it as SQL insert/update/delete commands, or directly on
> > heap/indexes?
> 
> We replay the commands directly using heap/index functions on a slave.

Does that mean that the table structures will be exactly the same on
both master and slave? That is, do you replicate a physical table image
(maybe not including transaction ids on the master)?

Or do you just use lower-level versions of INSERT/UPDATE/DELETE?

---------------------
Hannu





Re: Some questions about mammoth replication

From
Alexey Klyukin
Date:
Hannu Krosing wrote:

> > We use neither a log table in the database nor the WAL. The data to
> > replicate is stored in disk files, one per transaction.
> 
> Clever :)
> 
> How well does it scale? That is, at what transaction rate can your
> replication keep up with the database?

This depends on the number of concurrent transactions (the data is
collected by every backend process), the maximum transaction size, etc.
I don't have numbers here, sorry.

> 
> >  As Joshua said,
> > the WAL is used to ensure that only those transactions that are recorded
> > as committed in WAL are sent to slaves.
> 
> How do you force the correct commit order when applying the transactions?

The first transaction that is committed in PostgreSQL is the first
transaction placed into the queue, and the first that is restored by the
slave.
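
Schematically (hypothetical names): since transactions enter the queue
in commit order, a single consumer draining the queue head-first
replays them in that same order:

/* Sketch only: one consumer, strict FIFO, so the apply order on the
 * slave matches the commit order on the master. */
struct ReplTxn;

extern struct ReplTxn *queue_dequeue_head(void);    /* blocks until ready */
extern void replay_transaction(struct ReplTxn *txn);

static void
restore_loop(void)
{
    for (;;)
        replay_transaction(queue_dequeue_head());
}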

> 
> > > 
> > > > > Do you make use of snapshot data to determine which parts of the WAL
> > > > > are worth migrating to slaves, or do you just apply everything in the WAL
> > > > > in separate transactions and abort if you find out that the original
> > > > > transaction aborted?
> > > > 
> > > > We check if a data transaction is recorded in WAL before sending
> > > > it to a slave. For an aborted transaction we just discard all the data
> > > > collected from that transaction.
> > > 
> > > Do you duplicate PostgreSQL's MVCC code for that, or does this happen
> > > automatically by using MVCC itself for the collected data?
> > 
> > Every transaction command that changes data in a replicated relation is
> > stored on disk. The PostgreSQL MVCC code is used on a slave in the natural
> > way when the transaction commands are replayed there.
> 
> Do you replay several transaction files in the same transaction on the
> slave?

> Can you replay several transaction files in parallel?

No, we have plans for concurrent restore of replicated data, but
currently we have a single slave process responsible for restoring the
data received from the MCP.

> 
> > > How do you handle really large inserts/updates/deletes, which change,
> > > say, 10M rows in one transaction?
> > 
> > We produce really large disk files ;). When a transaction commits, a
> > special queue lock is acquired and the transaction is enqueued into the
> > sending queue. Since the locking mode for that lock is exclusive, the
> > commit of a very large transaction delays commits of other transactions
> > until the lock is released. We are working on minimizing the time this
> > lock is held in the new version of Replicator.
> 
> Why does it take longer to queue a large file? Do you copy data from
> one file to another?

Yes, currently the data is copied from the transaction files into the
queue (this doesn't apply to dump transactions).

However, we have recently changed this: the new code will acquire the
queue lock only to record the transaction as committed in the replication
log, without moving the data.
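
I.e. instead of copying the data under the lock, the new commit path
becomes, schematically (again invented names):

#include "postgres.h"
#include "storage/lwlock.h"

extern LWLock *ReplQueueLock;                          /* hypothetical */
extern void append_commit_marker(TransactionId xid);   /* hypothetical */

static void
enqueue_on_commit(TransactionId xid)
{
    /* only a small "committed" marker is written while the lock is
     * held; the bulk data stays in the per-transaction file */
    LWLockAcquire(ReplQueueLock, LW_EXCLUSIVE);
    append_commit_marker(xid);      /* O(1) under the lock */
    LWLockRelease(ReplQueueLock);
}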

> > > Do you replay it as SQL insert/update/delete commands, or directly on
> > > heap/indexes?
> > 
> > We replay the commands directly using heap/index functions on a slave.
> 
> Does that mean that the table structures will be exactly the same on
> both master and slave?

Yes, the table structure on the slaves should match the table structure
on the master.

> That is, do you replicate a physical table image
> (maybe not including transaction ids on the master)?

Yes, we call this a 'full dump', and it is fired automatically for every
replicated table. We replicate only data, however, not the DDL commands
to create/alter tables or sequences.

> 
> Or do you just use lower-level versions of INSERT/UPDATE/DELETE?
> 
> ---------------------
> Hannu
> 
> 
> 

Regards,
-- 
Alexey Klyukin                         http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.