draft RFC: concept for partial, wal-based replication

From: Hans-Juergen Schoenig -- PostgreSQL

hello ...

as my day has worked out quite nicely up to now i thought to f... it up 
and post a new concept which has been requested by a customer. the goal 
is to collect some feedback, ideas and so on (not to be mixed up with 
"flames").
we have funding for this and we are trying to sort out how to do it the 
best way. comments are welcome ...
note, this is a first draft i want to refine based on some comments.
here we go ...


Partial WAL Replication for PostgreSQL:
---------------------------------------

As of now the PostgreSQL community has provided patches and functionality
which allow full WAL-based replication as well as hot standby. To extend this
functionality and to make PostgreSQL even more suitable for "enterprise"
computing than it is today, we have the commitment of a sponsor to fund
partial replication for PostgreSQL 8.5 / 8.6.

This is the first draft of a proposal to make partial WAL-based replication
work and to provide an additional set of fancy features to a community which
has been waiting for real in-core replication for a decade or more.


Why partial replication?
------------------------

In some cases people have master servers which contain enormous amounts of
data (XX TB or so). If more than just one replica of this data is needed, it
might happen that different slaves are used for different purposes. This
implies that not all data will be used by all machines.
An example: consider a server at a phone company collecting phone calls,
billing data, and maybe network routing data. The data is used by different
departments, and one machine is not enough to serve all three. With the new
functionality proposed here we could make 3 replicas, each holding just the
group of tables needed for its specific tasks, thus allowing people to buy
cheaper hardware for the slaves and use more machines instead.


Current status:
---------------

Hot standby and streaming replication have been a huge leap forward for the
community, and what is proposed here will be an extension to those patches
and functionalities. This concept is NOT aimed at replacing anything - it is
mainly an addon.


Nodes and replication filters:
------------------------------

As of 8.4, standby systems are set up by creating an archive_command along
with a base backup. Although this is easy to do, some users still reported
difficulties due to a total misunderstanding of PITR.

The idea is to add functionality for adding slaves like this:

CREATE REPLICA node_name
   CONNECT FROM SLAVE 'connect_string'
   TRANSFER COMMAND 'command'
   [ USING replication_filter ];

'command' would be any shell script copying data from the local master to the
new database node called node_name. Replication filters can be used to make X
replicas contain the same tables. Filter sets can be created like this:

CREATE REPLICATION FILTER filter_name
   [ EMPTY | FULL ] [ INCLUDE | EXCLUDE CHANGES ];

Replication filters can be modified ...

ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;
ALTER REPLICATION FILTER filter_name
   { ADD | REMOVE } { TABLE | INDEX | SEQUENCE } object;

Filter sets can be dropped like this ...

DROP REPLICATION FILTER filter_name;
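
Putting the pieces together, a hypothetical end-to-end example for the phone
company scenario above could look like this (all object names are made up,
and the exact spelling of the options is of course open for discussion):

CREATE REPLICATION FILTER billing_filter
   EMPTY INCLUDE CHANGES;
ALTER REPLICATION FILTER billing_filter ADD TABLE billing_data;
ALTER REPLICATION FILTER billing_filter ADD INDEX billing_data_pkey;

CREATE REPLICA billing_node
   CONNECT FROM SLAVE 'host=master.example.com port=5432'
   TRANSFER COMMAND '/usr/local/bin/copy_to_billing_node.sh'
   USING billing_filter;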

Internally, CREATE REPLICA would initiate a base backup to the new slave
server, just as we would do manually otherwise. The server would
automatically use the user-defined 'command' to copy one file after the other
to the slave box. The idea is basically stolen from archive_command and
friends. At this stage we either copy the entire instance, as we would with a
normal base backup, or just what is needed (defined by the replication
filter). Users would thus automatically copy only the data that is really
needed on a slave and matches its filter config. Once the copy is done, we
can register the new node inside a system table and commit the transaction.
We can also automatically create a useful recovery.conf setup - we know how
to connect from the slave to the master (we can use 'CONNECT FROM SLAVE
[ USING ]' to write a proper recovery.conf file).
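
The generated recovery.conf could then look roughly like this (just a sketch;
the exact parameter names depend on the streaming replication patch, and the
connection string would be derived from CONNECT FROM SLAVE):

standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432'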

Tables can easily be added or removed from a replication filter with ALTER
REPLICATION FILTER.

Replicas can be removed easily:

DROP REPLICA node_name;

Why SQL to add a node? We are convinced that this is the simplest and most
intuitive way of doing things. We believe it gives users a real feeling of
simplicity. The current way of doing base backups should stay in place as it
is - it has proven useful for countless tasks. However, it is not suitable
for managing 10 or more replicas easily, especially not when they are not
full-blown copies of the master.


Technical ideas:
----------------

System tables:

We suggest always replicating the entire system catalog. It would be a total
disaster to try some other implementation. The same applies to other tables -
we always replicate entire tables; no WHERE clauses are allowed when it comes
to replicating any table.
How can a query on the slave figure out if a table is around? The slave just
has to know "who it is". Then it can easily look up in the replication filter
it is using whether a table is actually physically in place or not. If a
table is not in place, we can easily error out.
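
For example (the exact error message is of course open for debate, and the
relation name is made up):

-- on a slave whose replication filter does not include billing_data:
SELECT count(*) FROM billing_data;
ERROR:  relation "billing_data" is not replicated to this node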


Remove a table from the slave:

This is not too hard: the master receives the command to remove a table from
the slave. We will send a request to remove all storage files related to the
table and adjust the replication filter to make sure that the slave will not
replay content of this table anymore.


Add a table to a slave:

This is slightly more tricky. We start collecting WAL for the table, stop
shipping WAL, use the TRANSFER COMMAND to copy the files related to the newly
added table, and resume recovery / sending once the storage files are on the
slave.


Additional stuff:

Of course there are many more consistency considerations here. We cannot
replicate an index if the table is not present, etc.

   many thanks,
      hans


-- 
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: draft RFC: concept for partial, wal-based replication

From: Boszormenyi Zoltan

Hi,

we tried to discuss on a lower level what should be needed
for a partial replication based on streaming replication.

a. transferring base data after a relation/index/... got added to a slave
   (and initial setup)
b. communicating to the slaves which relations they currently should
   have available
c. splitting wal into different replication sets
d. some configuration frontend, possibly directly via SQL or via functions

I don't think it's reasonable to try to discuss and implement this all
in one huge patch, so I propose implementing at least a) as a separate
patch.

a) is very useful outside the context of this specific feature and is
kind of a prerequisite, so I suggest tackling that first.

Plan:
M: checkpoint, full page writes, access share lock on the relation
S: stop receiving wal
SM: Using $method transfer required base data for every required segment/fork
M: normal writes
S: restart wal replay

Do you see any fundamental problem with this?
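
As a minimal sketch, the sequence could be driven from the slave via
function calls along these lines (every function name here is a hypothetical
placeholder, not an existing API):

-- S: pause WAL replay before requesting base data
SELECT pg_pause_wal_replay();
-- M (triggered from S): checkpoint, force full page writes, take an
-- ACCESS SHARE lock, and transfer every segment/fork of the relation
SELECT pg_transfer_base_data('some_table'::regclass);
-- S: resume WAL replay once the base data has arrived
SELECT pg_resume_wal_replay();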

Now there unfortunately are two different schools of thought on how to
implement transferring base data:

1. an archive_command-like transfer command taking a source path/host and
   target path/host
   - very flexible (think e.g. not transferring the data for multiple slaves
     over the whole country)
   - harder to set up
   - more in the style of classical WAL archiving
2. add the capability to the WAL Streaming patch's libpq-based protocol
   - no additional configuration needed
   - inflexible
   - makes usage from non-streaming replication impossible

I favor 1, but only lightly so.
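
To make variant 1 concrete, the configuration could look much like
archive_command does today, e.g. (hypothetical GUC name, with the %p/%f
placeholders borrowed from archive_command):

transfer_command = 'rsync -a %p slave1.example.com:/var/lib/pgsql/data/%f'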

Detail Questions:
- How to deal with multiple transfer requests at the same time?
  There would be a need for multiple full backup requests for individual
  tables by several clients at once. Currently pg_start_backup() isn't
  allowed from two clients in parallel; the second one gets an error. We
  thought that pg_start_backup() and pg_stop_backup() can turn into simple
  reference counts. IIRC, WALs are still generated and _shipped to slaves_
  during a full backup; they are simply not yet applied to base table files.
  So, in this case a pg_stop_backup() issued from a slave decreases the
  refcount of the base backups and the slave can simply resume applying its
  newly received WALs to base files.
- Keep track of the current number of transfers
  We would also need a way to query the refcount of the base backup, so if a
  slave dies, the master can be recovered manually, and it can also resume
  applying leftover WALs.

b) The slave needs to know whether a relation got added to itself in
order to request a base backup of the relevant files.

I would suggest adding new WAL record types for this:
- DROP_RELATION_SLAVE(node_id, relation)
- ADD_RELATION_SLAVE(node_id, relation)

We thought about two ways of administering the replication set:
- slaves with full replication, optionally and explicitly excluding
  relations
- slaves with minimal replication, with explicitly included relations
The above WAL record types will be needed in both cases, but we also need
two new catalog tables:
- the slave nodes, indicating the type of the slave as above
- explicit table indications (treated depending on the type of the slave)

Any out-of-band communication has severe problems with crashes or
unavailability of the slave, or with not allowing classic, non-streaming
WAL replication.

Questions:
- How to deal with the fact that the slave may be unavailable while something
  is being added to its replication set?
  - Possibly forbid all DDL on the table until the slave got the update

c)
What to filter:
Every slave gets a node_id which is assigned in a system catalog on the
master:
  pg_node(nodeoid) (per cluster)
And a catalog contains all replicated relations:
  pg_replication_set(nodeoid, classoid, acknowledged_on_slave) (per database)
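
In SQL terms, the two catalogs might look roughly like this (a sketch only;
everything beyond the column names given above is made up):

-- per cluster: one row per slave node
CREATE TABLE pg_node (
    nodeoid  oid  PRIMARY KEY,
    nodename name NOT NULL
);

-- per database: which relations are replicated to which node
CREATE TABLE pg_replication_set (
    nodeoid               oid NOT NULL,     -- references pg_node
    classoid              oid NOT NULL,     -- pg_class OID of the relation
    acknowledged_on_slave boolean NOT NULL DEFAULT false
);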

How to filter:
Heap2, Heap, Btree, Gin, Gist, Storage, and Sequence records need to be
filtered by database/relation. I am by far not yet familiar enough with the
relevant code to see whether it is feasible and worthwhile at all to filter
clog, transaction and multixact records per database.

Where to filter:
I propose doing so in the walsender. While this would prohibit using a
classical WAL-based standby, I do not see a big problem in that.

If done via WAL streaming, it would be a simple addition of a node_id in
PQstartXLogStreaming. This id obviously should not be resettable to
something else...

Questions:
- How to deal with access to the different database-wide catalogs?
  - Storing that data cluster-wide seems really ugly.
  - Read the code...

d) I do not have any strong or even moderate opinions about this. I think
it's sensible to get something prototypish and function-based done before
deciding about the real interface and getting into syntax wars.


Steps:

1:
- Transfer of relations during runtime
  - needed to use WAL splitting
  - internally:
    - Possibility 1:
      - transfer_command = ... %filename%
      - should not error out if data changes beneath it
      - called for every file (i.e. every fork and segment)
      - how to deal with new segments?
        - lock out writes after transferring the last segment and recheck?
    - Possibility 2:
      - own networking protocol
      - easier to set up
      - less flexible
      - more work
  - take a heavy lock from the beginning
    - sounds annoying but could easily be dealt with later
  - How to transfer:
    - stop WAL replay on the requesting side
    - checkpoint on master
    - start streaming WAL for the relation
      - pg_start_backup() equivalent
    - transfer relation data
    - pg_stop_backup() equivalent
    - restart WAL replay on the relevant nodes

2:
- WAL splitting for Heap2, Heap, Btree, Hash, Gin, Gist
  - all others either have to be replicated anyway or are semantically a bit
    more complex (sequences)
  - the lower level does not care about dependencies?
- Additional system catalogs:
  - pg_node(nodeoid)
  - pg_replication_set(nodeoid, classoid)
- Using a hook in walsender - correct place, right?
- Extending the streaming protocol to transfer the node id from the slave
  - 'slave_node_id' in PQstartXLogStreaming
- A slave errors out if a relation gets accessed which is not replicated
  to it
- There is some need for more communication for some of the actions:
  - remove table/index/... from replication set
  - add table/index/... to replication set
  - both require action from the slave: stop replaying WAL, request base
    data of *
  - both require action from the master: lock relation shared, checkpoint,
    stop WAL reuse, full_page_writes
    - could be started from the slave
  - How should master/slave communicate?
    - extra WAL type
      - crash safe
    - PQgetXLogData returns -3, user requests actions from master
      - how to handle crashes?
      - possibly an 'acknowledged' flag in pg_replication_set?
    - both are not exactly nice and race-free
  - Add to replication set, drop table before slave has caught up
    - normal locking does not work (shutdown)
    - using something around prepared transactions seems ugly
    - synchronous operation?
      - alter/drop table is only possible after acknowledgement from the
        client
  - WAL type seems nicer
    - possibly allowing to do it via "classical" WAL replay

We would not build any bells and whistles around this for now.
They would work as function calls.
Later one could build an SQL-ish interface.
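
For example, the prototype interface could consist of functions like these
(names are placeholders, just to make the idea concrete):

-- register a slave node and assign its node_id
SELECT pg_add_replication_node('billing_node');
-- add or remove a relation from that node's replication set
SELECT pg_replication_set_add('billing_node', 'billing_data'::regclass);
SELECT pg_replication_set_remove('billing_node', 'billing_data'::regclass);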





Hans-Juergen Schoenig -- PostgreSQL wrote:
> hello ...
>
> as my day has worked out quite nicely up to now i thought to f... it
> up and post a new concept which has been requested by a customer. the
> goal is to collect some feedback, ideas and so on (not to be mixed up
> with "flames").
> [...]


-- 
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics

----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/



Re: draft RFC: concept for partial, wal-based replication

From: Itagaki Takahiro

Boszormenyi Zoltan <zb@cybertec.at> wrote:

> we tried to discuss on a lower level what should be needed
> for a partial replication based on streaming replication.

We need to discuss a "partial recovery" before the partial replication.

There are some related items in our ToDo list and previous discussions:

http://wiki.postgresql.org/wiki/Todo
  - Allow WAL logging to be turned off for a table,
    but the table might be dropped or truncated during crash recovery
  - Allow WAL logging to be turned off for a table,
    but the table would avoid being truncated/dropped

- rmgr_hook
  http://archives.postgresql.org/pgsql-hackers/2008-12/msg01361.php

Could you try them first as a preparation for the partial replication?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: draft RFC: concept for partial, wal-based replication

From: Fujii Masao

On Mon, Nov 30, 2009 at 4:56 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
> I don't think it's reasonable to try to discuss and implement this all
> in one huge patch, so I propose implementing at least a) as a separate
> patch.

I agree with this development plan.

> Now there unfortunately are two different schools of thought on how to implement transferring base data:
> 1. an archive_command-like transfer command taking a source path/host and target path/host
>   - very flexible (think e.g. not transferring the data for multiple slaves over the whole country)
>   - harder to set up
>   - more in the style of classical WAL archiving
> 2. add the capability to the WAL Streaming patch's libpq-based protocol
>   - no additional configuration needed
>   - inflexible
>   - makes usage from non-streaming replication impossible
>
> I favor 1. but only lightly so.

I favor 2 ;) because I think that it's too hard for users to set up
a transfer command. One of streaming replication's merits is that
users no longer need to specify a transfer command for log-shipping.
So users can configure and use replication without complex settings.
But, #1 would spoil this merit.

> Detail Questions:
> - How to deal with multiple transfer requests at the same time?
>  There would be a need for multiple full backup requests for
>  individual tables by several clients at once.
>  Currently pg_start_backup() isn't allowed from
>  two clients in parallel, the second one gets an error.
>  We thought that pg_start_backup() and pg_stop_backup()
>  can turn into simple reference counts. IIRC, WALs
>  are still generated and _shipped to slaves_ during
>  a full backup, they are simply not yet applied to
>  base table files. So, in this case a pg_stop_backup()
>  issued from a slave decreases refcount of the base backups
>  and the slave can simply resume applying its newly
>  received WALs to base files.

I'm not sure how. But first, a multiple online-backup feature,
rather than backup-shipping itself, might have to be addressed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: draft RFC: concept for partial, wal-based replication

From: Craig Ringer

Boszormenyi Zoltan wrote:

> c. splitting wal into different replication sets

Just a side note: in addition to its use for partial replication, this
might have potential for performance-prioritizing databases or tablespaces.

Being able to separate WAL logging so that different DBs, tablespaces,
etc went to different sets of WAL logs would allow a DBA to give some
databases or tablespaces dedicated WAL logging space on faster storage.
If partial recovery is implemented, it might also permit less important
databases to be logged to fast-but-unsafe storage such as a non-BBU disk
controller with write cache enabled, without putting more important
databases in the same cluster in danger.

More importantly, if the WAL writing was done in different wal writer
backends, the admin could also use nice and ionice to encourage the OS
to favour WAL logging for some DBs over others.

Currently all these things require splitting the install into multiple
clusters, incurring config management and backup overhead and most
importantly partitioning shared memory.

OTOH, even with split WAL logging, you still have the shared bgwriter to
contend with, and the effects of an unimportant query pushing data
related to more performance-critical DBs out of shm or OS cache. So
perhaps splitting the cluster is actually the best answer, and a
complete implementation of DB prioritization would land up looking a lot
like multiple Pg clusters multiplexed on one port anyway...

In any case, I thought it worth mentioning as something that may be
worth keeping in mind - or considering and disregarding - while looking
at the WAL changes involved in partial replication.

--
Craig Ringer


Re: draft RFC: concept for partial, wal-based replication

From: Andres Freund

On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
> Boszormenyi Zoltan <zb@cybertec.at> wrote:
> > we tried to discuss on a lower level what should be needed
> > for a partial replication based on streaming replication.
> We need to discuss a "partial recovery" before the partial replication.
If you do the filtering on the sending side you don't actually need partial
recovery in the sense that you filter in the rmgr or similar.

Or do I miss something?

Andres


Re: draft RFC: concept for partial, wal-based replication

From: Itagaki Takahiro

Andres Freund <andres@anarazel.de> wrote:

> > We need to discuss a "partial recovery" before the partial replication.
> If you do the filtering on the sending side you don't actually need partial
> recovery in the sense that you filter in the rmgr or similar.
> 
> Or do I miss something?

Sorry, I didn't explain well.

I just suggested the order of development. I think partial recovery
is easier than partial replication because we don't need to think about
the network or multiple clients during recovery.

Also, this feature could be developed on top of Streaming Replication, but
that is still under development -- the code is not fixed. So, I'd suggest
starting development from the parts that are independent of Streaming
Replication.

I believe the partial replication will require modules developed
for the partial recovery in the future.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: draft RFC: concept for partial, wal-based replication

From: Stefan Kaltenbrunner

Andres Freund wrote:
> On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
>> Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>> we tried to discuss on a lower level what should be needed
>>> for a partial replication based on streaming replication.
>> We need to discuss a "partial recovery" before the partial replication.
> If you do the filtering on the sending side you don't actually need partial
> recovery in the sense that you filter in the rmgr or similar.
> 
> Or do I miss something?

the question is if filtering on the sending side is actually the "right 
thing" to do.
It increases the overhead and the complexity on the master, especially 
if you think about different (partial) replication agreements for 
different slaves and it might also be hard to integrate with the planned 
sync/async modes.
On the other hand if you filter on the master you might be able to avoid 
a lot of network traffic due to filtered wal records.
I think for a first step it might make more sense to look into doing the 
filtering on the receiving side and look into actual integration with SR 
at a later stage.


Stefan


Re: draft RFC: concept for partial, wal-based replication

From: Andres Freund

On Monday 30 November 2009 10:32:50 Stefan Kaltenbrunner wrote:
> Andres Freund wrote:
> > On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
> >> Boszormenyi Zoltan <zb@cybertec.at> wrote:
> >>> we tried to discuss on a lower level what should be needed
> >>> for a partial replication based on streaming replication.
> >>
> >> We need to discuss a "partial recovery" before the partial replication.
> >
> > If you do the filtering on the sending side you don't actually need
> > partial recovery in the sense that you filter in the rmgr or similar.
> >
> > Or do I miss something?
> 
> the question is if filtering on the sending side is actually the "right
> thing" to do.
> It increases the overhead and the complexity on the master, especially
> if you think about different (partial) replication agreements for
> different slaves and it might also be hard to integrate with the planned
> sync/async modes.
> On the other hand if you filter on the master you might be able to avoid
> a lot of network traffic due to filtered wal records.
> I think for a first step it might make more sense to look into doing the
> filtering on the receiving side and look into actual integration with SR
> at a later stage.
I think filtering on the receiving side is harder by many degrees because you
don't have an up-to-date copy of the catalog. I can't think of a design that
does not impose severe constraints on catalog and especially replication
settings when implemented on the receiving side.

Andres



Re: draft RFC: concept for partial, wal-based replication

From: Robert Haas

On Nov 30, 2009, at 1:55 AM, Craig Ringer <craig@postnewspapers.com.au> wrote:

> Boszormenyi Zoltan wrote:
>
>> c. splitting wal into different replication sets
>
> Just a side note: in addition to its use for partial replication, this
> might have potential for performance-prioritizing databases or
> tablespaces.
>
> Being able to separate WAL logging so that different DBs, tablespaces,
> etc went to different sets of WAL logs would allow a DBA to give some
> databases or tablespaces dedicated WAL logging space on faster storage.
> If partial recovery is implemented, it might also permit less important
> databases to be logged to fast-but-unsafe storage such as a non-BBU disk
> controller with write cache enabled, without putting more important
> databases in the same cluster in danger.

The danger here is that if we make crash recovery more complex, we'll
introduce subtle bugs that will only be discovered after someone's data
is toast.

...Robert


Re: draft RFC: concept for partial, wal-based replication

From: Tom Lane

Craig Ringer <craig@postnewspapers.com.au> writes:
> Just a side note: in addition to its use for partial replication, this
> might have potential for performance-prioritizing databases or tablespaces.

> Being able to separate WAL logging so that different DBs, tablespaces,
> etc went to different sets of WAL logs would allow a DBA to give some
> databases or tablespaces dedicated WAL logging space on faster storage.

I don't think this can possibly work without introducing data corruption
issues.  What happens when a transaction touches tables in different
tablespaces?  You can't apply the changes out-of-order.
        regards, tom lane


Re: draft RFC: concept for partial, wal-based replication

From: Hans-Jürgen Schönig

On Nov 30, 2009, at 10:32 AM, Stefan Kaltenbrunner wrote:

> Andres Freund wrote:
>> On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
>>> Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>>> we tried to discuss on a lower level what should be needed
>>>> for a partial replication based on streaming replication.
>>> We need to discuss a "partial recovery" before the partial
>>> replication.
>> If you do the filtering on the sending side you dont actually need
>> partial recover in the sense that you filter in the rmgr or similar.
>> Or do I miss something?
>
> the question is if filtering on the sending side is actually the
> "right thing" to do.
> It increases the overhead and the complexity on the master,
> especially if you think about different (partial) replication
> agreements for different slaves and it might also be hard to
> integrate with the planned sync/async modes.
> On the other hand if you filter on the master you might be able to
> avoid a lot of network traffic due to filtered wal records.
> I think for a first step it might make more sense to look into doing
> the filtering on the receiving side and look into actual integration
> with SR at a later stage.
>
>
> Stefan


hello ...

one problem with not-filtering on the master is that you will end up
with a lot of complexity if you start adding new tables to a replica,
because you just cannot add tables as easily as when you are doing stuff
on the slave. the procedure seems way more complex.
in addition to that you are sending WAL which has to be discarded
anyway.
we thought about filtering "outside the master" a lot but to me it did
not sound like a good plan.
regards,
    hans


--
Cybertec Schönig & Schönig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: draft RFC: concept for partial, wal-based replication

From: Hans-Jürgen Schönig

>
> Just a side note: in addition to its use for partial replication, this
> might have potential for performance-prioritizing databases or
> tablespaces.



hello ...

this is an absolute non-starter. the WAL is designed to be "hyper
ordered" and hyper critical. once you fuck up the order you will end up
with a total disaster. WAL has to be applied in perfect order, without
skipping dependent objects and so on. any concept which tries to get
around those fundamental laws is simply broken.
hans


--
Cybertec Schönig & Schönig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: draft RFC: concept for partial, wal-based replication

From: Andres Freund

On Monday 30 November 2009 17:46:45 Hans-Jürgen Schönig wrote:
> On Nov 30, 2009, at 10:32 AM, Stefan Kaltenbrunner wrote:
> > the question is if filtering on the sending side is actually the
> > "right thing" to do.
> > It increases the overhead and the complexity on the master,
> > especially if you think about different (partial) replication
> > agreements for different slaves and it might also be hard to
> > integrate with the planned sync/async modes.
> > On the other hand if you filter on the master you might be able to
> > avoid a lot of network traffic due to filtered wal records.
> > I think for a first step it might make more sense to look into doing
> > the filtering on the receiving side and look into actual integration
> > with SR at a later stage.
> one problem with not-filtering on the master is that you will end up
> with a lot of complexity if you start adding new tables to a replica
> because you just cannot add tables as easy as when you are doing stuff
> on the slave. the procedure seems ways more complex.
> in addition to that you are sending WAL which has to be discarded
> anyway.
> we thought about filtering "outside the master" a lot but to me it did
> not sound like good plan.
One possibility for the far future would be to allow filtering on a slave as
well:

master ---- full replication ---> primary slave --- split ---> slaves

Possibly doing only catalog recovery on the primary slave. In my opinion
that's heaps more complex and not better in all situations. So I would
probably write it down as a nice idea but not more.

Andres


Re: draft RFC: concept for partial, wal-based replication

From: Craig Ringer

On 30/11/2009 11:07 PM, Tom Lane wrote:
> Craig Ringer<craig@postnewspapers.com.au>  writes:
>> Just a side note: in addition to its use for partial replication, this
>> might have potential for performance-prioritizing databases or tablespaces.
>
>> Being able to separate WAL logging so that different DBs, tablespaces,
>> etc went to different sets of WAL logs would allow a DBA to give some
>> databases or tablespaces dedicated WAL logging space on faster storage.
>
> I don't think this can possibly work without introducing data corruption
> issues.  What happens when a transaction touches tables in different
> tablespaces?  You can't apply the changes out-of-order.

Argh, good point, and one that should've been blindingly obvious.

At a database level something like that may still be handy, though I 
haven't the foggiest how one would handle the shared system catalogs.

--
Craig Ringer