Thread: Future In-Core Replication

Future In-Core Replication

From
Simon Riggs
Date:
I'm beginning to work on advanced additions to in-core replication for
PostgreSQL.

There are a number of additional features for existing single-master
replication still to achieve, but the key topics to be addressed are
major leaps forward in functionality. I hope to add useful features in
9.3, though I realise that many things could take two or even more
release cycles to achieve. (The last set of features took 8 years, so
I'm hoping to do this a little faster.)

Some of my 2ndQuadrant colleagues will be committing themselves to the
project also and we hope to work with the community in the normal way
to create new features. I mention this only to say that major skills
and resources will be devoted to this for the next release(s), not
that this is a private project.

Some people have talked about the need for "multi-master replication",
whereby 2+ databases communicate changes to one another. This topic
has been discussed in some depth in Computer Science academic papers,
most notably, "The Dangers of Replication and a Solution" by the late
Jim Gray. I've studied this further, to the point where I have a
mathematical model that allows me to predict what our likely success
will be from implementing it. Without meaning to worry you, MM
replication alone is not a solution for large data or the general case.
For the general case, single-master replication will continue to be the
most viable option. For large and distributed data sets, some form of
partitioning/sharding is required, simply because full multi-master
replication just isn't viable at both volume and scale.
So my take on this is that MM is desirable, but it is not the only
thing we need - we also need partial/filtered replication to make large
systems practical. That is why I've been calling this the
"Bi-Directional Replication" project. I'm aware that paragraph alone
requires lots of explanation, which I hope to provide both in writing
and in person at the forthcoming developer conference.

My starting point for designs is to focus on a key constraint: massive
change to the code base is not viable, so any in-core solution must
make minimally invasive changes. And of course, if it is in-core we
must also add robust, clear code with reasonable performance that does
not impede non-replication usage.

The use cases we will address are not focused on any one project or
user. I've distilled these points so far from talking to a wide
variety of users, from major enterprises to startups.

1. GEOGRAPHICALLY DISTRIBUTED - Large users require both High
Availability, which necessitates multiple nodes, and Disaster
Recovery, which necessitates geographically distributed nodes. My
focus is therefore not on "clustering" in the sense of Hadoop or Oracle
RAC, since those technologies require yet further technologies to
provide DR; my aim is to arrive at a coherent set of technologies
that provide all that we want. I'm aware that other projects *are*
focused on clustering, so there is even more reason not to reinvent
the wheel.

2. COHERENT - With regard to coherence, I note this thinking is
similar to the way that Oracle replication is evolving, where they
have multiple kinds of in-core replication alongside various purchased
technologies. We have a similar issue with regard to various external
projects. I very much hope that we can utilise the knowledge, code and
expertise of those other projects as we move forwards.

3. ONLINE UPGRADE - highly available distributed systems must have a
mechanism for online upgrade, otherwise they won't stay HA for long.
This challenge must be part of the solution, and incidentally should
be a useful goal in itself.

4. MULTI-MASTER - the ability to update data from a variety of locations

5. WRITE-SCALEABLE - the ability to partition data across nodes in a
way that allows the solution to improve beyond the write rate of a
single node.
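
Requirement 5 can be made concrete with a toy sketch (mine, not part of
the proposal; all names invented): routing each write by a hash of its
partitioning key so that different nodes absorb different slices of the
write load.

```python
import hashlib

# Hypothetical illustration, not PostgreSQL code: hash-routing writes
# by partitioning key is one way a sharded system can scale writes
# beyond the rate of a single node.
NODES = ["node0", "node1", "node2"]

def node_for(key: str) -> str:
    """Pick the node responsible for a given partitioning key."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def route_insert(table: str, key: str, row: dict) -> tuple:
    """Return (node, statement) for an INSERT routed by key."""
    cols = ", ".join(row)
    vals = ", ".join(repr(v) for v in row.values())
    return node_for(key), f"INSERT INTO {table} ({cols}) VALUES ({vals})"
```

The routing is deterministic, so every sender agrees on which node owns
a given key without coordination.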

Those are the basic requirements that I am trying to address. There
are a great many important details, but the core of this is probably
what I would call "logical replication", that is shipping changes to
other nodes in a way that does not tie us to the same physical
representation that recovery/streaming replication does now. Of
course, non-physical replication can take many forms.

The assumption of consistency across nodes is considered optional at
this point, and I hope to support both eagerly consistent and
eventually consistent approaches.

I'm aware that this is a broad topic on which many people will want
input, and also that this will take much time. This post is more
about announcing the project than discussing specific details.

My strategy for doing this is to come up with some designs and
prototypes of a few things that might be the best way forwards. By
building prototypes we will more quickly be able to address the key
questions before us. So there is currently work on research-based
development to allow wider discussion based upon something more than
just whiteboards. I'll be the first to explain things that don't work.
I also very much agree that "one size fits all" is the wrong strategy.
So there will be implementation options and parameters, and possibly
even multiple replication techniques.

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm. To avoid this
becoming an unworkably large meeting, attendance will be limited, but
it is open to highly technical PostgreSQL users who share these
requirements, to any attendee of the main developer's meeting who
wishes to attend, and to other developers working on PostgreSQL
replication/related topics. That will also allow me to order enough
pizza for everyone. I'll send out private invites to people whom I know
(no spam) and think may be interested, but you are welcome to email me
to get access. (This will take me a day or two, so don't ping me back
if you didn't get your invite.)

I'm going to do my best to include the right set of features for the
majority of people, all focused on submissions to PostgreSQL core, not
any external project.

Best Regards

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> I will also be organising a small-medium sized "Future of In-Core
> Replication" meeting in Ottawa on Wed 16 May, 6-10pm.

Thanks for such a rapid response. I've put up a wiki page and will be
adding names as they come through:

http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Jim Nasby
Date:
On 4/26/12 7:41 AM, Simon Riggs wrote:
> 5. WRITE-SCALEABLE - the ability to partition data across nodes in a
> way that allows the solution to improve beyond the write rate of a
> single node.

It would be valuable to look at READ-SCALEABLE as well; specifically a
second form of "synchronous" replication where you can read from a
slave "immediately" after transaction commit and have the changes be
visible. That ability would make it trivial to spread reads off of the
master database.

My hope is this wouldn't be horribly painful to achieve if we relaxed
the need to fsync the corresponding WAL on the slave; kind of the
opposite of the semi-synchronous mode we have now. My theory is that
thanks to full page writes a slave should normally have the necessary
pages to handle a WAL record in cache, so actually applying the WAL
change shouldn't be horribly slow.
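
As a toy model of that idea (the class and method names below are
illustrative only, not PostgreSQL APIs): the committing backend on the
master would block until the standby reports it has *applied* the WAL
up to the commit record, while deliberately not waiting for the standby
to fsync it - the opposite trade-off from today's semi-synchronous mode.

```python
import threading

class ApplyFeedback:
    """Toy apply-feedback channel between a standby and the master."""

    def __init__(self):
        self._applied_lsn = 0
        self._cond = threading.Condition()

    def report_applied(self, lsn: int) -> None:
        # Standby side: announce WAL applied (changes visible to reads);
        # whether it has been fsynced is intentionally not tracked.
        with self._cond:
            self._applied_lsn = max(self._applied_lsn, lsn)
            self._cond.notify_all()

    def wait_for_apply(self, commit_lsn: int, timeout: float = 5.0) -> bool:
        # Master side: block the committing backend until the standby
        # has applied at least commit_lsn, so a read routed to the
        # standby immediately after commit sees the change.
        with self._cond:
            return bool(self._cond.wait_for(
                lambda: self._applied_lsn >= commit_lsn, timeout))
```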
 
-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


Re: Future In-Core Replication

From
Josh Berkus
Date:
Simon,

So the idea is that you'll present briefly your intentions for 9.3 at
the developer meeting, and then have this in-depth afterwards?  Sounds
great.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Robert Haas
Date:
On Thu, Apr 26, 2012 at 6:48 PM, Josh Berkus <josh@agliodbs.com> wrote:
> So the idea is that you'll present briefly your intentions for 9.3 at
> the developer meeting, and then have this in-depth afterwards?  Sounds
> great.

I really, really do not want the developer meeting to turn into a
series of presentations.  That's what we had last year, and it was
boring and unproductive.  Furthermore, it excludes from the
conversation everyone who doesn't fit into a room in Ottawa.  I think
that plans should be presented here, on pgsql-hackers, and the
developer meeting should be reserved for discussion of issues with
which everyone who will be there is already familiar.  If a
presentation is required, the developer meeting is the wrong forum,
IMHO of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Josh Berkus
Date:
Simon,

> I'm beginning to work on advanced additions to in-core replication for
> PostgreSQL.
...
> Those are the basic requirements that I am trying to address. There
> are a great many important details, but the core of this is probably
> what I would call "logical replication", that is shipping changes to
> other nodes in a way that does not tie us to the same physical
> representation that recovery/streaming replication does now. Of
> course, non-physical replication can take many forms.

So, I'm a bit confused.  You talk about this as "additions to in-core
replication", but then you talk about implementing logical replication,
which would NOT be an addition to Binary replication.  Can you explain
what you mean?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Fri, Apr 27, 2012 at 1:26 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Simon,
>
>> I'm beginning to work on advanced additions to in-core replication for
>> PostgreSQL.
> ...
>> Those are the basic requirements that I am trying to address. There
>> are a great many important details, but the core of this is probably
>> what I would call "logical replication", that is shipping changes to
>> other nodes in a way that does not tie us to the same physical
>> representation that recovery/streaming replication does now. Of
>> course, non-physical replication can take many forms.
>
> So, I'm a bit confused.  You talk about this as "additions to in-core
> replication", but then you talk about implementing logical replication,
> which would NOT be an addition to Binary replication.  Can you explain
> what you mean?

The key point is that there is a specific objective of including
additional features in-core. That places some restrictions, but also
offers some opportunities. Tight integration allows performance
improvements, as well as ease of use etc..

I'm not sure what you mean by "would not be an addition to binary
replication". Yes, for reasons most elegantly explained by Robert here
[http://rhaas.blogspot.co.uk/2011/02/case-for-logical-replication.html],
physical/binary replication puts too many restrictions on us and we
cannot solve all of the problems that way. I was unaware of Robert's
post, but it sets the scene clearly.

So the future of in-core replication, IMHO, is some form of
non-physical replication. There are various options there, but
anything that goes in would reuse significant parts of the existing
replication setup that already works so well. Put that another way:
the infrastructure for the secure and efficient transport of
replication messages is already in place. Reuse is also what makes
something useful be achievable in a reasonable timescale.

What we need to consider is the form of those new non-physical WAL
messages, how they are built on the sender and how they are handled at
the receiving end.

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

It will take time for people to believe this is possible, and longer
to analyse and agree on the options.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Fri, Apr 27, 2012 at 12:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 26, 2012 at 6:48 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> So the idea is that you'll present briefly your intentions for 9.3 at
>> the developer meeting, and then have this in-depth afterwards?  Sounds
>> great.
>
> I really, really do not want the developer meeting to turn into a
> series of presentations.  That's what we had last year, and it was
> boring and unproductive.  Furthermore, it excludes from the
> conversation everyone who doesn't fit into a room in Ottawa.  I think
> that plans should be presented here, on pgsql-hackers, and the
> developer meeting should be reserved for discussion of issues with
> which everyone who will be there is already familiar.  If a
> presentation is required, the developer meeting is the wrong forum,
> IMHO of course.

I agree.

Obviously, the word "presentation" is the issue here. If one person
speaks and everybody else is silent, that is not a good use of
anybody's time.

On any topic, I expect the introducer of that topic to set the scene,
establish the major questions and encourage discussion.

I would follow that form myself: I want to hear the wisdom of others
and alter my plans accordingly.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Josh Berkus
Date:
> What I'm hoping to do is to build a basic prototype of logical
> replication using WAL translation, so we can inspect it to see what
> the downsides are. 

Sounds like Mammoth.  You take a look at that?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Fri, Apr 27, 2012 at 6:59 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> What I'm hoping to do is to build a basic prototype of logical
>> replication using WAL translation, so we can inspect it to see what
>> the downsides are.
>
> Sounds like Mammoth.  You take a look at that?

Well, they all sound similar. My info was that Mammoth was not WAL-based.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
"Joshua D. Drake"
Date:
On 04/27/2012 11:33 AM, Simon Riggs wrote:
> Well, they all sound similar. My info was that Mammoth was not WAL-based. 
Mammoth was transaction log based but not WAL based.


Re: Future In-Core Replication

From
Christopher Browne
Date:
On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What I'm hoping to do is to build a basic prototype of logical
> replication using WAL translation, so we can inspect it to see what
> the downsides are. It's an extremely non-trivial problem and so I
> expect there to be mountains to climb. There are other routes to
> logical replication, with messages marshalled in a similar way to
> Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
> measurements to be made and discussions to be had.

I'll note that the latest version of Slony, expected to be 2.2 (which
generally seems to work, but we're stuck at the moment waiting to get
free cycles to QA it) has made a substantial change to its data
representation.

The triggers used to cook data into a sort of "fractional WHERE
clause," transforming an I/U/D into a string that you'd trivially
combine with the string INSERT INTO/UPDATE/DELETE FROM to get the
logical update.  If there was a need to do anything fancier, you'd be
left having to write a "fractional SQL parser" to split the data out
by hand.

New in 2.2 is that the log data is split out into an array of text
values which means that if someone wanted to do some transformation,
such as filtering on value, or filtering out columns, they could
modify the application-of-updates code to query for the data that they
want to fiddle with.  No parser needed.
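
A sketch of the contrast (the record layout here is simplified and
invented, not Slony's actual log schema): the pre-2.2 representation is
an opaque SQL fragment, while the 2.2-style representation keeps column
names and values as parallel arrays, so per-column filtering needs no
SQL parser at all.

```python
# Pre-2.2: a "fractional SQL" string you could only append to a command.
old_style = "(id, name) values ('1', 'alice')"

# 2.2-style: a structured log row with columns and values split out.
new_style = {
    "table": "users",
    "kind": "I",                 # I/U/D
    "cols": ["id", "name"],
    "vals": ["1", "alice"],
}

def drop_column(row: dict, col: str) -> dict:
    """Filter a column out of a logged change - trivial on the new
    representation, impossible on the old one without parsing SQL."""
    keep = [i for i, c in enumerate(row["cols"]) if c != col]
    return {**row,
            "cols": [row["cols"][i] for i in keep],
            "vals": [row["vals"][i] for i in keep]}
```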

It's doubtless worthwhile to take a peek at that to make sure it
informs your data representation appropriately.  It's important to
have data represented in a fashion that is amenable to manipulation,
and that decidedly wasn't the case pre-2.2.

I wonder if a meaningful transport mechanism might involve combining:
a) A trigger that indicates that some data needs to be captured in a
"logical" form (rather than the presently pretty purely physical form
of WAL)
b) Perhaps a way of capturing logical updates in WAL
c) One of the old ideas that fell through was to try to capture commit
timestamps via triggers.  Doing it directly turned out to be too
controversial to get in.  Perhaps that's something that could be
captured via some process that parses WAL.

Something seems wrong about that in that it mixes together updates of
multiple forms into WAL, physical *and* logical, and perhaps that
implies that there should be an altogether separate "logical updates
log." (LUL?)  That still involves capturing updates in a duplicative
fashion, e.g. - WAL + LUL, which seems somehow wrong.  Or perhaps I'm
tilting at a windmill here.  With Slony/Londiste/Bucardo, we're
capturing "LUL" in some tables, meaning that it gets written both to
the tables' data files as well as WAL.  Adding a binary LUL eliminates
those table files and attendant WAL updates, thus providing some
savings.

[Insert a LULCATS joke here...]

Perhaps I've just had too much coffee...
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I'm hoping to do is to build a basic prototype of logical
>> replication using WAL translation, so we can inspect it to see what
>> the downsides are. It's an extremely non-trivial problem and so I
>> expect there to be mountains to climb. There are other routes to
>> logical replication, with messages marshalled in a similar way to
>> Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
>> measurements to be made and discussions to be had.
>
> I'll note that the latest version of Slony ...has made a substantial change to its data
> representation....

The basic model I'm working to is that "logical replication" will ship
Logical Change Records (LCRs) using the same transport mechanism that
we built for WAL.

How the LCRs are produced and how they are applied is a subject for
debate and measurement. We're lucky enough to have a variety of
mechanisms to compare - Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo - and
it's worth adding WAL translation there also. My initial thought is
that WAL translation has many positive aspects to it, and we are
investigating. There are also some variants on those themes, such as
the one you discussed above.

You probably won't recognise this as such, but I hope that people
might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
point, we'll all say "that's not Slony", but we'll also say (Josh
already did) "that's not binary replication". But it will be the
descendant of all.

Backwards compatibility is not a goal, please note, but only because
it would complicate matters intensely.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Hannu Krosing
Date:
On Sat, 2012-04-28 at 09:36 +0100, Simon Riggs wrote:
> On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> > On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> What I'm hoping to do is to build a basic prototype of logical
> >> replication using WAL translation, so we can inspect it to see what
> >> the downsides are. It's an extremely non-trivial problem and so I
> >> expect there to be mountains to climb. There are other routes to
> >> logical replication, with messages marshalled in a similar way to
> >> Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
> >> measurements to be made and discussions to be had.
> >
> > I'll note that the latest version of Slony ...has made a substantial change to its data
> > representation....
> 
> The basic model I'm working to is that "logical replication" will ship
> Logical Change Records (LCRs) using the same transport mechanism that
> we built for WAL.

One outcome of this LCR approach is probably that you will be shipping
changes as they are made, and on the slave you either have to apply
them in N parallel transactions, committing each transaction when the
LCR for the corresponding transaction says so, or you have to collect
the LCRs before applying and then apply and commit committed-on-master
transactions in commit order, throwing away the aborted ones.

The optimal approach will probably be some combination of these: that
is, collect and apply the short ones, and start replay in a separate
transaction if the commit does not arrive within N ms.
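
The collect-then-apply half of that could be sketched like this (a toy
with invented names, no real replication involved): buffer each
transaction's LCRs, apply the whole transaction when its commit record
arrives, and discard the buffer on abort. A full implementation would
add the timeout path that switches a long-running transaction over to
streaming apply in its own open transaction.

```python
from collections import defaultdict

class LCRApplier:
    """Toy collect-then-apply strategy for logical change records."""

    def __init__(self):
        self.pending = defaultdict(list)   # xid -> buffered LCRs
        self.applied = []                  # stand-in for "run against DB"

    def receive(self, xid: int, lcr: str) -> None:
        # Changes stream in as they are made on the master; buffer them
        # until we know the transaction's fate.
        self.pending[xid].append(lcr)

    def commit(self, xid: int) -> None:
        # Committed on master: apply the whole transaction, preserving
        # master commit order across transactions.
        self.applied.extend(self.pending.pop(xid, []))

    def abort(self, xid: int) -> None:
        # Aborted on master: throw the buffered LCRs away.
        self.pending.pop(xid, None)
```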

As to what LCRs should contain, it will probably be logical equivalents
of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

The DDL could actually stay "raw" (as in LCRs for system tables) on the
generator side, as hopefully the rule that "system tables can't have
triggers" does not apply when generating the LCRs on the WAL path.
If we need to go back to ALTER TABLE ... commands, then this is probably
wisest to leave for the client. The client here could also be some
xReader-like middleman.

I would even go as far as to propose a variant of DML-WITH-LIMIT-1 to
be added to PostgreSQL's SQL syntax, so that the LCRs could be
converted to SQL text for some tasks and thus be easy to process using
generic text-based tools.
The DML-WITH-LIMIT-1 is required to do single logical updates on tables
with non-unique rows.
And as for any logical updates, we will have a huge performance problem
when doing UPDATE or DELETE on a large table with no indexes, but
fortunately this problem is on the slave, not the master ;)
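
To illustrate the conversion being proposed (the LCR layout is invented
for the example, and the LIMIT 1 on UPDATE is the *proposed* syntax
extension - PostgreSQL does not accept it today):

```python
def lcr_to_sql(lcr: dict) -> str:
    """Render an UPDATE-style LCR as SQL text.

    The WHERE clause matches all old column values; the proposed
    LIMIT 1 ensures exactly one of several identical rows is touched.
    """
    where = " AND ".join(f"{c} = {v!r}" for c, v in lcr["old"].items())
    sets = ", ".join(f"{c} = {v!r}" for c, v in lcr["new"].items())
    return (f"UPDATE {lcr['table']} SET {sets} "
            f"WHERE {where} LIMIT 1")  # proposed syntax, not valid today

lcr = {"table": "t", "old": {"a": 1, "b": 2}, "new": {"b": 3}}
```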

Generating and shipping the LCRs at WAL-generation time, or perhaps
even a bit earlier, will have the huge performance benefit of not doing
double writing of captured events on the master, which currently is
needed for several reasons, the main one being determining which
transactions commit and in what order. (This can't be solved on the
master without a local event log table, as we don't have
commit/rollback triggers.)

If we delegate that part out of the master, then this alone enables us
to be almost as fast as a WAL-based replica in most cases, even when we
have a different logical structure on the slaves.

> How the LCRs are produced and how they are applied is a subject for
> debate and measurement. We're lucky enough to have a variety of
> mechanisms to compare, Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo and
> its worth adding WAL translation there also. My initial thought is
> that WAL translation has many positive aspects to it and we are
> investigating. There are also some variants on those themes, such as
> the one you discussed above.
> 
> You probably won't recognise this as such, but I hope that people
> might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
> point, we'll all say "thats not Slony", but we'll also say (Josh
> already did) "thats not binary replication". But it will be the
> descendant of all.

If we get efficient and flexible logical change event generation on the
master, then I'm sure the current trigger-based logical replication
providers will switch (for full replication) or at least add an extra
LCR source. It may still make sense to leave some flexibility on the
master side, so that some decisions - possibly even complex ones -
could be made when generating the LCRs.

What I would like is to have some of it exposed to userspace via a
function which could be used by developers to push their own LCRs.

As mentioned above, a significant part of this approach can be
prototyped from user-level triggers as soon as we have triggers on
commit and rollback, even though at slightly reduced performance. That
is, it will still have the trigger overhead, but we can omit all the
extra writing, re-reading and event-table management on the master.

Wanting to play with Streaming Logical Replication (as opposed to
current Chunked Logical Replication) is also one of the reasons that I
complained when the "command triggers" patch was kicked out from 9.2.

> Backwards compatibility is not a goal, please note, but only because
> that will complicate matters intensely.

Currently there really is nothing similar enough that this could be
backward compatible with :)

> -- 
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 




Re: Future In-Core Replication

From
Hannu Krosing
Date:
On Sat, 2012-04-28 at 21:40 +0200, Hannu Krosing wrote:
> On Sat, 2012-04-28 at 09:36 +0100, Simon Riggs wrote:
> > On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> > > On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > >> What I'm hoping to do is to build a basic prototype of logical
> > >> replication using WAL translation, so we can inspect it to see what
> > >> the downsides are. It's an extremely non-trivial problem and so I
> > >> expect there to be mountains to climb. There are other routes to
> > >> logical replication, with messages marshalled in a similar way to
> > >> Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
> > >> measurements to be made and discussions to be had.
> > >
> > > I'll note that the latest version of Slony ...has made a substantial change to its data
> > > representation....

Btw, Londiste also moved from a Slony-like trigger with partial SQL to
a more general format a few versions ago.

I sure hope they will move to JSON as the data representation now that
we'll have JSON in core in 9.2.

-- 
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/



Re: Future In-Core Replication

From
Simon Riggs
Date:
On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:

> As to what LCRs should contain, it will probably be locical equivalents
> of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

Yeh

> I would even go as far as propose a variant for DML-WITH-LIMIT-1 to be
> added to postgresql's SQL syntax so that the LCRs could be converted to
> SQL text for some tasks and thus should be easy to process using generic
> text-based tools.
> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
> with non-unique rows.
> And as for any logical updates we will have huge performance problem
> when doing UPDATE or DELETE on large table with no indexes, but
> fortunately this problem is on slave, not master ;)

While that is possible, I would favour the do-nothing approach. By
making the default replication mode = none, we then require a PK to be
assigned before allowing replication mode = on for a table. Trying to
replicate tables without PKs is a problem that can wait basically.

> Generating and shipping the LCR-s at WAL-generation time or perhaps even
> a bit earlier will have a huge performance benefit of not doing double
> writing of captured events on the master which currently is needed for
> several reasons, the main one being the determining of which
> transactions do commit and in what order. (this cant be solved on master
> without a local event log table as we dont have commit/rollback
> triggers)
>
> If we delegate that part out of the master then this alone enables us to
> be almost as fast as WAL based replica in most cases, even when we have
> different logical structure on slaves.

Agreed

>> How the LCRs are produced and how they are applied is a subject for
>> debate and measurement. We're lucky enough to have a variety of
>> mechanisms to compare, Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo and
>> its worth adding WAL translation there also. My initial thought is
>> that WAL translation has many positive aspects to it and we are
>> investigating. There are also some variants on those themes, such as
>> the one you discussed above.
>>
>> You probably won't recognise this as such, but I hope that people
>> might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
>> point, we'll all say "thats not Slony", but we'll also say (Josh
>> already did) "thats not binary replication". But it will be the
>> descendant of all.
>
> If we get efficient and flexible logical change event generation on the
> master, then I'm sure the current trigger-based logical replication
> providers will switch (for full replication) or at least add and extra
> LCR-source . It may still make sense to leave some flexibility to the
> master side, so the some decisions - possibly even complex ones - could
> be made when generating the LCR-s
>
> What I would like is to have some of it exposed to userspace via
> function which could be used by developers to push their own LCRs.

Yes, I see that one coming.

That use case is not something I'm focused on, but I do recognise
others wish to pursue that.

The bit I'm not sure about is whether we have custom handler code as well.

> As metioned above, significant part of this approach can be prototyped
> from user-level triggers as soon as we have triggers on commit and
> rollback , even though at a slightly reduced performance. That is it
> will still have the trigger overhead, but we can omit all the extra
> writing and then re-reading and event-table management on the master.

Agreed.

> Wanting to play with Streaming Logical Replication (as opposed to
> current Chunked Logical Replication) is also one of the reasons that I
> complained when the "command triggers" patch was kicked out from 9.2.

Yeh. It's clear that project needs to move forwards quickly in 9.3 if
we are to make this Just Work in the way we hope.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Hannu Krosing
Date:
On Sun, 2012-04-29 at 12:03 +0100, Simon Riggs wrote:
> On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:
> 
> > As to what LCRs should contain, it will probably be locical equivalents
> > of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.
> 
> Yeh
> 
> > I would even go as far as propose a variant for DML-WITH-LIMIT-1 to be
> > added to postgresql's SQL syntax so that the LCRs could be converted to
> > SQL text for some tasks and thus should be easy to process using generic
> > text-based tools.
> > The DML-WITH-LIMIT-1 is required to do single logical updates on tables
> > with non-unique rows.
> > And as for any logical updates we will have huge performance problem
> > when doing UPDATE or DELETE on large table with no indexes, but
> > fortunately this problem is on slave, not master ;)
> 
> While that is possible, I would favour the do-nothing approach. By
> making the default replication mode = none, we then require a PK to be
> assigned before allowing replication mode = on for a table. Trying to
> replicate tables without PKs is a problem that can wait basically.
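A toy model (plain Python; the LIMIT-1 syntax itself is hypothetical and not valid PostgreSQL) of why replaying logical changes on tables with non-unique rows needs LIMIT 1 semantics at all:

```python
# Replaying a logical DELETE of one row on a slave table that holds
# duplicate rows: without LIMIT 1 semantics the slave diverges.
rows = [("a", 1), ("a", 1), ("b", 2)]   # non-unique rows

def delete_all(rows, match):
    # what a plain re-executed DELETE would do
    return [r for r in rows if r != match]

def delete_limit_1(rows, match):
    # what DELETE ... LIMIT 1 semantics would do
    out, removed = [], False
    for r in rows:
        if r == match and not removed:
            removed = True              # drop exactly one matching row
        else:
            out.append(r)
    return out

# The master deleted one of the two duplicates; plain replay deletes both:
assert delete_all(rows, ("a", 1)) == [("b", 2)]
# LIMIT 1 replay keeps the replica in step with the master:
assert delete_limit_1(rows, ("a", 1)) == [("a", 1), ("b", 2)]
```

With a PK there is no ambiguity to resolve, which is why requiring a PK before enabling replication sidesteps the whole question.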

While this is a good approach in most cases, there is a large use case
for PK-less / indexless tables, such as large log tables, where you may
want to do INSERT-only replication, perhaps with some automatic
partitioning on logdate. Allowing this is probably something to look at
in the first release, even though I'm not sure what should happen on
violation of this insert-only policy.
Should it
* refuse to continue and roll back the transaction (probably not)
* fail silently
* succeed but log the change locally
* succeed with some special flags so the other side can treat it
specially without having to look up stuff in the system catalog
* (if we mark the unique / PK fields in some special way anyway, then
the previous one is free :)


-- 
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/



Re: Future In-Core Replication

From
Simon Riggs
Date:
On Sun, Apr 29, 2012 at 6:20 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On Sun, 2012-04-29 at 12:03 +0100, Simon Riggs wrote:
>> On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:
>>
>> > As to what LCRs should contain, it will probably be logical equivalents
>> > of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.
>>
>> Yeh
>>
>> > I would even go as far as to propose a variant for DML-WITH-LIMIT-1 to be
>> > added to PostgreSQL's SQL syntax so that the LCRs could be converted to
>> > SQL text for some tasks and thus should be easy to process using generic
>> > text-based tools.
>> > The DML-WITH-LIMIT-1 is required to do single logical updates on tables
>> > with non-unique rows.
>> > And as for any logical updates, we will have a huge performance problem
>> > when doing UPDATE or DELETE on a large table with no indexes, but
>> > fortunately this problem is on the slave, not the master ;)
>>
>> While that is possible, I would favour the do-nothing approach. By
>> making the default replication mode = none, we then require a PK to be
>> assigned before allowing replication mode = on for a table. Trying to
>> replicate tables without PKs is a problem that can wait basically.
>
> While this is a good approach in most cases, there is a large use case
> for PK-less / indexless tables, such as large log tables, where you may
> want to do INSERT-only replication, perhaps with some automatic
> partitioning on logdate. Allowing this is probably something to look at
> in the first release, even though I'm not sure what should happen on
> violation of this insert-only policy.
> Should it
> * refuse to continue and roll back the transaction (probably not)
> * fail silently
> * succeed but log the change locally
> * succeed with some special flags so the other side can treat it
> specially without having to look up stuff in the system catalog
> * (if we mark the unique / PK fields in some special way anyway, then
> the previous one is free :)

OK, I think an insert-only replication mode would allow that.
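A toy sketch (plain Python; the mode and policy names are made up for illustration) of such an insert-only mode, covering a few of the violation-handling options listed above:

```python
# Insert-only replication mode: what happens when an UPDATE or DELETE
# hits a table replicated insert-only. Policy names are hypothetical.
def replicate(change, mode, on_violation="reject"):
    op = change[0]
    if mode == "insert_only" and op != "INSERT":
        if on_violation == "reject":
            raise ValueError("insert-only table: cannot replicate " + op)
        if on_violation == "silent":
            return None                      # fail silently
        if on_violation == "flagged":
            return change + ("LOCAL_ONLY",)  # flag it for the other side
    return change

assert replicate(("INSERT", "log", {"t": 1}), "insert_only") == \
       ("INSERT", "log", {"t": 1})
assert replicate(("DELETE", "log", {"t": 1}), "insert_only",
                 "silent") is None
assert replicate(("DELETE", "log", {"t": 1}), "insert_only",
                 "flagged")[-1] == "LOCAL_ONLY"
```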

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Alvaro Herrera
Date:
Excerpts from Simon Riggs's message of Thu Apr 26 11:10:09 -0300 2012:
> On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > I will also be organising a small-medium sized "Future of In-Core
> > Replication" meeting in Ottawa on Wed 16 May, 6-10pm.
>
> Thanks for such rapid response. I've put up a wiki page and will be
> adding names as they come through
>
> http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

How is this not redundant with the Cluster Summit?
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

... oh, you're also already enlisted in that one.  Sigh.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Future In-Core Replication

From
Dave Page
Date:
On Sun, Apr 29, 2012 at 11:20 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>
> Excerpts from Simon Riggs's message of Thu Apr 26 11:10:09 -0300 2012:
>> On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> > I will also be organising a small-medium sized "Future of In-Core
>> > Replication" meeting in Ottawa on Wed 16 May, 6-10pm.
>>
>> Thanks for such rapid response. I've put up a wiki page and will be
>> adding names as they come through
>>
>> http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting
>
> How is this not redundant with the Cluster Summit?
> http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit
>
> ... oh, you're also already enlisted in that one.  Sigh.

My understanding is that the agenda for the cluster meeting is almost
entirely dedicated to Postgres-XC.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Mon, Apr 30, 2012 at 7:35 AM, Dave Page <dpage@pgadmin.org> wrote:
> On Sun, Apr 29, 2012 at 11:20 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>>
>> Excerpts from Simon Riggs's message of Thu Apr 26 11:10:09 -0300 2012:
>>> On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>> > I will also be organising a small-medium sized "Future of In-Core
>>> > Replication" meeting in Ottawa on Wed 16 May, 6-10pm.
>>>
>>> Thanks for such rapid response. I've put up a wiki page and will be
>>> adding names as they come through
>>>
>>> http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting
>>
>> How is this not redundant with the Cluster Summit?
>> http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit
>>
>> ... oh, you're also already enlisted in that one.  Sigh.
>
> My understanding is that the agenda for the cluster meeting is almost
> entirely dedicated to Postgres-XC.

Yes. Regrettably, there wasn't a 90 minute slot available to discuss
this at the cluster meeting, the dev meeting agenda is fairly full and
my presentation on replication was rejected for the main conference.
As a result, the additional meeting was the only way left open to me
to initiate detailed discussion.

The additional meeting will give us 240 minutes of discussion and
allows us to bring in some technical users as well.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Atri Sharma
Date:
On Mon, Apr 30, 2012 at 12:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, Apr 30, 2012 at 7:35 AM, Dave Page <dpage@pgadmin.org> wrote:
>> On Sun, Apr 29, 2012 at 11:20 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>>>
>>> Excerpts from Simon Riggs's message of Thu Apr 26 11:10:09 -0300 2012:
>>>> On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>
>>>> > I will also be organising a small-medium sized "Future of In-Core
>>>> > Replication" meeting in Ottawa on Wed 16 May, 6-10pm.
>>>>
>>>> Thanks for such rapid response. I've put up a wiki page and will be
>>>> adding names as they come through
>>>>
>>>> http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting
>>>
>>> How is this not redundant with the Cluster Summit?
>>> http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit
>>>
>>> ... oh, you're also already enlisted in that one.  Sigh.
>>
>> My understanding is that the agenda for the cluster meeting is almost
>> entirely dedicated to Postgres-XC.
>
> Yes. Regrettably, there wasn't a 90 minute slot available to discuss
> this at the cluster meeting, the dev meeting agenda is fairly full and
> my presentation on replication was rejected for the main conference.
> As a result, the additional meeting was the only way left open to me
> to initiate detailed discussion.
>
> The additional meeting will give us 240 minutes of discussion and
> allows us to bring in some technical users as well.
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>

Hi All,

Will the live streaming/recording of the meeting be available?

Atri

--
Regards,

Atri
l'apprenant


Re: Future In-Core Replication

From
Bruce Momjian
Date:
On Thu, Apr 26, 2012 at 01:41:33PM +0100, Simon Riggs wrote:
> Some people have talked about the need for "multi-master replication",
> whereby 2+ databases communicate changes to one another. This topic
> has been discussed in some depth in Computer Science academic papers,
> most notably, "The Dangers of Replication and a Solution" by the late
> Jim Gray. I've further studied this to the point where I have a
> mathematical model of this that allows me to predict what our likely
> success will be from implementing that. Without meaning to worry you,
> MM replication alone is not a solution for large data or the general
> case. For the general case, single master replication will continue to
> be the most viable option. For large and distributed data sets, some
> form of partitioning/sharding is required simply because full
> multi-master replication just isn't viable at both volume and scale.
> So my take on this is that MM is desirable, but is not the only thing
> we need - we also need partial/filtered replication to make large
> systems practical. Hence why I've been calling this the
> "Bi-Directional Replication" project. I'm aware that paragraph alone
> requires lots of explanation, which I hope to do both in writing and
> in person at the forthcoming developer conference.

I would love to see a layout of exactly where these things make sense,
similar to what we do at the bottom of our documentation for "High
Availability, Load Balancing, and Replication":
http://www.postgresql.org/docs/9.1/static/different-replication-solutions.html

Users and developers just can't seem to get the calculus of where things
make sense into their heads, me included.

For example, you said that "MM replication alone is not a solution for
large data or the general case".  Why is that?  Is the goal of your work
really to do logical replication, which allows for major version
upgrades?  Is that the defining feature?

--
 Bruce Momjian  <bruce@momjian.us>        http://momjian.us
 EnterpriseDB                             http://enterprisedb.com

 + It's impossible for everything to be true. +


Re: Future In-Core Replication

From
Merlin Moncure
Date:
On Mon, Apr 30, 2012 at 12:38 PM, Bruce Momjian <bruce@momjian.us> wrote:
> For example, you said that "MM replication alone is not a solution for
> large data or the general case".  Why is that?  Is the goal of your work
> really to do logical replication, which allows for major version
> upgrades?  Is that the defining feature?

TBH, I don't think MM replication belongs in the database at all.
Ditto any replication solution that implements 'eventual consistency'
such that after the fact conflict resolution is required.  In an SQL
database, when a transaction commits, it should remain so.  It belongs
in the application layer.

merlin


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Mon, Apr 30, 2012 at 6:38 PM, Bruce Momjian <bruce@momjian.us> wrote:

> I would love to see a layout of exactly where these things make sense,
> similar to what we do at the bottom of our documentation for "High
> Availability, Load Balancing, and Replication":
>
>        http://www.postgresql.org/docs/9.1/static/different-replication-solutions.html
>
> Users and developers just can't seem to get the calculus of where things
> make sense into their heads, me included.
>
> For example, you said that "MM replication alone is not a solution for
> large data or the general case".  Why is that?  Is the goal of your work
> really to do logical replication, which allows for major version
> upgrades?  Is that the defining feature?

Good question.

The use case, its breadth and utility are always the first place I
start. I'm in the middle of writing a presentation that explains this
from first principles and will be discussing that at the PgCon
meeting. It's taken a long time to articulate that rather than make
leaps of assumption and belief.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Robert Haas
Date:
On Mon, Apr 30, 2012 at 2:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Apr 30, 2012 at 12:38 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> For example, you said that "MM replication alone is not a solution for
>> large data or the general case".  Why is that?  Is the goal of your work
>> really to do logical replication, which allows for major version
>> upgrades?  Is that the defining feature?
>
> TBH, I don't think MM replication belongs in the database at all.
> Ditto any replication solution that implements 'eventual consistency'
> such that after the fact conflict resolution is required.  In an SQL
> database, when a transaction commits, it should remain so.  It belongs
> in the application layer.

I basically agree, at least in the medium term.  The logical
replication solutions we have today generally seem to work by watching
the inserts, updates, and deletes go by and writing the changed tuples
to a side table.  This is not very performant, because it amounts to
writing the data four times: we have to write WAL for the original
change, write the data files for the original change, write more WAL
for the change records, and then write those data files.  Since all
large database solutions are eventually I/O-bound, this is not great.
Writing and flushing a separate replication log in parallel to WAL
would get us down to three writes, and extracting tuple data from the
existing WAL would get us down to two writes, which is as well as we
ever know how to do.

If we just had that much in core - that is, the ability to efficiently
extract tuple inserts, updates, and deletes on a logical level - it
would be much easier to build a good logical replication system around
PostgreSQL than it is today, and the existing systems could be adapted
to deliver higher performance by making use of the new infrastructure.
The other half of the changes - applying the updates - is relatively
straightforward, and it wouldn't bother me to leave that in user-land,
especially in the MMR case, where you have to deal with conflict
resolution rules that may be much simpler to express in a higher-level
language than they would be in C.
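The "extract from existing WAL" idea can be caricatured like this (plain Python; the record names are invented and real WAL records look nothing like this):

```python
# Caricature of logical extraction: walk WAL-like records and emit
# logical change records (LCRs) with no additional writes on the master.
wal = [
    ("HEAP_INSERT", "accounts", {"id": 1, "bal": 100}),
    ("HEAP_UPDATE", "accounts", {"id": 1, "bal": 90}),
    ("BTREE_SPLIT", "accounts_pkey", None),    # physical-only record
    ("HEAP_DELETE", "accounts", {"id": 1}),
]

def decode(wal):
    logical = {"HEAP_INSERT": "INSERT",
               "HEAP_UPDATE": "UPDATE",
               "HEAP_DELETE": "DELETE"}
    for kind, rel, tup in wal:
        if kind in logical:                    # skip physical-only noise
            yield (logical[kind], rel, tup)

lcrs = list(decode(wal))
assert [op for op, _, _ in lcrs] == ["INSERT", "UPDATE", "DELETE"]
```

The point of the sketch: the two writes (WAL plus data files) the master already does are sufficient input; everything logical is derived downstream.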

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Bruce Momjian
Date:
On Mon, Apr 30, 2012 at 07:55:00PM +0100, Simon Riggs wrote:
> On Mon, Apr 30, 2012 at 6:38 PM, Bruce Momjian <bruce@momjian.us> wrote:
> 
> > I would love to see a layout of exactly where these things make sense,
> > similar to what we do at the bottom of our documentation for "High
> > Availability, Load Balancing, and Replication":
> >
> >        http://www.postgresql.org/docs/9.1/static/different-replication-solutions.html
> >
> > Users and developers just can't seem to get the calculus of where things
> > make sense into their heads, me included.
> >
> > For example, you said that "MM replication alone is not a solution for
> > large data or the general case".  Why is that?  Is the goal of your work
> > really to do logical replication, which allows for major version
> > upgrades?  Is that the defining feature?
> 
> Good question.
> 
> The use case, its breadth and utility are always the first place I
> start. I'm in the middle of writing a presentation that explains this
> from first principles and will be discussing that at the PgCon
> meeting. It's taken a long time to articulate that rather than make
> leaps of assumption and belief.

Yep, it is the "assumption and belief" that always confuses me.

--
 Bruce Momjian  <bruce@momjian.us>        http://momjian.us
 EnterpriseDB                             http://enterprisedb.com

 + It's impossible for everything to be true. +


Re: Future In-Core Replication

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> The other half of the changes - applying the updates - is
> relatively straightforward, and it wouldn't bother me to leave
> that in user-land, especially in the MMR case, where you have to
> deal with conflict resolution rules that may be much simpler to
> express in a higher-level language than they would be in C.

I've developed successful MMR more than once, and while I wouldn't
say it's exactly been implemented in the database, it hasn't exactly
been in application space either.  The most successful
implementations I've worked with have been a layer just outside the
database, of which application software was completely unaware.  The
database wasn't aware of the coordination per se; but it *did* need
to provide transaction information in a clean way, and the
declarations of how data was distributed were in the database.  In
my experience a declarative definition of data distribution has
always been sufficient, and certainly cleaner to deal with than
imperative coding would be.

YMMV.

-Kevin


Re: Future In-Core Replication

From
Merlin Moncure
Date:
On Mon, Apr 30, 2012 at 2:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 30, 2012 at 2:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Mon, Apr 30, 2012 at 12:38 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> For example, you said that "MM replication alone is not a solution for
>>> large data or the general case".  Why is that?  Is the goal of your work
>>> really to do logical replication, which allows for major version
>>> upgrades?  Is that the defining feature?
>>
>> TBH, I don't think MM replication belongs in the database at all.
>> Ditto any replication solution that implements 'eventual consistency'
>> such that after the fact conflict resolution is required.  In an SQL
>> database, when a transaction commits, it should remain so.  It belongs
>> in the application layer.
>
> I basically agree, at least in the medium term.  The logical
> replication solutions we have today generally seem to work by watching
> the inserts, updates, and deletes go by and writing the changed tuples
> to a side table.  This is not very performant, because it amounts to
> writing the data four times: we have to write WAL for the original
> change, write the data files for the original change, write more WAL
> for the change records, and then write those data files.  Since all
> large database solutions are eventually I/O-bound, this is not great.
> Writing and flushing a separate replication log in parallel to WAL
> would get us down to three writes, and extracting tuple data from the
> existing WAL would get us down to two writes, which is as well as we
> ever know how to do.
>
> If we just had that much in core - that is, the ability to efficiently
> extract tuple inserts, updates, and deletes on a logical level - it
> would be much easier to build a good logical replication system around
> PostgreSQL than it is today, and the existing systems could be adapted
> to deliver higher performance by making use of the new infrastructure.
>  The other half of the changes - applying the updates - is relatively
> straightforward, and it wouldn't bother me to leave that in user-land,
> especially in the MMR case, where you have to deal with conflict
> resolution rules that may be much simpler to express in a higher-level
> language than they would be in C.

Yeah -- here at $work the SQL Server team (once in a while we cross
no-man's land and converse) has some fancy technology that sits
directly on top of the transaction log and exposes an API that you can
use to peek into the river of data running through the log and do
stuff with it.  In our case, they use it to triage extracts from about
100 or so distributed databases into a centralized store in a
relatively realtime fashion.  HS/SR simply can't do that and there
would be tremendous value in something that could.

merlin


Re: Future In-Core Replication

From
Josh Berkus
Date:
>> If we just had that much in core - that is, the ability to efficiently
>> extract tuple inserts, updates, and deletes on a logical level - it
>> would be much easier to build a good logical replication system around
>> PostgreSQL than it is today, and the existing systems could be adapted
>> to deliver higher performance by making use of the new infrastructure.

Well, this *is* the purpose of the cluster-hackers group, to add backend
support which would make external replication systems easier to build
and more efficient.  So far the only real feature to come out of that
has been the Command Triggers, but if you read the TODO list of that
group you'll see that it's a laundry list of things replication systems
need support for in the backend.

http://wiki.postgresql.org/wiki/ClusterFeatures
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

What puts me off about this "let's start from first principles" approach
is that in our community we have years of experience (in a couple cases,
over a decade) with logical-level replication.  It seems like the last
thing we need is Yet Another PostgreSQL Replication System, started over
from scratch and years away from being production quality.  Don't we
have enough external replication systems with not enough developers
behind them?

Even if improving an existing replication system proves to be
impossible, it would make more sense to start with an analysis of the
strengths and deficiencies of Slony, Londiste, Bucardo, etc., than with
some kind of "clean-room" approach.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Mon, Apr 30, 2012 at 11:43 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Well, this *is* the purpose of the cluster-hackers group

Well, I tried all available means to discuss my ideas before
organising an external meeting. You can think of the InCore meeting as
an extension of the cluster hackers meeting if you wish.

> What puts me off about this "let's start from first principles" approach
> is that in our community we have years of experience (in a couple cases,
> over a decade) with logical-level replication.  It seems like the last
> thing we need is Yet Another PostgreSQL Replication System, started over
> from scratch and years away from being production quality.  Don't we
> have enough external replication systems with not enough developers
> behind them?

I've tried very hard to express my admiration and respect for
developers of many replication systems both personally when we meet in
person and on list. And I've mentioned that I'd like to include as
many ideas as possible in an in-core approach. Yes, we have many
external replication systems. Many, many people have expressed the
desire for more coherent features in core.

I believe we can achieve production quality code in 1, maybe 2
releases. That is only possible by building on what we already have
and reusing the concepts, experience and perhaps even code from other
projects. We are closer to that than your statement allows. Yes, we
have enough external replication systems and that is one reason why we
need to put things in core and not just create another external
system.

> Even if improving an existing replication system proves to be
> impossible, it would make more sense to start with an analysis of the
> strengths and deficiencies of Slony, Londiste, Bucardo, etc., than with
> some kind of "clean-room" approach.

I'm not sure why you think I would *not* be starting with that analysis.

I look forward to discussing this in person, where I'm sure it will be easier.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Josh Berkus
Date:
>> Well, this *is* the purpose of the cluster-hackers group
> 
> Well, I tried all available means to discuss my ideas before
> organising an external meeting. You can think of the InCore meeting as
> an extension of the cluster hackers meeting if you wish.

That comment wasn't for you, it was for other folks who didn't seem to
be aware that some of these ideas had already been discussed.  You, of
course, know because you were there.

And, if you need a 3-hour meeting there's no question that you need the
evening thing.  There's just not enough time in anyone's schedule.

> I'm not sure why you think I would *not* be starting with that analysis.

That wasn't the impression I got from your prior emails to hackers.
Thanks for the clarification!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Tatsuo Ishii
Date:
> Those are the basic requirements that I am trying to address. There
> are a great many important details, but the core of this is probably
> what I would call "logical replication", that is shipping changes to
> other nodes in a way that does not tie us to the same physical
> representation that recovery/streaming replication does now. Of
> course, non-physical replication can take many forms.

Guessing from "shipping changes to other nodes", you seem to
implicitly aim at asynchronous replication? If so, I am afraid it will
force users to pay some cost to migrate their existing applications.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Tue, May 1, 2012 at 1:10 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Those are the basic requirements that I am trying to address. There
>> are a great many important details, but the core of this is probably
>> what I would call "logical replication", that is shipping changes to
>> other nodes in a way that does not tie us to the same physical
>> representation that recovery/streaming replication does now. Of
>> course, non-physical replication can take many forms.
>
> Guessing from "shipping changes to other nodes", you seem to
> implicitly aim at asynchronous replication? If so, I am afraid it will
> force users to pay some cost to migrate their existing applications.

The main plan is to use the existing streaming mechanism. That allows
changes/LCRs to be shipped synchronously to some nodes and
asynchronously to others.

We need a mechanism that works across continents, so eager replication
is not the most likely candidate. So the focus would be on shipping
changes after they have been made using lazy replication.

It sounds a bit strange but we can have synchronous lazy replication,
if you wish it.

None of the above presumes anything about the content or role of the
LCRs being shipped.
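One way to picture "synchronous lazy replication" (a toy model in Python; no real PostgreSQL machinery is implied): commit waits for the standby to acknowledge receipt of the change, while apply happens later.

```python
# Synchronous lazy replication, caricatured: the LCR is shipped and
# acknowledged inside commit (synchronous), but applied later (lazy).
class Standby:
    def __init__(self):
        self.queue = []         # received but not yet applied
        self.applied = []
    def receive(self, lcr):
        self.queue.append(lcr)  # durably queued; ack before applying
        return "ack"
    def apply_pending(self):
        self.applied.extend(self.queue)
        self.queue.clear()

def commit(lcr, standby):
    # synchronous: don't return until the standby has acked receipt
    if standby.receive(lcr) != "ack":
        raise RuntimeError("standby did not acknowledge")
    return "committed"

standby = Standby()
assert commit(("INSERT", "t", {"id": 1}), standby) == "committed"
assert standby.applied == []    # lazy: nothing applied yet
standby.apply_pending()
assert standby.applied == [("INSERT", "t", {"id": 1})]
```

Eager replication, by contrast, would make apply itself part of the commit, which is what doesn't survive intercontinental latencies.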

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Robert Haas
Date:
On Mon, Apr 30, 2012 at 6:43 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Well, this *is* the purpose of the cluster-hackers group, to add backend
> support which would make external replication systems easier to build
> and more efficient.  So far the only real feature to come out of that
> has been the Command Triggers, but if you read the TODO list of that
> group you'll see that it's a laundry list of things replication systems
> need support for in the backend.

Right, but the cluster-hackers group has essentially zero mailing list
traffic and seems to be making no progress on getting any of the
features they need into the backend, with the exception, as you say,
of Command Triggers.  So to say, oh, well, we don't need a new effort
to do this, we have cluster-hackers is just sticking our head in the
sand.  We obviously do need more of an effort than we've heretofore
had, and if Simon/2Q are willing to step it up a few notches, kudos to
them.  Hopefully I'll get a chance to participate as well.

One thing I am a bit uncomfortable with about this whole discussion is
that it seems like a lot of the design and research is happening
off-list, with intent to report results back here later.  While that
is obviously fine at some level, keeping the design under wraps until
the code is written is a strategy that's had notoriously poor success
in this community, and I hope we're not in for a repeat of that
strategy.  I think we ought to be sharing and debugging designs in
public, not internally within 2ndQuadrant - or any other company, or
any other mailing list other than this one.  Nobody enjoys doing a lot
of work on a patch and then having it get killed because Tom Lane - or
someone else - points out some critical flaw in the design, and the
way to avoid that is to try to flush out the flaws in public before
too much work goes into it.  On the flip side I certainly understand
that sometimes you need to take some time to get your house in order
before you can present a coherent public proposal, so don't take an
accusation that anyone is doing anything wrong, just a possible
concern.

> What puts me off about this "let's start from first principles" approach
> is that in our community we have years of experience (in a couple cases,
> over a decade) with logical-level replication.  It seems like the last
> thing we need is Yet Another PostgreSQL Replication System, started over
> from scratch and years away from being production quality.  Don't we
> have enough external replication systems with not enough developers
> behind them?

No.  We have, after over a decade of work, no meaningful core support
for the parts of replication that need core support in order to not
suck.  People are switching from logical replication of various types
to SR in droves, not because they like the draconian restrictions that
HS/SR impose, but rather because it has superior performance and
reliability characteristics.  It has those characteristics because it
has core support.  You cannot build an airplane by bolting wings onto
a car.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Tue, May 1, 2012 at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> One thing I am a bit uncomfortable with about this whole discussion is
> that it seems like a lot of the design and research is happening
> off-list, with intent to report results back here later.  While that
> is obviously fine at some level, keeping the design under wraps until
> the code is written is a strategy that's had notoriously poor success
> in this community, and I hope we're not in for a repeat of that
> strategy.  I think we ought to be sharing and debugging designs in
> public, not internally within 2ndQuadrant - or any other company, or
> any other mailing list other than this one.  Nobody enjoys doing a lot
> of work on a patch and then having it get killed because Tom Lane - or
> someone else - points out some critical flaw in the design, and the
> way to avoid that is to try to flush out the flaws in public before
> too much work goes into it.  On the flip side I certainly understand
> that sometimes you need to take some time to get your house in order
> before you can present a coherent public proposal, so don't take an
> accusation that anyone is doing anything wrong, just a possible
> concern.

I'm aware of the difficulties here. It's a bootstrap problem. If we
start by discussing it then we receive the predictable "that'll never
work" replies. If we start by building a prototype then we get the
"you're hiding it" replies. Neither is ideal.  I'm also well aware of
the "if only you'd shown me your design sooner, I could have saved you
the time" line, which is the stock response from every architect I've
ever met. The truth is, as I'm sure you might guess, that I don't know
everything, so making a few mistakes early on in private saves the
list the time-wasting annoyance of debating early-stage ideas.

All I can say is that I've done this a few times before and the
process is easier once we have a little structure in place, with
details like which bike shed we are painting and what the phrase
"to paint" entails.

My hope is that we can present a basic set of facts and measurements
that will act as a platform for challenge and discussion of
requirements and ideas.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Jim Nasby
Date:
On 4/29/12 6:03 AM, Simon Riggs wrote:
>> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
>> with non-unique rows.
>> And as for any logical updates we will have huge performance problem
>> when doing UPDATE or DELETE on large table with no indexes, but
>> fortunately this problem is on slave, not master;)
> While that is possible, I would favour the do-nothing approach. By
> making the default replication mode = none, we then require a PK to be
> assigned before allowing replication mode = on for a table. Trying to
> replicate tables without PKs is a problem that can wait basically.
>

Something that an in-core method might be able to do that an external
one can't would be to support a method of uniquely identifying rows in
tables with no PKs. A gross example (that undoubtedly wouldn't work in
the real world) would be using TIDs. A real-world implementation might
be based on a hidden serial column.
 
-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


Re: Future In-Core Replication

From
Josh Berkus
Date:
On 5/2/12 10:58 PM, Jim Nasby wrote:
> On 4/29/12 6:03 AM, Simon Riggs wrote:
>>> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
>>> with non-unique rows.
>>> And as for any logical updates we will have huge performance problem
>>> when doing UPDATE or DELETE on large table with no indexes, but
>>> fortunately this problem is on slave, not master;)
>> While that is possible, I would favour the do-nothing approach. By
>> making the default replication mode = none, we then require a PK to be
>> assigned before allowing replication mode = on for a table. Trying to
>> replicate tables without PKs is a problem that can wait basically.
>>
> 
> Something that a in-core method might be able to do that an external one
> can't would be to support a method of uniquely identifying rows in
> tables with no PK's. A gross example (that undoubtedly wouldn't work in
> the real world) would be using TID's. A real-world implementation might
> be based on a hidden serial column.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Josh Berkus
Date:
> Something that a in-core method might be able to do that an external one
> can't would be to support a method of uniquely identifying rows in
> tables with no PK's. A gross example (that undoubtedly wouldn't work in
> the real world) would be using TID's. A real-world implementation might
> be based on a hidden serial column.

Realistically you need more than a serial for MM replication.  For each
row-version, you need:

serverID of last update
serialID of row
timestamp of last update

... and note that this would have to include deleted rows as well.

Currently Bucardo does this by using several fields, but you could put
together one 128-bit field which contains all of this information.  Or
you could do an Instagram and compress it into 64 bits, but that would
require limiting the problem space in a way you probably can't do for
a general tool.
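
As an illustration only, the three fields above could be packed into a
single 128-bit value. The widths chosen here (16-bit serverID, 48-bit
serialID, 64-bit microsecond timestamp) are purely hypothetical, not
what Bucardo or any proposal actually uses:

```python
import struct

# Hypothetical layout: 16-bit serverID + 48-bit serialID +
# 64-bit microsecond timestamp = 128 bits total.
def pack_row_version(server_id, serial_id, ts_micros):
    assert server_id < (1 << 16) and serial_id < (1 << 48)
    hi = (server_id << 48) | serial_id
    return struct.pack(">QQ", hi, ts_micros)   # 16 bytes

def unpack_row_version(blob):
    hi, ts_micros = struct.unpack(">QQ", blob)
    return hi >> 48, hi & ((1 << 48) - 1), ts_micros

v = pack_row_version(7, 123456, 1336137600000000)
assert len(v) == 16                            # exactly 128 bits
assert unpack_row_version(v) == (7, 123456, 1336137600000000)
```

Any real design would also have to pick widths that survive serverID
exhaustion and serial wraparound, which is part of why a general tool
can't easily squeeze this into 64 bits.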


I do agree that depending on user-defined PKs raises a whole host of
issues which we'd rather just sidestep, though.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Future In-Core Replication

From
Simon Riggs
Date:
On Thu, May 3, 2012 at 6:03 PM, Josh Berkus <josh@agliodbs.com> wrote:

> I do agree that depending on user-defined PKs raises a whole host of
> issues which we'd rather just sidestep, though.

What do you have in mind instead?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Greg Smith
Date:
On 05/01/2012 09:09 AM, Robert Haas wrote:
> I think we ought to be sharing and debugging designs in
> public, not internally within 2ndQuadrant - or any other company, or
> any other mailing list other than this one.

OK.  You go first.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Future In-Core Replication

From
Hannu Krosing
Date:
On Thu, 2012-05-03 at 00:58 -0500, Jim Nasby wrote:
> On 4/29/12 6:03 AM, Simon Riggs wrote:
> >> The DML-WITH-LIMIT-1 is required to do single logical updates on tables
> >> with non-unique rows.
> >> And as for any logical updates we will have huge performance problem
> >> when doing UPDATE or DELETE on large table with no indexes, but
> >> fortunately this problem is on slave, not master;)
> > While that is possible, I would favour the do-nothing approach. By
> > making the default replication mode = none, we then require a PK to be
> > assigned before allowing replication mode = on for a table. Trying to
> > replicate tables without PKs is a problem that can wait basically.
> >
> 
> Something that a in-core method might be able to do that an external one 
> can't would be to support a method of uniquely identifying rows in tables
> with no PK's. A gross example (that undoubtedly wouldn't work in the real 
> world) would be using TID's. 
> A real-world implementation might be based on a hidden serial column.

For logical replication we don't really need to uniquely identify such
rows - it should be sufficient if we just update exactly one of the
matching rows.

The way to do this is to put all fields of the OLD.* tuple in the WHERE
clause and then update just one matching row.

IIRC updating (or deleting) CURRENT OF a cursor is currently supported
only in pl/pgsql so this needs to be done using a plpgsql cursor.

If the table has no indexes or index lookup returns lots of rows, then
this is bound to be slow, but in this case it was probably slow on
master too :)
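
The semantics described here - match on every column of the OLD.*
tuple, then touch exactly one of possibly several identical matches -
can be sketched as a toy model. This is only an illustration in
Python, not the plpgsql cursor implementation itself:

```python
# Toy model of DML-WITH-LIMIT-1: apply a logical UPDATE to exactly one
# of the rows matching the full OLD.* tuple, even when the table
# contains duplicate rows and has no primary key.
def apply_logical_update(rows, old, new):
    for i, row in enumerate(rows):
        if row == old:          # WHERE clause over all OLD.* fields
            rows[i] = new       # update just this one matching row
            return True         # stop after one match: LIMIT 1
    return False                # no matching row on the replica

table = [("a", 1), ("a", 1), ("b", 2)]       # note the duplicate row
assert apply_logical_update(table, old=("a", 1), new=("a", 9))
assert table == [("a", 9), ("a", 1), ("b", 2)]
```

As the mail notes, the real cost is the lookup: without an index the
replica scans the whole table per change, just as the master did.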

> -- 
> Jim C. Nasby, Database Architect                   jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net
> 




Re: Future In-Core Replication

From
Robert Haas
Date:
On Fri, May 4, 2012 at 8:32 AM, Hannu Krosing <hannu@krosing.net> wrote:
> For logical we don't really need to uniquely identify such rows - it
> should sufficient if we just update exactly one of the matching rows.
>
> The way to do this is to put all fields of the OLD.* tuple in the WHERE
> clause and then update just one matching row.
>
> IIRC updating (or deleting) CURRENT OF a cursor is currently supported
> only in pl/pgsql so this needs to be done using a plpgsql cursor.
>
> If the table has no indexes or index lookup returns lots of rows, then
> this is bound to be slow, but in this case it was probably slow on
> master too :)

I was about to write a reply saying exactly this, but you said it
better than I would have been able to manage.

I think this is all exactly right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Robert Haas
Date:
On Thu, May 3, 2012 at 8:22 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 05/01/2012 09:09 AM, Robert Haas wrote:
>>
>> I think we ought to be sharing and debugging designs in
>> public, not internally within 2ndQuadrant - or any other company, or
>> any other mailing list other than this one.
>
> OK.  You go first.

Are you requesting more transparency in general, asking for my
thoughts on logical replication specifically, or something else?

I try pretty hard not to go off and do large amounts of work in a
vacuum.  If something is more than a couple days work, I post the
design on hackers and wait for feedback before writing a line of code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Simon Riggs
Date:
On 4 May 2012 14:01, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 4, 2012 at 8:32 AM, Hannu Krosing <hannu@krosing.net> wrote:
>> For logical we don't really need to uniquely identify such rows - it
>> should sufficient if we just update exactly one of the matching rows.
>>
>> The way to do this is to put all fields of the OLD.* tuple in the WHERE
>> clause and then update just one matching row.
>>
>> IIRC updating (or deleting) CURRENT OF a cursor is currently supported
>> only in pl/pgsql so this needs to be done using a plpgsql cursor.
>>
>> If the table has no indexes or index lookup returns lots of rows, then
>> this is bound to be slow, but in this case it was probably slow on
>> master too :)
>
> I was about to write a reply saying exactly this, but you said it
> better than I would have been able to manage.
>
> I think this is all exactly right.

Yes, but it's not a high priority for inclusion. Many things like this
will need to wait behind the really critical additional features.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Greg Smith
Date:
On 05/04/2012 09:03 AM, Robert Haas wrote:
> I try pretty hard not to go off and do large amounts of work in a
> vacuum.  If something is more than a couple days work, I post the
> design on hackers and wait for feedback before writing a line of code.

That is an excellent luxury to have.  You've worked very hard to earn 
it.  Not everyone is in the position where that's possible though.

> Are you requesting more transparency in general, asking for my
> thoughts on logical replication specifically, or something else?

The straw man argument here would require 100% transparency on 
everything you do in regards to PostgreSQL and related software.  Before 
doing any development on any code, first post here to ask for design 
review.  And if someone asks you to work on a program that isn't open 
source from day one, refuse unless you can operate that transparently.

That standard is nice if you can pull it off.  But I don't give you a 
hard time if you have to make some compromises from that ideal to keep 
yourself gainfully employed.  You do a ton of good work for the 
PostgreSQL community in a transparent way, so I assume that you're doing 
the best you can.  I would like to see that assumption extended to our 
side, too.

Here are the individual straw men in this area I'd like to see put out 
of their misery:

"You're developing things in secret":  if that's the case, we're pretty 
bad at it, given the history I outlined at 
http://archives.postgresql.org/message-id/4F9B1B6C.5010300@2ndQuadrant.com

"That discussion didn't happen in the right place":  it's not our fault 
that the cluster-hackers list exists.  Go joust at getting that list 
shut down and their meeting during PGCon canceled if you think it's 
unproductive for discussions to happen there.  I've been trying to 
bridge that gap for over two years now; note how many times I appear in 
the edit history at 
http://wiki.postgresql.org/index.php?title=ClusterFeatures&action=history

"You might do too much development in the wrong direction and not build 
the right thing":  and?  Yes, there are people who develop into a corner 
and end up doing unproductive work as a result.  And there are others 
who submit things and give up when faced with feedback on them.  Last 
time I checked, there wasn't anyone who flat-out rejects on-list 
feedback working for 2ndQuadrant.   Instead, I see features that go 
through extensive and numerous review cycles based on what we hear back.

"Designs should be presented on-list before doing any development": 
this is not always practical for those of us who are doing feature 
development.  Some feature sponsors are still getting used to open 
development.  If we have a private development milestone date to hit *in 
order to get more funding for public PostgreSQL work*, which is often 
the case here, we try not to miss it.  We'd be bad community members to 
do so.  And sometimes that involves building a proof of concept or 
prototype here first, then submitting it to the community once it's 
moved onto being a proven concept.  Since the community has a clear set 
of guidelines for how and when to submit new features, we make sure the 
development plans line up with them.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Future In-Core Replication

From
Robert Haas
Date:
On Fri, May 4, 2012 at 11:06 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> The straw man argument here would require 100% transparency on everything
> you do in regards to PostgreSQL and related software.  Before doing any
> development on any code, first post here to ask for design review.  And if
> someone asks you to work on a program that isn't open source from day one,
> refuse unless you can operate that transparently.

Well, look.  At the end of the day, I don't really care whether you
post your designs before writing code or not - unless it turns out
that we get to the end of the development cycle, a gigantic patch
shows up at the last minute, it gets rejected because people aren't
satisfied with the design, and then massive bitching ensues because
the author(s) put a lot of work into that patch.  Then I care, because
now the fact that no design consensus was sought at the outset has
been transformed into a defect in the community process, which does in
fact have defects, but that isn't one of them.  We all know that
design review is going to have to happen at some point, and if there's
not an adequate opportunity to do that before the code is written then
it will happen after the code is written.  If that means the code has
to be thrown out, then that's the risk you take by writing the code
first.  As long as everybody understands that, do it in whatever order
you like.

I think the real straw man here is the idea that it will somehow save
time to skip the design phase and start writing code.  I have never
worked on a project, open source or otherwise, where that was true,
and I believe that any textbook on software engineering you pick up is
likely to tell you that in fact exactly the opposite is the case.

Obviously, there are times when you need to write some throw-away code
just to see how things shake out, and I do that all the time, and it
makes complete sense, and I'm not knocking it.  But if any of that
code makes it into the committed patch, I count that as unusually
lucky.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Future In-Core Replication

From
Andres Freund
Date:
Hi Robert, Hi all,

On Friday, May 04, 2012 06:29:33 PM Robert Haas wrote:
> On Fri, May 4, 2012 at 11:06 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> > The straw man argument here would require 100% transparency on everything
> > you do in regards to PostgreSQL and related software.  Before doing any
> > development on any code, first post here to ask for design review.  And
> > if someone asks you to work on a program that isn't open source from day
> > one, refuse unless you can operate that transparently.
> 
> Well, look.  At the end of the day, I don't really care whether you
> post your designs before writing code or not - unless it turns out
> that we get to the end of the development cycle, a gigantic patch
> shows up at the last minute, it gets rejected because people aren't
> satisfied with the design, and then massive bitching ensues because
> the author(s) put a lot of work into that patch.  Then I care, because
> now the fact that no design consensus was sought at the outset has
> been transformed into a defect in the community process, which does in
> fact have defects, but that isn't one of them.  We all know that
> design review is going to have to happen at some point, and if there's
> not an adequate opportunity to do that before the code is written then
> it will happen after the code is written.  If that means the code has
> to be thrown out, then that's the risk you take by writing the code
> first.  As long as everybody understands that, do it in whatever order
> you like.
In my understanding - as the person doing quite a bit of the coding atm - the 
point is to provide a very minimal *early* prototype to have a sensible basis 
for design decisions/discussions. On one hand that's useful to get a feeling 
for the problems involved. On the other hand, design discussions on -hackers 
without an underlying basic patch & design tend to drift off into feature 
creep and bikeshedding. It also helps against "this is impossible" claims.
Parts of this thread and related ones are a somewhat good example of this.

The plan is to show the early prototype around pgcon and send design 
documents and split-up patches (of that prototype) to -hackers a holiday 
and some cleanup later. I/We aim to have individual, independently usable 
parts of the patch submitted to the first 9.3 commitfest.

I definitely do not wish to screw anyone over by doing this. And I am 
sure that's the same with the others working on this, even if sometimes 
emotions get in the way/into play.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Future In-Core Replication

From
Robert Haas
Date:
On Fri, May 4, 2012 at 12:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> In my understanding - as the person doing quite a bit of the coding atm - the
> point is to provide a very minimal *early* prototype to have a sensible basis
> for design decisions/discussions. On one side thats useful to get a feeling
> for the problems involved. On the other side doing design discussions without
> an underlaying basic patch & design on -hackers tends to often go into
> directions of feature creep and bikeshedding. It also helps against "this is
> impossible" claims.
> Parts of this thread and related ones are a somewhat good example of this.
>
> The plan is to show the early prototype around pgcon and send design documents
> and split-up patches (of that prototype) a holiday and some cleanup later to -
> hackers. I/We aim to have individual, independently usable, parts of the patch
> submitted to the first 9.3 commitfest.
>
> I definitely do not wish to screw anyone over doing this or such. And I am
> sure thats the same with others working on this. Even if sometimes emotions
> get in the way/into play.

Thanks.  I think we are basically all on the same page, here.  As I
said upthread, I am not trying to accuse anyone of doing anything
wrong, and as you and Simon and Greg have all said, you're not trying
to do anything wrong.  So, no problem.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company