Thread: Re: xReader, double-effort (was: Temporary tables under hot standby)

Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Aakash Goel

Date:

27 April 2012, 16:50:26

<div class="gmail_extra">Sure Kevin, will get the wiki page ready asap, and reply back. Thanks.<br /><br /><div
class="gmail_quote">OnThu, Apr 26, 2012 at 8:10 PM, Kevin Grittner <span dir="ltr"><<a
href="mailto:Kevin.Grittner@wicourts.gov"target="_blank">Kevin.Grittner@wicourts.gov</a>></span> wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">[resending
becauseof <a href="http://postgresql.org" target="_blank">postgresql.org</a> bounces on first try]<br /><br /> Simon
Riggs<<a href="mailto:simon@2ndquadrant.com">simon@2ndquadrant.com</a>> wrote:<br /> > Kevin Grittner <<a
href="mailto:Kevin.Grittner@wicourts.gov">Kevin.Grittner@wicourts.gov</a>>wrote:<br /><br /> >> The GSoC
xReaderproject is intended to be a major step toward<br /> >> that, by providing a way to translate the WAL
streamto a series<br /> >> of notifications of logical events to clients which register with<br /> >>
xReader.<br/> ><br /> > This is already nearly finished in prototype and will be published<br /> > in May.
AndresFreund is working on it, copied here.<br /><br /> URL?<br /><br /> > It looks like there is significant
overlapthere.<br /><br /> Hard for me to know without more information.  It sounds like there<br /> is at least some
overlap. I hope that can involve cooperation, with<br /> the efforts of Andres forming the basis of Aakash's GSoC
effort.<br/> That might leave him more time to polish up the user filters.<br /><br /> Aakash: It seems like we need
thatWiki page rather sooner than<br /> later.  Can you get to that quickly?  I would think that just<br /> copying the
textfrom your approved GSoC proposal would be a very<br /> good start.  If you need help figuring out how to embed the
images<br/> from your proposal, let me know.<br /><span class="HOEnZb"><font color="#888888"><br /> -Kevin<br
/></font></span></blockquote></div><br/></div>

Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Aakash Goel

Date:

27 April 2012, 17:50:15

All, the wiki page is now up at http://wiki.postgresql.org/wiki/XReader.

On Sat, Apr 28, 2012 at 1:19 AM, Aakash Goel <aakash.bits@gmail.com> wrote:

Sure Kevin, will get the wiki page ready asap, and reply back. Thanks.

On Thu, Apr 26, 2012 at 8:10 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
[resending because of postgresql.org bounces on first try]

Simon Riggs <simon@2ndquadrant.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

>> The GSoC xReader project is intended to be a major step toward
>> that, by providing a way to translate the WAL stream to a series
>> of notifications of logical events to clients which register with
>> xReader.
>
> This is already nearly finished in prototype and will be published
> in May. Andres Freund is working on it, copied here.

URL?

> It looks like there is significant overlap there.

Hard for me to know without more information. It sounds like there
is at least some overlap. I hope that can involve cooperation, with
the efforts of Andres forming the basis of Aakash's GSoC effort.
That might leave him more time to polish up the user filters.

Aakash: It seems like we need that Wiki page rather sooner than
later. Can you get to that quickly? I would think that just
copying the text from your approved GSoC proposal would be a very
good start. If you need help figuring out how to embed the images
from your proposal, let me know.

-Kevin

Re: xReader, double-effort (was: Temporary tables under hot standby)

From

"Kevin Grittner"

Date:

27 April 2012, 18:04:25

[replaced bad email address for Josh (which was my fault)] 
Aakash Goel <aakash.bits@gmail.com> wrote: 
> All, the wiki page is now up at
>  http://wiki.postgresql.org/wiki/XReader.
Note that the approach Aakash is taking doesn't involve changes to
the backend code, it is strictly a standalone executable to which
functions as a proxy to a hot standby and to which clients like
replications systems connect.  There is a possible additional
configuration which wouldn't require a hot standby, if time permits.
I am not clear on whether 2nd Quadrant's code takes this approach
or builds it into the server.  I think we need to know that much
before we can get very far in discussion.
-Kevin

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Andres Freund

Date:

27 April 2012, 18:43:02

On Friday, April 27, 2012 11:04:04 PM Kevin Grittner wrote:
> [replaced bad email address for Josh (which was my fault)]
> 
> Aakash Goel <aakash.bits@gmail.com> wrote:
> > All, the wiki page is now up at
> > 
> >  http://wiki.postgresql.org/wiki/XReader.
> 
> Note that the approach Aakash is taking doesn't involve changes to
> the backend code, it is strictly a standalone executable to which
> functions as a proxy to a hot standby and to which clients like
> replications systems connect.  There is a possible additional
> configuration which wouldn't require a hot standby, if time permits.
> I am not clear on whether 2nd Quadrant's code takes this approach
> or builds it into the server.  I think we need to know that much
> before we can get very far in discussion.
In the current, prototypal, state there is one component thats integrated into 
the server (because it needs information thats only available there). That 
component is layered ontop of a totally generic xlog reading/parsing library 
that doesn't care at all where its running. Its also used in another cluster 
to read the received (filtered) stream.
I plan to submit the XLogReader (thats what its called atm) before everything 
else, so everybody can take a look as soon as possible.

I took a *very* short glance over the current wiki description of xReader and 
from that it seems to me it would benefit from trying to make it 
architecturally more similar to the rest of pg. I also would suggest reviewing 
how the current walreceiver/sender, and their protocol, work.

Andres


-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

"Kevin Grittner"

Date:

27 April 2012, 19:18:57

Andres Freund <andres@2ndquadrant.com> wrote:
> In the current, prototypal, state there is one component thats
> integrated into the server (because it needs information thats
> only available there).
The xReader design was based on the idea that it would be nice not
to cause load on the master machine, and that by proxying the WAL
stream to the HS, using synchronous replication style to write from
xReader to the HS, you could use the HS for a source for that data
with it being at exactly the right point in time to query it.
I'm not convinced that I would rather see the logic fixed inside the
master as opposed to being deployable on the master's machine, the
slave machine, or even on its own machine in between.
> That component is layered ontop of a totally generic xlog
> reading/parsing library that doesn't care at all where its
> running.
That's cool.
> Its also used in another cluster to read the received (filtered)
> stream.
I don't quite follow what you're saying there.
> I plan to submit the XLogReader (thats what its called atm)
> before everything else, so everybody can take a look as soon as
> possible.
Great!  That will allow more discussion and planning.
> I took a *very* short glance over the current wiki description of
> xReader and from that it seems to me it would benefit from trying
> to make it architecturally more similar to the rest of pg.
We're planning on using existing protocol to talk between pieces. 
Other than breaking it out so that it can run somewhere other than
inside the server, and allowing clients to connect to xReader to
listen to WAL events of interest, are you referring to anything
else?
> I also would suggest reviewing how the current walreceiver/sender,
> and their protocol, work.
Of course!  The first "inch-stone" in the GSoC project plan
basically consists of creating an executable that functions as a
walreceiver and a walsender to just pass things through from the
master to the slave.  We build from there by allowing clients to
connect (again, over existing protocol) and register for events of
interest, and then recognizing different WAL records to generate
events.  The project was just going to create a simple client to
dump the information to disk, but with the time saved by adopting
what you've already done, that might leave more time for generating
a useful client.
Aakash, when you get a chance, could you fill in the "inch-stones"
from the GSoC proposal page onto the Wiki page?  I think the
descriptions of those interim steps would help people understand
your proposal better.  Obviously, some of the particulars of tasks
and the dates may need adjustment based on the new work which is
expected to appear before you start, but what's there now would be a
helpful reference.
-Kevin

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Andres Freund

Date:

27 April 2012, 19:33:01

Hi Kevin, Hi Aakash,

On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > In the current, prototypal, state there is one component thats
> > integrated into the server (because it needs information thats
> > only available there).
> The xReader design was based on the idea that it would be nice not
> to cause load on the master machine, and that by proxying the WAL
> stream to the HS, using synchronous replication style to write from
> xReader to the HS, you could use the HS for a source for that data
> with it being at exactly the right point in time to query it.
Yes, that does make sense for some workloads. I don't think its viable for 
everything though, thats why were not aiming for that ourselves atm.

> I'm not convinced that I would rather see the logic fixed inside the
> master as opposed to being deployable on the master's machine, the
> slave machine, or even on its own machine in between.
I don't think that you can do everything apart from the master. We currently 
need shared memory for coordination between the moving parts, thats why we 
have it inside the master.
It also have the advantage of being easier to setup.

> > That component is layered ontop of a totally generic xlog
> > reading/parsing library that doesn't care at all where its
> > running.
> That's cool.

> > Its also used in another cluster to read the received (filtered)
> > stream.
> I don't quite follow what you're saying there.
To interpret the xlog back into something that can be used for replication you 
need to read it again. After filtering we again write valid WAL, so we can use 
the same library on the sending|filtering side and on the receiving side.
But thats actually off topic for this thread ;)


> > I took a *very* short glance over the current wiki description of
> > xReader and from that it seems to me it would benefit from trying
> > to make it architecturally more similar to the rest of pg.
> We're planning on using existing protocol to talk between pieces.
> Other than breaking it out so that it can run somewhere other than
> inside the server, and allowing clients to connect to xReader to
> listen to WAL events of interest, are you referring to anything
> else?
It sounds like the xReader is designed to be one multiplexing process. While 
this definitely has some advantages resource-usage-wise it doesn't seem to be 
fitting the rest of the design that well. The advantages might outweigh 
everything else, but I am not sure about that.
Something like registering/deregistering also doesn't fit that well with the 
way walsender works as far as I understand it.

Greetings,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

"Kevin Grittner"

Date:

27 April 2012, 19:59:07

Andres Freund <andres@2ndquadrant.com> wrote:
> Something like registering/deregistering also doesn't fit that
> well with the way walsender works as far as I understand it.
If you look at the diagrams on the xReader Wiki page, the lines
labeled "XLOG stream" are the ones using walsender/walreceiver.  The
green arrows represent normal connections to the database, to run
queries to retrieve metadata needed to interpret the WAL records,
and the lines labeled "Listener n" are expected to use the pg
protocol to connect, but won't be talking page-oriented WAL -- they
will be dealing with logical interpretation of the WAL.  The sort of
data which could be fed to a database which doesn't have the same
page images.  Like Slony et al do.
Perhaps, given other points you made, the library for interpreting
the WAL records could be shared, and hopefully a protocol for the
clients, although that seems a lot more muddy to me at this point. 
If we can share enough code, there may be room for both approaches
with minimal code duplication.
-Kevin

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Simon Riggs

Date:

28 April 2012, 05:04:48

On Fri, Apr 27, 2012 at 11:18 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:

> I'm not convinced that I would rather see the logic fixed inside the
> master as opposed to being deployable on the master's machine, the
> slave machine, or even on its own machine in between.

There are use cases where the translation from WAL to logical takes
place on the master, the standby or other locations.

It's becoming clear that filtering records on the source is important
in high bandwidth systems, so the initial work focuses on putting that
on the "master", i.e. the source. Which was not my first thought
either. If you use cascading, this would still allow you to have
master -> standby -> logical.

Translating WAL is a very hard task. Some time ago, I did also think
an external tool would help (my initial design was called xfilter),
but I no longer think that is likely to work very well apart from very
simple cases.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Aakash Goel

Date:

28 April 2012, 05:25:04

<div class="gmail_extra"><span style="style">> Aakash, when you get a chance, could you fill in the
"inch-stones"</span><brstyle="style" /><span style="style">> from the GSoC proposal page onto the Wiki
page?</span></div><divclass="gmail_extra"><font color="#222222" face="arial, sans-serif"><br /></font></div><div
class="gmail_extra">Sure, <a
href="http://wiki.postgresql.org/wiki/XReader">http://wiki.postgresql.org/wiki/XReader</a> updated.<fontcolor="#222222"
face="arial,sans-serif"><br /></font><br /><div class="gmail_quote">On Sat, Apr 28, 2012 at 3:48 AM, Kevin Grittner
<spandir="ltr"><<a href="mailto:Kevin.Grittner@wicourts.gov"
target="_blank">Kevin.Grittner@wicourts.gov</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0
0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">Andres Freund <<a
href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>>wrote:<br /><br /> > In the current, prototypal,
statethere is one component thats<br /> > integrated into the server (because it needs information thats<br /> >
onlyavailable there).<br /><br /></div>The xReader design was based on the idea that it would be nice not<br /> to
causeload on the master machine, and that by proxying the WAL<br /> stream to the HS, using synchronous replication
styleto write from<br /> xReader to the HS, you could use the HS for a source for that data<br /> with it being at
exactlythe right point in time to query it.<br /><br /> I'm not convinced that I would rather see the logic fixed
insidethe<br /> master as opposed to being deployable on the master's machine, the<br /> slave machine, or even on its
ownmachine in between.<br /><div class="im"><br /> > That component is layered ontop of a totally generic xlog<br />
>reading/parsing library that doesn't care at all where its<br /> > running.<br /><br /></div>That's cool.<br
/><divclass="im"><br /> > Its also used in another cluster to read the received (filtered)<br /> > stream.<br
/><br/></div>I don't quite follow what you're saying there.<br /><div class="im"><br /> > I plan to submit the
XLogReader(thats what its called atm)<br /> > before everything else, so everybody can take a look as soon as<br />
>possible.<br /><br /></div>Great!  That will allow more discussion and planning.<br /><div class="im"><br /> > I
tooka *very* short glance over the current wiki description of<br /> > xReader and from that it seems to me it would
benefitfrom trying<br /> > to make it architecturally more similar to the rest of pg.<br /><br /></div>We're
planningon using existing protocol to talk between pieces.<br /> Other than breaking it out so that it can run
somewhereother than<br /> inside the server, and allowing clients to connect to xReader to<br /> listen to WAL events
ofinterest, are you referring to anything<br /> else?<br /><div class="im"><br /> > I also would suggest reviewing
howthe current walreceiver/sender,<br /> > and their protocol, work.<br /><br /></div>Of course!  The first
"inch-stone"in the GSoC project plan<br /> basically consists of creating an executable that functions as a<br />
walreceiverand a walsender to just pass things through from the<br /> master to the slave.  We build from there by
allowingclients to<br /> connect (again, over existing protocol) and register for events of<br /> interest, and then
recognizingdifferent WAL records to generate<br /> events.  The project was just going to create a simple client to<br
/>dump the information to disk, but with the time saved by adopting<br /> what you've already done, that might leave
moretime for generating<br /> a useful client.<br /><br /> Aakash, when you get a chance, could you fill in the
"inch-stones"<br/> from the GSoC proposal page onto the Wiki page?  I think the<br /> descriptions of those interim
stepswould help people understand<br /> your proposal better.  Obviously, some of the particulars of tasks<br /> and
thedates may need adjustment based on the new work which is<br /> expected to appear before you start, but what's there
nowwould be a<br /> helpful reference.<br /><span class="HOEnZb"><font color="#888888"><br /> -Kevin<br
/></font></span></blockquote></div><br/></div>

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Aakash Goel

Date:

28 April 2012, 05:29:58

<div class="gmail_extra">Hello Andres,</div><div class="gmail_extra"><br /></div><div class="gmail_extra"><div
class="im"style="style">>> The xReader design was based on the idea that it would be nice not<br />>> to
causeload on the master machine, and that by proxying the WAL<br /> >> stream to the HS, using synchronous
replicationstyle to write from<br />>> xReader to the HS, you could use the HS for a source for that data<br
/>>>with it being at exactly the right point in time to query it.<br /></div><span style="style">>Yes, that
doesmake sense for some workloads. I don't think its viable for</span><br style="style" /><span
style="style">>everythingthough, thats why were not aiming for that ourselves atm.</span> </div><div
class="gmail_extra"><br/></div><div class="gmail_extra">Regarding the above, what would be a case where querying the HS
willnot suffice?<br /><br /><div class="gmail_quote">On Sat, Apr 28, 2012 at 4:02 AM, Andres Freund <span
dir="ltr"><<ahref="mailto:andres@2ndquadrant.com" target="_blank">andres@2ndquadrant.com</a>></span> wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Kevin, Hi
Aakash,<br/><div class="im"><br /> On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote:<br /> > Andres
Freund<<a href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>> wrote:<br /> > > In the current,
prototypal,state there is one component thats<br /> > > integrated into the server (because it needs information
thats<br/> > > only available there).<br /> > The xReader design was based on the idea that it would be nice
not<br/> > to cause load on the master machine, and that by proxying the WAL<br /> > stream to the HS, using
synchronousreplication style to write from<br /> > xReader to the HS, you could use the HS for a source for that
data<br/> > with it being at exactly the right point in time to query it.<br /></div>Yes, that does make sense for
someworkloads. I don't think its viable for<br /> everything though, thats why were not aiming for that ourselves
atm.<br/><div class="im"><br /> > I'm not convinced that I would rather see the logic fixed inside the<br /> >
masteras opposed to being deployable on the master's machine, the<br /> > slave machine, or even on its own machine
inbetween.<br /></div>I don't think that you can do everything apart from the master. We currently<br /> need shared
memoryfor coordination between the moving parts, thats why we<br /> have it inside the master.<br /> It also have the
advantageof being easier to setup.<br /><div class="im"><br /> > > That component is layered ontop of a totally
genericxlog<br /> > > reading/parsing library that doesn't care at all where its<br /> > > running.<br />
>That's cool.<br /><br /> > > Its also used in another cluster to read the received (filtered)<br /> > >
stream.<br/> > I don't quite follow what you're saying there.<br /></div>To interpret the xlog back into something
thatcan be used for replication you<br /> need to read it again. After filtering we again write valid WAL, so we can
use<br/> the same library on the sending|filtering side and on the receiving side.<br /> But thats actually off topic
forthis thread ;)<br /><div class="im"><br /><br /> > > I took a *very* short glance over the current wiki
descriptionof<br /> > > xReader and from that it seems to me it would benefit from trying<br /> > > to make
itarchitecturally more similar to the rest of pg.<br /> > We're planning on using existing protocol to talk between
pieces.<br/> > Other than breaking it out so that it can run somewhere other than<br /> > inside the server, and
allowingclients to connect to xReader to<br /> > listen to WAL events of interest, are you referring to anything<br
/>> else?<br /></div>It sounds like the xReader is designed to be one multiplexing process. While<br /> this
definitelyhas some advantages resource-usage-wise it doesn't seem to be<br /> fitting the rest of the design that well.
Theadvantages might outweigh<br /> everything else, but I am not sure about that.<br /> Something like
registering/deregisteringalso doesn't fit that well with the<br /> way walsender works as far as I understand it.<br
/><br/> Greetings,<br /><div class="HOEnZb"><div class="h5"><br /> Andres<br /> --<br />  Andres Freund                
   <a href="http://www.2ndQuadrant.com/" target="_blank">http://www.2ndQuadrant.com/</a><br />  PostgreSQL Development,
24x7Support, Training & Services<br /></div></div></blockquote></div><br /></div>

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Tom Lane

Date:

28 April 2012, 12:07:17

Simon Riggs <simon@2ndQuadrant.com> writes:
> Translating WAL is a very hard task.

No kidding.  I would think it's impossible on its face.  Just for
starters, where will you get table and column names from?  (Looking at
the system catalogs is cheating, and will not work reliably anyway.)

IMO, if we want non-physical replication, we're going to need to build
it in at a higher level than after-the-fact processing of WAL.
I foresee wasting quite a lot of effort on the currently proposed
approaches before we admit that they're unworkable.
        regards, tom lane

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Robert Haas

Date:

29 April 2012, 17:33:28

On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Translating WAL is a very hard task.
>
> No kidding.  I would think it's impossible on its face.  Just for
> starters, where will you get table and column names from?  (Looking at
> the system catalogs is cheating, and will not work reliably anyway.)
>
> IMO, if we want non-physical replication, we're going to need to build
> it in at a higher level than after-the-fact processing of WAL.
> I foresee wasting quite a lot of effort on the currently proposed
> approaches before we admit that they're unworkable.

I think the question we should be asking ourselves is not whether WAL
as it currently exists is adequate for logical replication, but rather
or not it could be made adequate.  For example, suppose that we were
to arrange things so that, after each checkpoint, the first insert,
update, or delete record for a given relfilenode after each checkpoint
emits a special WAL record that contains the relation name, schema
OID, attribute names, and attribute type OIDs.  Well, now we are much
closer to being able to do some meaningful decoding of the tuple data,
and it really doesn't cost us that much.  Handling DDL (and manual
system catalog modifications) seems pretty tricky, but I'd be very
reluctant to give up on it without banging my head against the wall
pretty hard.  The trouble with giving up on WAL completely and moving
to a separate replication log is that it means a whole lot of
additional I/O, which is bound to have a negative effect on
performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Hannu Krosing

Date:

29 April 2012, 19:01:10

On Sun, 2012-04-29 at 16:33 -0400, Robert Haas wrote:
> On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> >> Translating WAL is a very hard task.
> >
> > No kidding.  I would think it's impossible on its face.  Just for
> > starters, where will you get table and column names from?  (Looking at
> > the system catalogs is cheating, and will not work reliably anyway.)
> >
> > IMO, if we want non-physical replication, we're going to need to build
> > it in at a higher level than after-the-fact processing of WAL.
> > I foresee wasting quite a lot of effort on the currently proposed
> > approaches before we admit that they're unworkable.
> 
> I think the question we should be asking ourselves is not whether WAL
> as it currently exists is adequate for logical replication, but rather
> or not it could be made adequate.  

Agreed. 

> For example, suppose that we were
> to arrange things so that, after each checkpoint, the first insert,
> update, or delete record for a given relfilenode after each checkpoint
> emits a special WAL record that contains the relation name, schema
> OID, attribute names, and attribute type OIDs.  

Not just the first after checkpoint, but also the first after a schema
change, even though will duplicate the wals with changes to system
catalog, it is likely much cheaper overall to always have a fresh
structure in wal stream.

And if we really want to do WAL-->logical-->SQL_text conversion on a
host separate from the master, we also need to insert there the type
definitions of user-defined types together with at least types output
functions in some form .

So you basically need a large part of postgres for reliably making sense
of WAL.

> Well, now we are much
> closer to being able to do some meaningful decoding of the tuple data,
> and it really doesn't cost us that much.  Handling DDL (and manual
> system catalog modifications) seems pretty tricky, but I'd be very
> reluctant to give up on it without banging my head against the wall
> pretty hard. 

Most straightforward way is to have a more or less full copy of
pg_catalog also on the "WAL-filtering / WAL-conversion" node, and to use
it in 1:1 replicas of transactions recreated from the WAL .
This way we can avoid recreating any alternate views of the masters
schema.

Then again, we could do it all on master and inside the wal-writing
transaction and thus avoid large chunk of the problems.

If the receiving side is also PostgreSQL with same catalog structure
(i.e same major version) then we don't actually need to "handle DDL" in
any complicated way, it would be enough to just carry over the changes
to system tables .

The main reason we don't do it currently for trigger-based logical
replication is the restriction of not being able to have triggers on
system tables. 

I hope it is much easier to have the triggerless record generation also
work on system tables.

> The trouble with giving up on WAL completely and moving
> to a separate replication log is that it means a whole lot of
> additional I/O, which is bound to have a negative effect on
> performance.

Why would you give up WAL ?

Or do you mean that the new "logical-wal" needs to have same commit time
behaviour as WAL to be reliable ?

I'd envision a scenario where the logi-wal is sent to slave or
distribution hub directly and not written at the local host at all. 
An optionally sync mode similar to current sync WAL replication could be
configured. I hope this would run mostly in parallel with local WAL
generation so not much extra wall-clock time would be wasted.

> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 

-- 
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Robert Haas

Date:

30 April 2012, 00:26:20

On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> I think the question we should be asking ourselves is not whether WAL
>> as it currently exists is adequate for logical replication, but rather
>> or not it could be made adequate.
>
> Agreed.

And of course I meant "but rather whether or not it could be made
adequate", but I dropped a word.

>> For example, suppose that we were
>> to arrange things so that, after each checkpoint, the first insert,
>> update, or delete record for a given relfilenode after each checkpoint
>> emits a special WAL record that contains the relation name, schema
>> OID, attribute names, and attribute type OIDs.
>
> Not just the first after checkpoint, but also the first after a schema
> change, even though will duplicate the wals with changes to system
> catalog, it is likely much cheaper overall to always have a fresh
> structure in wal stream.

Yes.

> And if we really want to do WAL-->logical-->SQL_text conversion on a
> host separate from the master, we also need to insert there the type
> definitions of user-defined types together with at least types output
> functions in some form .

Yes.

> So you basically need a large part of postgres for reliably making sense
> of WAL.

Agreed, but I think that's a problem we need to fix and not a
tolerable situation at all.  If a user can create a type-output
function that goes and looks at the state of the database to determine
what to output, then we are completely screwed, because that basically
means you would need to have a whole Hot Standby instance up and
running just to make it possible to run type output functions.  Now
you might be able to build a mechanism around that that is useful to
some people in some situations, but wow does that sound painful.  What
I want is for the master to be able to cheaply rattle off the tuples
that got inserted, updated, or deleted as those things happen; needing
a whole second copy of the database just to do that does not meet my
definition of "cheap".  Furthermore, it's not really clear that it's
sufficient anyway, since there are problems with what happens before
the HS instance reaches consistency, what happens when it crashes and
restarts, and how do we handle the case when the system catalog we
need to examine to generate the logical replication records is
access-exclusive-locked?  Seems like a house of cards.

Some of this might be possible to mitigate contractually, by putting
limits on what type input/output functions are allowed to do.  Or we
could invent a new analog of type input/output functions that is
explicitly limited in this way, and support only types that provide
it.  But I think the real key is that we can't rely on catalog access:
the WAL stream has to have enough information to allow the reader to
construct some set of in-memory hash tables with sufficient detail to
reliably decode WAL.  Or at least that's what I'm thinking.

> Most straightforward way is to have a more or less full copy of
> pg_catalog also on the "WAL-filtering / WAL-conversion" node, and to use
> it in 1:1 replicas of transactions recreated from the WAL .
> This way we can avoid recreating any alternate views of the masters
> schema.

See above; I have serious doubts that this can ever be made to work robustly.

> Then again, we could do it all on master and inside the wal-writing
> transaction and thus avoid large chunk of the problems.
>
> If the receiving side is also PostgreSQL with same catalog structure
> (i.e same major version) then we don't actually need to "handle DDL" in
> any complicated way, it would be enough to just carry over the changes
> to system tables .

I agree it'd be preferable to handle DDL in terms of system catalog
updates, rather than saying, well, this is an ALTER TABLE .. RENAME.
But you need to be able to decode tuples using the right tuple
descriptor, even while that's changing under you.

> Why would you give up WAL ?

For lack of ability to make it work.  Don't underestimate how hard
it's going to nail this down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Tom Lane

Date:

30 April 2012, 00:29:53

Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> So you basically need a large part of postgres for reliably making sense
>> of WAL.

> Agreed, but I think that's a problem we need to fix and not a
> tolerable situation at all.  If a user can create a type-output
> function that goes and looks at the state of the database to determine
> what to output, then we are completely screwed, because that basically
> means you would need to have a whole Hot Standby instance up and
> running just to make it possible to run type output functions.

You mean like enum_out?  Or for that matter array_out, record_out,
range_out?
        regards, tom lane

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Robert Haas

Date:

30 April 2012, 00:33:43

On Sun, Apr 29, 2012 at 11:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>>> So you basically need a large part of postgres for reliably making sense
>>> of WAL.
>
>> Agreed, but I think that's a problem we need to fix and not a
>> tolerable situation at all.  If a user can create a type-output
>> function that goes and looks at the state of the database to determine
>> what to output, then we are completely screwed, because that basically
>> means you would need to have a whole Hot Standby instance up and
>> running just to make it possible to run type output functions.
>
> You mean like enum_out?  Or for that matter array_out, record_out,
> range_out?

Yeah, exactly.  :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company