Thread: logical changeset generation v6

logical changeset generation v6

From
Andres Freund
Date:
Hi!

Attached you can find the newest version of the logical changeset
generation patchset. Reduced by a couple of patches because they have
been committed last round. Hurray, and thanks!

The explanation of how to use the patch from last time:
http://archives.postgresql.org/message-id/20130614224817.GA19641%40awork2.anarazel.de
still holds true, so I am not going to repeat it here.

The individual patches are:
0001 wal_decoding: Allow walsender's to connect to a specific database
    One logical decoding operation can only decode content from one
    database at a time. Because of that the walsender needs to connect
    to a specific database. The earlier "replication=on/off" parameter
    now also accepts the value "database", which allows that (a small
    connection sketch is appended after the patch list below).

0002 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
    Imo relatively unproblematic and even useful without changeset extraction.

0003 wal_decoding: Add information about a tables primary key to struct RelationData
    Not many comments on this in the past. Kevin thinks we might want to
    choose the best candidate key in a more elaborate manner.

0004 wal_decoding: Introduce wal decoding via catalog timetravel
    The actual feature. Got cleaned up and shrunk since the last submission.

0005 wal_decoding: test_decoding: Add a simple decoding module in contrib
    Example output plugin that's also used for testing.

0006 wal_decoding: pg_receivellog: Introduce pg_receivexlog equivalent for logical changes
    Commandline utility to receive the changestream and manipulate slots.

0007 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding
    Allows not only creating and destroying logical slots (which is part of
    0005), but also receiving the changestream via an SQL SRF.
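
As a quick illustration of 0001 (a sketch, not part of the patchset): the only
detail that matters below is the "replication=database" keyword value described
in the 0001 entry above; the database name and the absence of error handling
are purely illustrative.

    /* minimal libpq sketch: open a walsender connection bound to one database */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("replication=database dbname=postgres");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }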

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment

Re: logical changeset generation v6

From
Peter Eisentraut
Date:
What's with 0001-Improve-regression-test-for-8410.patch?  Did you mean
to include that?




Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-15 10:03:54 -0400, Peter Eisentraut wrote:
> What's with 0001-Improve-regression-test-for-8410.patch?  Did you mean
> to include that?

Gah, no. That's already committed and unrelated. Stupid wildcard.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Peter Eisentraut
Date:
On Sat, 2013-09-14 at 22:49 +0200, Andres Freund wrote:
> Attached you can find the newest version of the logical changeset
> generation patchset.

You probably have bigger things to worry about, but please check the
results of cpluspluscheck, because some of the header files don't
include header files they depend on.

(I guess that's really pgcompinclude's job to find out, but
cpluspluscheck seems to be easier to use.)





Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-15 11:20:20 -0400, Peter Eisentraut wrote:
> On Sat, 2013-09-14 at 22:49 +0200, Andres Freund wrote:
> > Attached you can find the newest version of the logical changeset
> > generation patchset.
> 
> You probably have bigger things to worry about, but please check the
> results of cpluspluscheck, because some of the header files don't
> include header files they depend on.

Hm. I tried to get that right, but it's been a while since I last
checked. I don't regularly use cpluspluscheck because it doesn't work in
VPATH builds... We really need to fix that.

I'll push a fix for that to the git tree, don't think that's worth a
resend in itself.

Thanks,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Peter Eisentraut
Date:
On 9/15/13 11:30 AM, Andres Freund wrote:
> On 2013-09-15 11:20:20 -0400, Peter Eisentraut wrote:
>> On Sat, 2013-09-14 at 22:49 +0200, Andres Freund wrote:
>>> Attached you can find the newest version of the logical changeset
>>> generation patchset.
>>
>> You probably have bigger things to worry about, but please check the
>> results of cpluspluscheck, because some of the header files don't
>> include header files they depend on.
> 
> Hm. I tried to get that right, but it's been a while since I last
> checked. I don't regularly use cpluspluscheck because it doesn't work in
> VPATH builds... We really need to fix that.
> 
> I'll push a fix for that to the git tree, don't think that's worth a
> resend in itself.

This patch set now fails to apply because of the commit "Rename various
"freeze multixact" variables".




Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-17 09:45:28 -0400, Peter Eisentraut wrote:
> On 9/15/13 11:30 AM, Andres Freund wrote:
> > On 2013-09-15 11:20:20 -0400, Peter Eisentraut wrote:
> >> On Sat, 2013-09-14 at 22:49 +0200, Andres Freund wrote:
> >>> Attached you can find the newest version of the logical changeset
> >>> generation patchset.
> >>
> >> You probably have bigger things to worry about, but please check the
> >> results of cpluspluscheck, because some of the header files don't
> >> include header files they depend on.
> >
> > Hm. I tried to get that right, but it's been a while since I last
> > checked. I don't regularly use cpluspluscheck because it doesn't work in
> > VPATH builds... We really need to fix that.
> >
> > I'll push a fix for that to the git tree, don't think that's worth a
> > resend in itself.
>
> This patch set now fails to apply because of the commit "Rename various
> "freeze multixact" variables".

And I am even partially guilty for that patch...

Rebased patches attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6

From
Fujii Masao
Date:
On Tue, Sep 17, 2013 at 11:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-17 09:45:28 -0400, Peter Eisentraut wrote:
>> On 9/15/13 11:30 AM, Andres Freund wrote:
>> > On 2013-09-15 11:20:20 -0400, Peter Eisentraut wrote:
>> >> On Sat, 2013-09-14 at 22:49 +0200, Andres Freund wrote:
>> >>> Attached you can find the newest version of the logical changeset
>> >>> generation patchset.
>> >>
>> >> You probably have bigger things to worry about, but please check the
>> >> results of cpluspluscheck, because some of the header files don't
>> >> include header files they depend on.
>> >
>> > Hm. I tried to get that right, but it's been a while since I last
>> > checked. I don't regularly use cpluspluscheck because it doesn't work in
>> > VPATH builds... We really need to fix that.
>> >
>> > I'll push a fix for that to the git tree, don't think that's worth a
>> > resend in itself.
>>
>> This patch set now fails to apply because of the commit "Rename various
>> "freeze multixact" variables".
>
> And I am even partially guilty for that patch...
>
> Rebased patches attached.

When I applied all the patches and do the compile, I got the following error:

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I. -I../../../../src/include -D_GNU_SOURCE   -c -o
snapbuild.o snapbuild.c
snapbuild.c:187: error: redefinition of typedef 'SnapBuild'
../../../../src/include/replication/snapbuild.h:45: note: previous
declaration of 'SnapBuild' was here
make[4]: *** [snapbuild.o] Error 1


When I applied only
0001-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch,
compiled the source, and set up the asynchronous replication, I got
the segmentation
fault.
   LOG:  server process (PID 12777) was terminated by signal 11:
Segmentation fault

Regards,

-- 
Fujii Masao



Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-19 14:08:36 +0900, Fujii Masao wrote:
> When I applied all the patches and do the compile, I got the following error:
>
> gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -I. -I../../../../src/include -D_GNU_SOURCE   -c -o
> snapbuild.o snapbuild.c
> snapbuild.c:187: error: redefinition of typedef 'SnapBuild'
> ../../../../src/include/replication/snapbuild.h:45: note: previous
> declaration of 'SnapBuild' was here
> make[4]: *** [snapbuild.o] Error 1

Hm. Somebody had reported that previously and I tried to fix it but
obviously I failed. Unfortunately I don't see that warning in any of the
gcc versions I have tried locally.

Hopefully fixed.

> When I applied only
> 0001-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch,
> compiled the source, and set up the asynchronous replication, I got
> the segmentation
> fault.

Fixed, I mismerged something, sorry for that.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6

From
Robert Haas
Date:
On Tue, Sep 17, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Rebased patches attached.

I spent a bit of time looking at these patches yesterday and today.
It seems to me that there's a fair amount of stylistic cleanup that is
still needed, and some pretty bad naming choices, and some FIXMEs that
should probably be fixed, but for an initial review email it seems
more helpful to focus on high-level design, so here goes.

- Looking specifically at the 0004 patch, I think that the
RecentGlobalDataXmin changes are logically separate from the rest of
the patch, and that if we're going to commit them at all, it should be
separate from the rest of this.  I think this is basically a
performance optimization.  AFAICS, there's no correctness problem with
the idea of continuing to maintain a single RecentGlobalXmin; it's
just that you'll potentially end up with quite a lot of bloat.  But
that's not an argument that those two things HAVE to be committed
together; either could be done first, and independently of the other.
Also, these changes are quite complex and it's difficult to form a
judgement as to whether this idea is safe when they are intermingled
with the rest of the logical replication stuff.

More generally, the thing that bugs me about this approach is that
logical replication is not really special, but what you've done here
MAKES it special. There are plenty of other situations where we are
too aggressive about holding back xmin.  A single-statement
transaction that selects from a single table holds back xmin for the
entire cluster, and that is Very Bad.  It would probably be unfair to
say, well, you have to solve that problem first.  But on the other
hand, how do we know that the approach you've adopted here isn't going
to make the more general problem harder to solve?  It seems invasive
at a pretty low level.  I think we should at least spend some time
thinking about what *general* solutions to this problem would look
like and then decide whether this approach is likely to be
forward-compatible with those solutions.

- There are no changes to the "doc" directory.  Obviously, if you're
going to add a new value for the wal_level GUC, it's gonna need to be
documented. Similarly, pg_receivellog needs to be documented.  In all
likelihood, you also need a whole chapter providing general background
on this technology.  A couple of README files is not going to do it,
and those files aren't suitable for check-in anyway (e.g. DESIGN.txt
makes reference to a URL where the current version of some patch can
be found; that's not appropriate for permanent documentation).  But
aside from that, what we really need here is user documentation, not
developer documentation.  I can perhaps pass judgement on whether the
guts of this functionality do things that are fundamentally unsafe,
but whether the user interface is good or bad is a question that
deserves broader input, and without documentation, most people aren't
going to understand it well enough to know whether they like it.  And
TBH, *I* don't really want to reverse-engineer what pg_receivellog
does from a read-through of the code, either.

- Given that we don't reassemble transactions until commit time, why
do we need to ensure that XIDs are logged before their sub-XIDs
appear in WAL?  As I've said previously, I actually think that
on-the-fly reassembly is probably going to turn out to be very
important.  But if we're not getting that, do we really need this?
Also, while I'm trying to keep this email focused on high-level
concerns, I have to say that guaranteedlyLogged has got to be one of
the worst variable names I've ever seen, starting (but not ending)
with the fact that guaranteedly is not a word.  I'm also tempted to
say that all of the wal_level=logical changes should be split out as
their own patch, separate from the decoding stuff.  Personally, I
would find that much easier to review, although I admit it's less
critical here than for the RecentGlobalDataXmin stuff.

- If XLOG_HEAP_INPLACE is not decoded, doesn't that mean that this
facility can't be used to replicate a system catalog table?  Is that a
restriction we should enforce/document somehow?

- The new code is rather light on comments.  decode.c is extremely
light. For example, consider the function DecodeAbort(), which
contains the following comment:

+       /*
+        * this is a bit grotty, but if we're "faking" an abort we've
already gone
+        * through
+        */

Well, I have no idea what that means.  I'm sure you do, but I bet the
next person who isn't you that tries to understand this probably
won't.  It's also probably true that I could figure it out if I spent
more time on it, but I think the point of comments is to keep the
amount of time that must be spent trying to understand code to a
manageable level.  Generally, I'd suggest that any non-trivial
functions in these files should have a header comment explaining what
their job is; e.g. for DecodeStandbyOp you could write something like
"Decode an RM_STANDBY WAL record.  Currently, we only care about
XLOG_RUNNING_XACTS records, which tell us about transactions that may
have aborted without writing an explicit abort record."  Or
whatever the right explanation is.  And then particularly tricky bits
should have their own comments.

- It still bothers me that we're going to have mandatory slots for
logical replication and no slots for physical replication.  Why are
slots mandatory in one case and not even allowable in the other?  Is
it sensible to think about slotless logical replication - or maybe I
should say, is it any LESS sensible than slotless physical
replication?

- What is the purpose of (Un)SuspendDecodingSnapshots?  It seems that
should be explained somewhere.  I have my doubts about how safe that
is.  And I definitely think that SetupDecodingSnapshots() is not OK.
Overwriting the satisfies functions in static pointers may be a great
way to make sure you've covered all bases during development, but I
can't see us wanting that ugliness in the official sources.

- I don't really like "time travel" as a name for reconstructing a
previous snapshot of a catalog.  Maybe it's as good as anything, but
it also doesn't help that "during decoding" is used in some places to
refer to the same concept.  I wonder if we should call these "logical
replication snapshots" or "historical MVCC snapshots" or somesuch and
then try to make the terminology consistent throughout.
ReorderBufferTXN->does_timetravel really means "time travel will be
needed to decode what this transaction did", which is not really the
same thing.

That's as much relatively-big-picture stuff as I'm able to notice on a
first read-through.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6

From
Andres Freund
Date:
Hi Robert,

On 2013-09-19 10:02:31 -0400, Robert Haas wrote:
> On Tue, Sep 17, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Rebased patches attached.
> 
> I spent a bit of time looking at these patches yesterday and today.
> It seems to me that there's a fair amount of stylistic cleanup that is
> still needed, and some pretty bad naming choices, and some FIXMEs that
> should probably be fixed, but for an initial review email it seems
> more helpful to focus on high-level design, so here goes.

Thanks for looking at it.

Yes, I think the highlevel stuff is the important bit.

As you note, the documentation needs to be written and that's certainly
not a small task. But doing so before the highlevel design is agreed
upon makes it too likely that it will need to be entirely scrapped.

> - Looking specifically at the 0004 patch, I think that the
> RecentGlobalDataXmin changes are logically separate from the rest of
> the patch, and that if we're going to commit them at all, it should be
> separate from the rest of this.  I think this is basically a
> performance optimization.  AFAICS, there's no correctness problem with
> the idea of continuing to maintain a single RecentGlobalXmin; it's
> just that you'll potentially end up with quite a lot of bloat.  But
> that's not an argument that those two things HAVE to be committed
> together; either could be done first, and independently of the other.
> Also, these changes are quite complex and it's difficult to form a
> judgement as to whether this idea is safe when they are intermingled
> with the rest of the logical replication stuff.

Up until v3 the RecentGlobalDataXmin stuff wasn't included and reviewers
(primarily Peter G. on -hackers and Greg Stark at pgconf.eu) remarked on
that and considered it critical. I argued for a considerable amount of
time that it shouldn't be done in an initial patch and then gave in.

They have a point though, if you e.g. replicate a pgbench -j16 workload
the addition of RecentGlobalDataXmin reduces the performance impact of
replication from about 60% to less than 5% in my measurements. Turns out
heap pruning is damn critical for that kind of workload.

> More generally, the thing that bugs me about this approach is that
> logical replication is not really special, but what you've done here
> MAKES it special. There are plenty of other situations where we are
> too aggressive about holding back xmin.  A single-statement
> transaction that selects from a single table holds back xmin for the
> entire cluster, and that is Very Bad.  It would probably be unfair to
> say, well, you have to solve that problem first.  But on the other
> hand, how do we know that the approach you've adopted here isn't going
> to make the more general problem harder to solve?  It seems invasive
> at a pretty low level.

The reason why I think it's actually different is that the user actually
has control over how long transactions are running on the primary. They
don't really control how fast a replication consumer consumes and how
often it sends feedback messages.

> I think we should at least spend some time
> thinking about what *general* solutions to this problem would look
> like and then decide whether this approach is likely to be
> forward-compatible with those solutions.

I thought about the general case for a good bit and decided that all
solutions that work in a more general scenario are complex enough that I
don't want to implement them. And I don't really see any backward
compatibility concerns here - removing the logic of using a separate
horizon for user tables in contrast to system tables is pretty trivial
and shouldn't have any external effect. Except pegging the horizon more,
but that's what the new approach would fix, right?

> - Given that we don't reassemble transactions until commit time, why
> do we need to ensure that XIDs are logged before their sub-XIDs
> appear in WAL?

Currently it's important to know where the oldest transaction that is
alive started at to determine from where we need to restart
decoding. That's done by keeping a lsn-ordered list of in progress
toplevel transactions. The above invariant makes it cheap to maintain
that list.

> As I've said previously, I actually think that
> on-the-fly reassembly is probably going to turn out to be very
> important. But if we're not getting that, do we really need this?

It's also preparatory for supporting that.

I agree that it's pretty important, but after actually having
implemented a replication solution using this, I still think that most
usecases won't use it even when available. I plan to work on implementing
that.

> Also, while I'm trying to keep this email focused on high-level
> concerns, I have to say that guaranteedlyLogged has got to be one of
> the worst variable names I've ever seen, starting (but not ending)
> with the fact that guaranteedly is not a word.  I'm also tempted to
> say that all of the wal_level=logical changes should be split out as
> their own patch, separate from the decoding stuff.  Personally, I
> would find that much easier to review, although I admit it's less
> critical here than for the RecentGlobalDataXmin stuff.

I can do that again and it actually was that way in the past. But
there's no user for it before the later patch and it's hard to
understand the reasoning for the changed wal logging separately; that's
why I merged it at some point.

> - If XLOG_HEAP_INPLACE is not decoded, doesn't that mean that this
> facility can't be used to replicate a system catalog table?  Is that a
> restriction we should enforce/document somehow?

Currently catalog tables aren't replicated, yes. They simply are skipped
during decoding. XLOG_HEAP_INPLACE isn't the primary reason for that
though.

Do you see a usecase for it?

> - The new code is rather light on comments.  decode.c is extremely
> light.

Will improve. I think most of the other code is better commented, but it
still could use quite a bit of improvement nonetheless.

> - It still bothers me that we're going to have mandatory slots for
> logical replication and no slots for physical replication.  Why are
> slots mandatory in one case and not even allowable in the other?  Is
> it sensible to think about slotless logical replication - or maybe I
> should say, is it any LESS sensible than slotless physical
> replication?

Well, as you know, I do want to have slots for physical replication as
well. But there actually is a fundamental difference why we need it for
logical rep and not for physical: In physical replication, if the xmin
progresses too far, client queries will be cancelled. Annoying but not
fatal. In logical replication we will not be able to continue
replicating since we cannot decode the WAL stream without a valid
catalog snapshot. If xmin already has progressed too far the tuples
won't be there anymore.

If people think this needs to be a general facility from the start, I
can be convinced that way, but I think there's so much to discuss around
the semantics and different usecases that I'd much prefer to discuss
that later.

> - What is the purpose of (Un)SuspendDecodingSnapshots?  It seems that
> should be explained somewhere.  I have my doubts about how safe that
> is.

I'll document the details if they aren't right now. Consider what
happens if somebody does something like: "VACUUM FULL pg_am;". If we
were to build the relation descriptor of pg_am in an "historical
snapshot", as you coin it, we'd have the wrong filenode in there. And
consequently any future lookups in pg_am will open a file that doesn't
exist.
That problem only exist for non-nailed relations that are accessed
during decoding.

>  And I definitely think that SetupDecodingSnapshots() is not OK.
> Overwriting the satisfies functions in static pointers may be a great
> way to make sure you've covered all bases during development, but I
> can't see us wanting that ugliness in the official sources.

Yes, I don't like it either. I am not sure what to replace it with
though. It's easy enough to fit something in GetCatalogSnapshot() and I
don't have a problem with that, but I am less happy with adding code
like that to GetSnapshotData() for callers that use explicit snapshots.

> - I don't really like "time travel" as a name for reconstructing a
> previous snapshot of a catalog.  Maybe it's as good as anything, but
> it also doesn't help that "during decoding" is used in some places to
> refer to the same concept.

Heh, I think that's me trying to avoid repeating the same term over and
over subconsciously.

> I wonder if we should call these "logical replication snapshots" or
> "historical MVCC snapshots" or somesuch and then try to make the
> terminology consistent throughout.  ReorderBufferTXN->does_timetravel
> really means "time travel will be needed to decode what this
> transaction did", which is not really the same thing.

Hm. ->does_timetravel really is badly named. Yuck. Should be
'->modifies_catalog' or similar.

I'll think whether I can agree with either of the suggested terms or can
think of a better one. Till then I'll try to make the comments more
consistent.

Thanks!

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 10:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> - Looking specifically at the 0004 patch, I think that the
>> RecentGlobalDataXmin changes are logically separate from the rest of
>> the patch, and that if we're going to commit them at all, it should be
>> separate from the rest of this.  I think this is basically a
>> performance optimization.  AFAICS, there's no correctness problem with
>> the idea of continuing to maintain a single RecentGlobalXmin; it's
>> just that you'll potentially end up with quite a lot of bloat.  But
>> that's not an argument that those two things HAVE to be committed
>> together; either could be done first, and independently of the other.
>> Also, these changes are quite complex and it's difficult to form a
>> judgement as to whether this idea is safe when they are intermingled
>> with the rest of the logical replication stuff.
>
> Up until v3 the RecentGlobalDataXmin stuff wasn't included and reviewers
> (primarily Peter G. on -hackers and Greg Stark at pgconf.eu) remarked on
> that and considered it critical. I argued for a considerable amount of
> time that it shouldn't be done in an initial patch and then gave in.
>
> They have a point though: if you e.g. replicate a pgbench -j16 workload
> the addition of RecentGlobalDataXmin reduces the performance impact of
> replication from about 60% to less than 5% in my measurements. Turns out
> heap pruning is damn critical for that kind of workload.

No question.  I'm not saying that that optimization shouldn't go in
right after the main patch does, but IMHO right now there are too many
things going in the 0004 patch to discuss them all simultaneously.
I'd like to find a way of splitting this up that will let us
block-and-tackle individual pieces of it, even if we end up committing
them all one right after the other.

But that raises an interesting question: why is the overhead so bad?
I mean, this shouldn't be any worse than having a series of
transactions running concurrently with pgbench that take a snapshot
and hold it for as long as it takes the decoding process to decode the
most-recently committed transaction.  Is the issue here that we can't
advance xmin until we've fsync'd the fruits of decoding down to disk?
If so, that's mighty painful.  But we'd really only need to hold back
xmin in that situation when some catalog change has occurred
meanwhile, which for pgbench means never.  So something seems fishy
here.

> I thought about the general case for a good bit and decided that all
> solutions that work in a more general scenario are complex enough that I
> don't want to implement them. And I don't really see any backward
> compatibility concerns here - removing the logic of using a separate
> horizon for user tables in contrast to system tables is pretty trivial
> and shouldn't have any external effect. Except pegging the horizon more,
> but that's what the new approach would fix, right?

Hmm, maybe.

>> Also, while I'm trying to keep this email focused on high-level
>> concerns, I have to say that guaranteedlyLogged has got to be one of
>> the worst variable names I've ever seen, starting (but not ending)
>> with the fact that guaranteedly is not a word.  I'm also tempted to
>> say that all of the wal_level=logical changes should be split out as
>> their own patch, separate from the decoding stuff.  Personally, I
>> would find that much easier to review, although I admit it's less
>> critical here than for the RecentGlobalDataXmin stuff.
>
> I can do that again and it actually was that way in the past. But
> there's no user for it before the later patch and it's hard to
> understand the reasoning for the changed wal logging separately, that's
> why I merged it at some point.

OK.  If I'm committing it, I'd prefer to handle that piece separately,
if possible.

>> - If XLOG_HEAP_INPLACE is not decoded, doesn't that mean that this
>> facility can't be used to replicate a system catalog table?  Is that a
>> restriction we should enforce/document somehow?
>
> Currently catalog tables aren't replicated, yes. They simply are skipped
> during decoding. XLOG_HEAP_INPLACE isn't the primary reason for that
> though.
>
> Do you see a usecase for it?

I can imagine someone wanting to do it, but I think we can live with
it not being supported.

>> - It still bothers me that we're going to have mandatory slots for
>> logical replication and no slots for physical replication.  Why are
>> slots mandatory in one case and not even allowable in the other?  Is
>> it sensible to think about slotless logical replication - or maybe I
>> should say, is it any LESS sensible than slotless physical
>> replication?
>
> Well, as you know, I do want to have slots for physical replication as
> well. But there actually is a fundamental difference why we need it for
> logical rep and not for physical: In physical replication, if the xmin
> progresses too far, client queries will be cancelled. Annoying but not
> fatal. In logical replication we will not be able to continue
> replicating since we cannot decode the WAL stream without a valid
> catalog snapshot. If xmin already has progressed too far the tuples
> won't be there anymore.
>
> If people think this needs to be a general facility from the start, I
> can be convinced that way, but I think there's so much to discuss around
> the semantics and different usecases that I'd much prefer to discuss
> that later.

I'm worried that if we don't know how the physical replication slots
are going to work, they'll end up being randomly different from the
logical replication slots, and that'll be an API wart which we'll have
a hard time getting rid of later.

>> - What is the purpose of (Un)SuspendDecodingSnapshots?  It seems that
>> should be explained somewhere.  I have my doubts about how safe that
>> is.
>
> I'll document the details if they aren't right now. Consider what
> happens if somebody does something like: "VACUUM FULL pg_am;". If we
> were to build the relation descriptor of pg_am in an "historical
> snapshot", as you coin it, we'd have the wrong filenode in there. And
> consequently any future lookups in pg_am will open a file that doesn't
> exist.
> That problem only exist for non-nailed relations that are accessed
> during decoding.

But if it's some user table flagged with the terribly-named
treat_as_catalog_table flag, then they could have not only changed the
relfilenode but also the tupledesc.  And then you can't just wave your
hands at the problem.

>>  And I definitely think that SetupDecodingSnapshots() is not OK.
>> Overwriting the satisfies functions in static pointers may be a great
>> way to make sure you've covered all bases during development, but I
>> can't see us wanting that ugliness in the official sources.
>
> Yes, I don't like it either. I am not sure what to replace it with
> though. It's easy enough to fit something in GetCatalogSnapshot() and I
> don't have a problem with that, but I am less happy with adding code
> like that to GetSnapshotData() for callers that use explicit snapshots.

I'm not sure exactly what a good solution would like, either.  I just
think this isn't it.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6

From
Andres Freund
Date:
Hi,

On 2013-09-19 12:05:35 -0400, Robert Haas wrote:
> No question.  I'm not saying that that optimization shouldn't go in
> right after the main patch does, but IMHO right now there are too many
> things going in the 0004 patch to discuss them all simultaneously.
> I'd like to find a way of splitting this up that will let us
> block-and-tackle individual pieces of it, even if we end up committing
> them all one right after the other.

Fine with me. I was criticized for splitting up stuff too much before ;)

Expect a finer-grained series.

> But that raises an interesting question: why is the overhead so bad?
> I mean, this shouldn't be any worse than having a series of
> transactions running concurrently with pgbench that take a snapshot
> and hold it for as long as it takes the decoding process to decode the
> most-recently committed transaction.

Pgbench really slows down scarily if there are some slightly longer
running transactions around...

> Is the issue here that we can't
> advance xmin until we've fsync'd the fruits of decoding down to disk?

Basically yes. We only advance the xmin of the slot so far that we could
still build a valid snapshot to decode the first transaction not
confirmed to have been synced to disk by the client.

> If so, that's mighty painful.  But we'd really only need to hold back
> xmin in that situation when some catalog change has occurred
> meanwhile, which for pgbench means never.  So something seems fishy
> here.

It's less simple than that. We need to protect against concurrent DDL
producing deleted rows that we will still need. We need
HeapTupleSatisfiesVacuum() to return HEAPTUPLE_RECENTLY_DEAD not
HEAPTUPLE_DEAD for such rows, right?
The way to do that is to guarantee that if
TransactionIdDidCommit(xmax) is true, TransactionIdPrecedes(xmax, OldestXmin) is also true.
So, we need to peg OldestXmin (as passed to HTSV) to the xid of the
oldest transaction we're still decoding.
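
For reference, a stripped-down sketch of the comparison being discussed (the
helper name is made up; the other identifiers are the backend's own, and
everything except the horizon check against a known-committed xmax is elided):

    #include "postgres.h"
    #include "access/transam.h"
    #include "utils/tqual.h"

    static HTSV_Result
    horizon_decision(TransactionId xmax, TransactionId OldestXmin)
    {
        if (!TransactionIdPrecedes(xmax, OldestXmin))
            return HEAPTUPLE_RECENTLY_DEAD;   /* deleter too recent: keep the row */

        return HEAPTUPLE_DEAD;                /* safe for pruning to remove */
    }

Pegging OldestXmin to the oldest transaction still being decoded keeps the
deleted catalog rows in the first branch.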

I am not sure how you could do that iff somewhere in the future DDL has
started, since there's no interlock preventing anyone from doing so.

> >> - It still bothers me that we're going to have mandatory slots for
> >> logical replication and no slots for physical replication.

> > If people think this needs to be a general facility from the start, I
> > can be convinced that way, but I think there's so much to discuss around
> > the semantics and different usecases that I'd much prefer to discuss
> > that later.
> 
> I'm worried that if we don't know how the physical replication slots
> are going to work, they'll end up being randomly different from the
> logical replication slots, and that'll be an API wart which we'll have
> a hard time getting rid of later.

Hm. I actually think that, minus some s/Logical//g and a mv, there won't be
much need to change on the slot interface itself.

What we need for physical rep is basically to a) store the position up
to where the primary has fsynced the WAL b) store the xmin horizon the standby
currently has.
Sure, we can store more stats (most of pg_stat_replication, perhaps some
more) but that's not functionally critical and not hard to extend.

The points I find daunting are the semantics, like:
* How do we control whether a standby is allowed to prevent WAL file removal? What if archiving is configured?
* How do we control whether a standby is allowed to peg xmin?
* How long do we peg an xmin/WAL file removal if the standby is gone?
* How does the user interface look to remove a slot if a standby is gone?
* How do we decide/control which commands use a slot in which cases?


> >> - What is the purpose of (Un)SuspendDecodingSnapshots?  It seems that
> >> should be explained somewhere.  I have my doubts about how safe that
> >> is.
> >
> > I'll document the details if they aren't right now. Consider what
> > happens if somebody does something like: "VACUUM FULL pg_am;". If we
> > were to build the relation descriptor of pg_am in an "historical
> > snapshot", as you coin it, we'd have the wrong filenode in there. And
> > consequently any future lookups in pg_am will open a file that doesn't
> > exist.
> > That problem only exist for non-nailed relations that are accessed
> > during decoding.
> 
> But if it's some user table flagged with the terribly-named
> treat_as_catalog_table flag, then they could have not only changed the
> relfilenode but also the tupledesc.  And then you can't just wave your
> hands at the problem.

Heh. Well caught.

There's a comment about that somewhere... Those are problematic; my plan
so far is to throw my hands up and forbid ALTER TABLEs that rewrite
those.

I know you don't like that flag and especially its name. I am open to
suggestions to a) rename it b) find a better solution. I am pretty sure
a) is possible but I have severe doubts about any realistic b).

> > Yes, I don't like it either. I am not sure what to replace it with
> > though. It's easy enough to fit something in GetCatalogSnapshot() and I
> > don't have a problem with that, but I am less happy with adding code
> > like that to GetSnapshotData() for callers that use explicit snapshots.
> 
> I'm not sure exactly what a good solution would like, either.  I just
> think this isn't it.  :-)

I know that feeling ;)

Greetings,

Andres Freund



Re: logical changeset generation v6

From
Peter Geoghegan
Date:
On Thu, Sep 19, 2013 at 7:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> More generally, the thing that bugs me about this approach is that
>> logical replication is not really special, but what you've done here
>> MAKES it special. There are plenty of other situations where we are
>> too aggressive about holding back xmin.  A single-statement
>> transaction that selects from a single table holds back xmin for the
>> entire cluster, and that is Very Bad.  It would probably be unfair to
>> say, well, you have to solve that problem first.  But on the other
>> hand, how do we know that the approach you've adopted here isn't going
>> to make the more general problem harder to solve?  It seems invasive
>> at a pretty low level.

I agree that it's invasive, but I am doubtful that pegging the xmin in
a more granular fashion precludes this kind of optimization. We might
have to generalize what Andres has done, which could mean eventually
throwing it out and starting from scratch, but I have a hard time
seeing how that implies an appreciable cost above solving the general
problem first (now that Andres has already implemented the
RecentGlobalDataXmin thing). As I'm sure you appreciate, the cost of
doing the opposite - of solving the general problem first - may be
huge: waiting another release for logical changeset generation.

> The reason why I think it's actually different is that the user actually
> has control over how long transactions are running on the primary. They
> don't really control how fast a replication consumer consumes and how
> often it sends feedback messages.

Right. That's about what I said last year.

I find the following analogy useful: A logical changeset generation
implementation without RecentGlobalDataXmin is kind of like an
old-fashioned nuclear reactor, like the one they had at Chernobyl.
Engineers have to actively work in order to prevent it from
overheating. However, an implementation with RecentGlobalDataXmin is
like a modern, much safer nuclear reactor. Engineers have to actively
work to keep the reactor heated. Which is to say, with
RecentGlobalDataXmin a standby that dies cannot bloat the master too
much (almost as with hot_standby_feedback - that too requires active
participation from the standby to do harm to the master). Without
RecentGlobalDataXmin, the core system and the plugin at the very least
need to worry about that case when a standby dies.

I have a little bit of feedback that I forgot to mention in my earlier
reviews, because I thought it was too trivial then: something about
the name pg_receivellog annoys me in a way that the name
pg_receivexlog does not. Specifically, it looks like someone meant to
type pg_receivelog but fat-fingered it.

-- 
Peter Geoghegan



Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/20/2013 06:33 AM, Andres Freund wrote:
> Hi,
>

> The points I find daunting are the semantics, like:
> * How do we control whether a standby is allowed to prevent WAL file
>    removal? What if archiving is configured?
> * How do we control whether a standby is allowed to peg xmin?
> * How long do we peg an xmin/WAL file removal if the standby is gone?
> * How does the user interface look to remove a slot if a standby is gone?
> * How do we decide/control which commands use a slot in which cases?

I think we are going to want to be flexible enough to support users with 
a couple of different use-cases:
* Some people will want to keep xmin pegged and prevent WAL removal so a 
standby with a slot can always catch up.
* Most people will want to say keep X megabytes of WAL (if needed by a 
behind slot) and keep xmin pegged so that the WAL can be consumed by a 
logical plugin.

I can see us also implementing a restore_command that the walsender 
could use to get archived segments, but for logical replication xmin 
would still need to be low enough.

I don't think the current patch set is incompatible with us later 
implementing any of the above. I'd rather see us focus on getting the 
core functionality committed and worry about a good interface for 
managing slots later.


> Greetings, Andres Freund 

Steve




Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-20 14:15:23 -0700, Peter Geoghegan wrote:
> I have a little bit of feedback that I forgot to mention in my earlier
> reviews, because I thought it was too trivial then: something about
> the name pg_receivellog annoys me in a way that the name
> pg_receivexlog does not. Specifically, it looks like someone meant to
> type pg_receivelog but fat-fingered it.

Yes, you're not the first to dislike it (including me).

pg_receivelogical? Protest now or forever hold your peace.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Peter Geoghegan
Date:
On Mon, Sep 23, 2013 at 1:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> pg_receivelogical? Protest now or forever hold your peace.


I was thinking pg_receiveloglog, but that works just as well.

-- 
Peter Geoghegan



Re: logical changeset generation v6

From
Alvaro Herrera
Date:
Andres Freund escribió:
> On 2013-09-20 14:15:23 -0700, Peter Geoghegan wrote:
> > I have a little bit of feedback that I forgot to mention in my earlier
> > reviews, because I thought it was too trivial then: something about
> > the name pg_receivellog annoys me in a way that the name
> > pg_receivexlog does not. Specifically, it looks like someone meant to
> > type pg_receivelog but fat-fingered it.
> 
> Yes, you're not the first to dislike it (including me).
> 
> pg_receivelogical? Protest now or forever hold your peace.

I had proposed pg_recvlogical

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-23 13:47:05 -0300, Alvaro Herrera wrote:
> Andres Freund escribió:
> > On 2013-09-20 14:15:23 -0700, Peter Geoghegan wrote:
> > > I have a little bit of feedback that I forgot to mention in my earlier
> > > reviews, because I thought it was too trivial then: something about
> > > the name pg_receivellog annoys me in a way that the name
> > > pg_receivexlog does not. Specifically, it looks like someone meant to
> > > type pg_receivelog but fat-fingered it.
> >
> > Yes, you're not the first to dislike it (including me).
> >
> > pg_receivelogical? Protest now or forever hold your peace.
>
> I had proposed pg_recvlogical

I still find it weird/inconsistent to have:
* pg_receivexlog
* pg_recvlogical
binaries, even from the same source directory. Why once "pg_recv" and
once "pg_receive"?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Peter Geoghegan
Date:
On Mon, Sep 23, 2013 at 9:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I still find it weird/inconsistent to have:
> * pg_receivexlog
> * pg_recvlogical
> binaries, even from the same source directory. Why once "pg_recv" and
> once "pg_receive"?

+1

-- 
Peter Geoghegan



Re: logical changeset generation v6

From
Alvaro Herrera
Date:
Andres Freund escribió:
> On 2013-09-23 13:47:05 -0300, Alvaro Herrera wrote:

> > I had proposed pg_recvlogical
> 
> I still find it weird/inconsistent to have:
> * pg_receivexlog
> * pg_recvlogical
> binaries, even from the same source directory. Why once "pg_recv" and
> once "pg_receive"?

Well.  What are the principles we want to follow when choosing a name?
Is consistency the first and foremost consideration?  To me, whether names
are exactly consistent is not all that relevant; I prefer a shorter name
if it embodies all it means.  For that reason I didn't like the
"receiveloglog" suggestion: it's not clear what the two "log" bits are.
To me this suggests that "logical" should not be shortened.  But the
"recv" thing is clear to be "receive", isn't it?  Enough that it can be
shortened without loss of meaning.

If we consider that consistency in naming of tools is uber-important, well,
obviously my proposal is dead.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Peter Eisentraut
Date:
On 9/23/13 12:54 PM, Andres Freund wrote:
> I still find it weird/inconsistent to have:
> * pg_receivexlog
> * pg_recvlogical
> binaries, even from the same source directory. Why once "pg_recv" and
> once "pg_receive"?

It's consistent because they are the same length!

(Obviously, this would severely restrict future tool naming.)

In all seriousness, I like this naming best so far.



Re: logical changeset generation v6

From
Robert Haas
Date:
On Mon, Sep 23, 2013 at 1:11 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Andres Freund escribió:
>> On 2013-09-23 13:47:05 -0300, Alvaro Herrera wrote:
>
>> > I had proposed pg_recvlogical
>>
>> I still find it weird/inconsistent to have:
>> * pg_receivexlog
>> * pg_recvlogical
>> binaries, even from the same source directory. Why once "pg_recv" and
>> once "pg_receive"?
>
> Well.  What are the principles we want to follow when choosing a name?
> Is consistency the first and foremost consideration?  To me, whether names
> are exactly consistent is not all that relevant; I prefer a shorter name
> if it embodies all it means.  For that reason I didn't like the
> "receiveloglog" suggestion: it's not clear what the two "log" bits are.
> To me this suggests that "logical" should not be shortened.  But the
> "recv" thing is clear to be "receive", isn't it?  Enough that it can be
> shortened without loss of meaning.
>
> If we consider that consistency in naming of tools is uber-important, well,
> obviously my proposal is dead.

What exactly is the purpose of this tool?  My impression is that the
"output" of logical replication is a series of function calls to a
logical replication plugin, but does that plugin necessarily have to
produce an output format that gets streamed to a client via a tool
like this?  For example, for replication, I'd think you might want the
plugin to connect to a remote database and directly shove the data in;
for materialized views, we might like to push the changes into delta
relations within the source database.  In either case, there's no
particular need for any sort of client at all, and in fact it would be
much better if none were required.  The existence of a tool like
pg_receivellog seems to presuppose that the goal is spit out logical
change records as text, but I'm not sure that's actually going to be a
very common thing to want to do...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6

From
Peter Geoghegan
Date:
On Mon, Sep 23, 2013 at 8:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The existence of a tool like
> pg_receivellog seems to presuppose that the goal is to spit out logical
> change records as text, but I'm not sure that's actually going to be a
> very common thing to want to do...

Sure, but I think it's still worth having, for debugging purposes and
so on. Perhaps the incorrect presupposition is that it deserves to
live in /bin and not /contrib. Also, even though the tool is
derivative of pg_receivexlog, its reason for existing is sufficiently
different that maybe it deserves an entirely distinct name. On the
other hand, precisely because it's derivative of
receivelog/pg_receivexlog, it kind of makes sense to group them
together like that. So I don't know.


-- 
Peter Geoghegan



Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-23 23:12:53 -0400, Robert Haas wrote:
> What exactly is the purpose of this tool?  My impression is that the
> "output" of logical replication is a series of function calls to a
> logical replication plugin, but does that plugin necessarily have to
> produce an output format that gets streamed to a client via a tool
> like this?

There needs to be a client acking the reception of the data in some
form. There's currently two output methods, SQL and walstreamer, but
there easily could be more; it's basically two functions you have to
write.

There are several reasons I think the tool is useful, starting with the
fact that it makes the initial use of the feature easier. Writing a
client for CopyBoth messages wrapping 'w' style binary messages, with the
correct select() loop isn't exactly trivial. I also think it's actually
useful in "real" scenarios where you want to ship the data to a
remote system for auditing purposes.
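
To illustrate what "isn't exactly trivial" means here, a bare-bones sketch of
such a receive loop in plain libpq (the function name is made up; error
handling, reconnects and the periodic feedback/ack messages are all left out):

    #include <sys/select.h>
    #include <libpq-fe.h>

    static void
    consume_copy_both(PGconn *conn)
    {
        for (;;)
        {
            char   *buf = NULL;
            int     len = PQgetCopyData(conn, &buf, 1);     /* async mode */

            if (len > 0)
            {
                /* buf[0] == 'w' marks a WAL data message: decode the payload
                 * and remember its LSN so it can be acked in a feedback message */
                PQfreemem(buf);
                continue;
            }
            if (len == 0)
            {
                /* nothing buffered: wait until the socket is readable */
                fd_set  input_mask;
                int     sock = PQsocket(conn);

                FD_ZERO(&input_mask);
                FD_SET(sock, &input_mask);
                if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
                    return;
                if (!PQconsumeInput(conn))
                    return;
                continue;
            }
            break;          /* -1: stream ended, -2: error */
        }
    }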

> For example, for replication, I'd think you might want the
> plugin to connect to a remote database and directly shove the data in;

That sounds like a bad idea to me. If you pull the data from the remote
side, you get the data in a streaming fashion and the latency sensitive
part of issuing statements to your local database is done locally.
Doing things synchronously like that also makes it way harder to use
synchronous_commit = off on the remote side, which is a tremendous
efficiency win.

If somebody needs something like this, e.g. because they want to
replicate into hundreds of shards depending on some key or such, the
question I don't know is how to actually initiate the
streaming. Somebody would need to start the logical decoding.

> for materialized views, we might like to push the changes into delta
> relations within the source database.

Yes, that's not a bad usecase and I think the only thing missing to use
output plugins that way is a convenient function to tell up to where
data has been received (aka synced to disk, aka applied).

>  In either case, there's no
> particular need for any sort of client at all, and in fact it would be
> much better if none were required.  The existence of a tool like
> pg_receivellog seems to presuppose that the goal is spit out logical
> change records as text, but I'm not sure that's actually going to be a
> very common thing to want to do...

It doesn't really rely on anything being text - I've used it with a
binary plugin without problems. Obviously you might not want to use -f -
but an actual file instead...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Robert Haas
Date:
On Tue, Sep 24, 2013 at 4:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> There needs to be a client acking the reception of the data in some
> form. There's currently two output methods, SQL and walstreamer, but
> there easily could be further, it's basically two functions you have
> write.
>
> There are several reasons I think the tool is useful, starting with the
> fact that it makes the initial use of the feature easier. Writing a
> client for CopyBoth messages wrapping 'w' style binary messages, with the
> correct select() loop isn't exactly trivial. I also think it's actually
> useful in "real" scenarios where you want to ship the data to a
> remote system for auditing purposes.

I have two basic points here:

- Requiring a client is a short-sighted design.  There's no reason we
shouldn't *support* having a client, but IMHO it shouldn't be the only
way to use the feature.

- Suppose that you use pg_receivellog (or whatever we decide to call
it) to suck down logical replication messages.  What exactly are you
going to do with that data once you've got it?  In the case of
pg_receivexlog it's quite obvious what you will do with the received
files: you'll store them in an archive of some kind and maybe eventually
use them for archive recovery, streaming replication, or PITR.  But
the answer here is a lot less obvious, at least to me.

>> For example, for replication, I'd think you might want the
>> plugin to connect to a remote database and directly shove the data in;
>
> That sounds like a bad idea to me. If you pull the data from the remote
> side, you get the data in a streaming fashion and the latency sensitive
> part of issuing statements to your local database is done locally.
> Doing things synchronously like that also makes it way harder to use
> synchronous_commit = off on the remote side, which is a tremendous
> efficiency win.

This sounds like the voice of experience talking, so I won't argue too
much, but I don't think it's central to my point.  And anyhow, even if
it is a bad idea, that doesn't mean someone won't want to do it.  :-)

> If somebody needs something like this, e.g. because they want to
> replicate into hundreds of shards depending on some key or such, the
> question I don't know is how to actually initiate the
> streaming. Somebody would need to start the logical decoding.

Sounds like a job for a background worker.  It would be pretty swell
if you could write a background worker that connects to a logical
replication slot and then does whatever.

>> for materialized views, we might like to push the changes into delta
>> relations within the source database.
>
> Yes, that's not a bad usecase and I think the only thing missing to use
> output plugins that way is a convenient function to tell up to where
> data has been received (aka synced to disk, aka applied).

Yes.  It feels to me (and I only work here) like the job of the output
plugin ought to be to put the data somewhere, and the replication code
shouldn't make too many assumptions about where it's actually going.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-24 11:04:06 -0400, Robert Haas wrote:
> - Requiring a client is a short-sighted design.  There's no reason we
> shouldn't *support* having a client, but IMHO it shouldn't be the only
> way to use the feature.

There really aren't many limitations preventing you from doing anything
else.

> - Suppose that you use pg_receivellog (or whatever we decide to call
> it) to suck down logical replication messages.  What exactly are you
> going to do with that data once you've got it?  In the case of
> pg_receivexlog it's quite obvious what you will do with the received
> files: you'll store them in archive of some kind and maybe eventually
> use them for archive recovery, streaming replication, or PITR.  But
> the answer here is a lot less obvious, at least to me.

Well, it's not like it's going to be the only client. But it's a useful
one. I don't see this as an argument against pg_receivelogical? Most
sites don't use pg_receivexlog either.
Not having a consumer of the walsender interface included sounds like a
bad idea to me, even if it were only useful for testing. Now, you could
argue it should be in /contrib - and I wouldn't argue against that
except it shares code with the rest of src/bin/pg_basebackup.

> > If somebody needs something like this, e.g. because they want to
> > replicate into hundreds of shards depending on some key or such, the
> > question I don't know is how to actually initiate the
> > streaming. Somebody would need to start the logical decoding.

> Sounds like a job for a background worker.  It would be pretty swell
> if you could write a background worker that connects to a logical
> replication slot and then does whatever.

That's already possible. In that case you don't have to connect to a
walsender, although doing so would give you some parallelism, one
decoding the data, the other processing it ;).

There's one usecase I do not foresee decoupling from the walsender
interface this release though - synchronous logical replication. There
currently are no code changes required to make sync rep work for this,
and decoupling sync rep from walsender is too much to bite off in one
go.

> >> for materialized views, we might like to push the changes into delta
> >> relations within the source database.
> >
> > Yes, that's not a bad usecase and I think the only thing missing to use
> > output plugins that way is a convenient function to tell up to where
> > data has been received (aka synced to disk, aka applied).
> 
> Yes.  It feels to me (and I only work here) like the job of the output
> plugin ought to be to put the data somewhere, and the replication code
> shouldn't make too many assumptions about where it's actually going.

The output plugin just has two functions it calls to send out data,
'prepare_write' and 'write'. The callsite has to provide those
callbacks. Two are included. walsender and an SQL SRF.
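
To make that concrete, a change callback could use the two callbacks roughly
like this - a sketch only, where the context layout, field names and callback
signatures are assumptions for illustration rather than the patch's actual
API (appendStringInfo() and RelationGetRelationName() are existing backend
functions):

static void
my_change_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
			 Relation relation, ReorderBufferChange *change)
{
	/* let the callsite (walsender or SQL SRF) set up an outgoing message */
	ctx->prepare_write(ctx);			/* assumed signature */

	/* ctx->out: assumed to be a StringInfo buffer provided by the callsite */
	appendStringInfo(ctx->out, "change in \"%s\"",
					 RelationGetRelationName(relation));
	/* a real plugin would inspect 'change' here and render old/new tuples */

	/* hand the finished message back to the callsite to be sent out */
	ctx->write(ctx);					/* assumed signature */
}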

Check the 'test_logical_decoding' commit; it includes the SQL consumer.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/24/2013 11:21 AM, Andres Freund wrote:
> Not having a consumer of the walsender interface included sounds like a
> bad idea to me, even if it were only useful for testing. Now, you could
> argue it should be in /contrib - and I wouldn't argue against that
> except it shares code with the rest of src/bin/pg_basebackup.

+1 on pg_receivellog (or whatever better name we pick) being somewhere.
I found the pg_receivellog code very useful as an example and for
debugging/development purposes.
It isn't something that I see as useful for the average user, and I think
the use-cases it meets are closer to other things we usually put in /contrib.

Steve




Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/17/2013 10:31 AM, Andres Freund wrote:
> This patch set now fails to apply because of the commit "Rename various
> "freeze multixact" variables".
> And I am even partially guilty for that patch...
>
> Rebased patches attached.

While testing the logical replication changes against my WIP logical 
slony I am sometimes getting error messages from the WAL sender of the form:
unexpected duplicate for tablespace  X relfilenode  X

The call stack is

HeapTupleSatisfiesMVCCDuringDecoding
heap_hot_search_buffer
index_fetch_heap
index_getnext
systable_getnext
RelidByRelfilenode
ReorderBufferCommit
DecodeCommit
.
.
.


I am working off something based on your version 
e0acfeace6d695c229efd5d78041a1b734583431


Any ideas?

> Greetings,
>
> Andres Freund
>
>
>




Re: logical changeset generation v6

From
Andres Freund
Date:
On 2013-09-25 11:01:44 -0400, Steve Singer wrote:
> On 09/17/2013 10:31 AM, Andres Freund wrote:
> >This patch set now fails to apply because of the commit "Rename various
> >"freeze multixact" variables".
> >And I am even partially guilty for that patch...
> >
> >Rebased patches attached.
> 
> While testing the logical replication changes against my WIP logical slony I
> am sometimes getting error messages from the WAL sender of the form:
> unexpected duplicate for tablespace  X relfilenode  X

Any chance you could provide a setup to reproduce the error?

> Any ideas?

I'll look into it. Could you provide any context on what you're doing
that's being decoded?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/25/2013 11:08 AM, Andres Freund wrote:
> On 2013-09-25 11:01:44 -0400, Steve Singer wrote:
>> On 09/17/2013 10:31 AM, Andres Freund wrote:
>>> This patch set now fails to apply because of the commit "Rename various
>>> "freeze multixact" variables".
>>> And I am even partially guilty for that patch...
>>>
>>> Rebased patches attached.
>> While testing the logical replication changes against my WIP logical slony I
>> am sometimes getting error messages from the WAL sender of the form:
>> unexpected duplicate for tablespace  X relfilenode  X
> Any chance you could provide a setup to reproduce the error?
>

The steps to build a setup that should reproduce this error are:

1.  I had to apply the attached patch on top of your logical replication
branch so my pg_decode_init would know if it was being called as part of
an INIT_REPLICATION or START_REPLICATION.
Unless I have misunderstood something, you probably will want to merge
this fix in.

2.  Get my WIP for adding logical support to slony from:
git@github.com:ssinger/slony1-engine.git branch logical_repl
(4af1917f8418a)
(My code changes to slony are more prototype level code quality than
production code quality)

3.
cd slony1-engine
./configure --with-pgconfigdir=/usr/local/pg94wal/bin     (or whatever)
make
make install

4.   Grab the clustertest framework JAR from
https://github.com/clustertest/clustertest-framework and build up a
clustertest jar file

5.  Create a file
slony1-engine/clustertest/conf/java.conf
that contains the path to the above JAR file  as a shell variable
assignment: ie
CLUSTERTESTJAR=/home/ssinger/src/clustertest/clustertest_git/build/jar/clustertest-coordinator.jar

6.
cp clustertest/conf/disorder.properties.sample
clustertest/conf/disorder.properties


edit disorder.properties to have the proper values for your
environment.  All 6 databases can point at the same postgres instance;
this test will only actually use 2 of them (so far).

7. Run the test
cd clustertest
./run_all_disorder_tests.sh

This involves having the slon connect to the walsender on the database
test1 and replicate the data into test2 (which is a different database
on the same postmaster)

If this setup seems like too much effort I can request one of the
commitfest VM's from Josh and get everything setup there for you.

Steve

>> Any ideas?
> I'll look into it. Could you provide any context to what youre doing
> that's being decoded?
>
> Greetings,
>
> Andres Freund
>


Attachment

Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/25/2013 01:20 PM, Steve Singer wrote:
> On 09/25/2013 11:08 AM, Andres Freund wrote:
>> On 2013-09-25 11:01:44 -0400, Steve Singer wrote:
>>> On 09/17/2013 10:31 AM, Andres Freund wrote:
>>>> This patch set now fails to apply because of the commit "Rename 
>>>> various
>>>> "freeze multixact" variables".
>>>> And I am even partially guilty for that patch...
>>>>
>>>> Rebased patches attached.
>>> While testing the logical replication changes against my WIP logical 
>>> slony I
>>> am sometimes getting error messages from the WAL sender of the form:
>>> unexpected duplicate for tablespace  X relfilenode  X
>> Any chance you could provide a setup to reproduce the error?
>>
>
> The steps to build a setup that should reproduce this error are:
>
> 1.  I had apply the attached patch on top of your logical replication 
> branch so my pg_decode_init  would now if it was being called as part 
> of a INIT_REPLICATION or START_REPLICATION.
> Unless I have misunderstood something you probably will want to merge 
> this fix in
>
> 2.  Get my WIP for adding logical support to slony from: 
> git@github.com:ssinger/slony1-engine.git branch logical_repl 
> (4af1917f8418a)
> (My code changes to slony are more prototype level code quality than 
> production code quality)
>
> 3.
> cd slony1-engine
> ./configure --with-pgconfigdir=/usr/local/pg94wal/bin     (or whatever)
> make
> make install
>
> 4.   Grab the clustertest framework JAR from 
> https://github.com/clustertest/clustertest-framework and build up a 
> clustertest jar file
>
> 5.  Create a file
> slony1-engine/clustertest/conf/java.conf
> that contains the path to the above JAR file  as a shell variable 
> assignment: ie
> CLUSTERTESTJAR=/home/ssinger/src/clustertest/clustertest_git/build/jar/clustertest-coordinator.jar 
>
>
> 6.
> cp clustertest/conf/disorder.properties.sample 
> clustertest/conf/disorder.properties
>
>
> edit disorder.properites to have the proper values for your 
> environment.  All 6 databases can point at the same postgres instance, 
> this test will only actually use 2 of them(so far).
>
> 7. Run the test
> cd clustertest
> ./run_all_disorder_tests.sh
>
> This involves having the slon connect to the walsender on the database 
> test1 and replicate the data into test2 (which is a different database 
> on the same postmaster)
>
> If this setup seems like too much effort I can request one of the 
> commitfest VM's from Josh and get everything setup there for you.
>
> Steve
>
>>> Any ideas?
>> I'll look into it. Could you provide any context to what youre doing
>> that's being decoded?
>>


I've determined that in this test the walsender seems to be hitting
this when it is decoding the transactions behind the slonik
commands that add tables to replication (set add table, set add
sequence).  This is before the SUBSCRIBE SET is submitted.

I've also noticed something else that is strange (but might be 
unrelated).  If I stop my slon process and restart it I get messages like:

WARNING:  Starting logical replication from 0/a9321360
ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00

Where 0/A9321360 was sent in the last packet my slon received from the 
walsender before the restart.

If I force it to restart replication from 0/A9320B00 I see data rows that I
appear to have already seen before the restart.
I think this happens when my slon is killed after processing the data for
0/A9320B00 but before it gets to send the feedback message. Is this expected?



>> Greetings,
>>
>> Andres Freund
>>
>
>
>




Re: logical changeset generation v6.1

From
Andres Freund
Date:
Hi,

Attached you can find an updated version of the series taking in some of
the review comments (the others are queued, not ignored), including:
* split of things from the big "Introduce wal decoding via ..." patch
* fix the bug Steve noticed where CreateLogicalDecodingContext was
  wrongly passed is_init = false where it should have been true
* A number of smaller bugs I noticed while reviewing
* Renaming of some variables, including guaranteedlyLogged ;)
* Comment improvements in decode.c
* rename pg_receivellog to pg_recvlogical

I'll work more on the other points in the next days, so far they are
clear of other big stuff.


0001 wal_decoding: Allow walsender's to connect to a specific database
- as before

0002 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
- as before

0003 wal_decoding: Add information about a tables primary key to struct RelationData
- as before

0004 wal_decoding: Add wal_level = logical and log data required for logical decoding
- split-off patch that contains the WAL format changes including the
  addition of a new wal_level option

0005 wal_decoding: Add option to treat additional tables as catalog tables
- Option to treat a user-defined table as a catalog table, which means it
  can be accessed during logical decoding from an output plugin

0006 wal_decoding: Introduce wal decoding via catalog timetravel
- The guts of changeset extraction, without a user interface

0007 wal_decoding: logical changeset extraction walsender interface
- split-off patch containing the walsender changes, which allow receiving
  the changeset data in a streaming fashion, supporting sync rep and
  such fancy things

0008 wal_decoding: Only peg the xmin horizon for catalog tables during logical decoding
- split-off optimization which reduces the pain 0006 introduces by pegging
  the xmin horizon to the smallest of the logical decoding slots. Now
  it's pegged differently for data tables than for catalog tables

0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
- Example output plugin which is also used in tests

0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes
- renamed client for the walsender interface

0011 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding
- SQL SRF to get data from a decoding slot, also used as a vehicle for
  tests

0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.1

From
Thom Brown
Date:
On 27 September 2013 16:14, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> Attached you can find an updated version of the series taking in some of
> the review comments (the others are queued, not ignored), including:
> * split of things from the big "Introduce wal decoding via ..." patch
> * fix the bug Steve notice where CreateLogicalDecodingContext was passed
>   the wrong is_init = false where it should have been true
> * A number of smaller bugs I noticed while reviewing
> * Renaming of some variables, including guaranteedlyLogged ;)
> * Comment improvements in decode.c
> * rename pg_receivellog to pg_recvlogical
>
> I'll work more on the other points in the next days, so far they are
> clear of other big stuff.
>
>
> 0001 wal_decoding: Allow walsender's to connect to a specific database
> - as before
>
> 0002 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> - as before
>
> 0003 wal_decoding: Add information about a tables primary key to struct RelationData
> - as before
>
> 0004 wal_decoding: Add wal_level = logical and log data required for logical decoding
> - splitof patch that contains the wal format changes including the
>   addition of a new wal_level option
>
> 0005 wal_decoding: Add option to treat additional tables as catalog tables
> - Option to treat user defined table as a catalog table which means it
>   can be accessed during logical decoding from an output plugin
>
> 0006 wal_decoding: Introduce wal decoding via catalog timetravel
> - The guts of changeset extraction, without a user interface
>
> 0007 wal_decoding: logical changeset extraction walsender interface
> - splitof patch containing the walsender changes, which allow to receive
>   the changeset data in a streaming fashion, supporting sync rep and
>   such fancy things
>
> 0008 wal_decoding: Only peg the xmin horizon for catalog tables during logical decoding
> - splitof optimization which reduces the pain 06 introduces by pegging
>   the xmin horizon to the smallest of the logical decoding slots. Now
>   it's pegged differently for data tables than from catalog tables
>
> 0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
> - Example output plugin which is also used in tests
>
> 0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes
> - renamed client for the walsender interface
>
> 0011 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding
> - SQL SRF to get data from a decoding slot, also used as a vehicle for
>   tests
>
> 0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5

I'm encountering a make error:

install  pg_basebackup '/home/thom/Development/psql/bin/pg_basebackup'
install  pg_receivexlog '/home/thom/Development/psql/bin/pg_receivexlog'
install  pg_recvlogical(X) '/home/thom/Development/psql/bin/pg_receivellog'
/bin/dash: 1: Syntax error: "(" unexpected
make[3]: *** [install] Error 2
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/bin/pg_basebackup'
make[2]: *** [install-pg_basebackup-recurse] Error 2

Thom



Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-09-27 16:35:53 +0100, Thom Brown wrote:
> On 27 September 2013 16:14, Andres Freund <andres@2ndquadrant.com> wrote:
> > Hi,
> >
> > Attached you can find an updated version of the series taking in some of
> > the review comments (the others are queued, not ignored), including:
> > * split of things from the big "Introduce wal decoding via ..." patch
> > * fix the bug Steve notice where CreateLogicalDecodingContext was passed
> >   the wrong is_init = false where it should have been true
> > * A number of smaller bugs I noticed while reviewing
> > * Renaming of some variables, including guaranteedlyLogged ;)
> > * Comment improvements in decode.c
> > * rename pg_receivellog to pg_recvlogical
> >
> > I'll work more on the other points in the next days, so far they are
> > clear of other big stuff.
> >
> >
> > 0001 wal_decoding: Allow walsender's to connect to a specific database
> > - as before
> >
> > 0002 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> > - as before
> >
> > 0003 wal_decoding: Add information about a tables primary key to struct RelationData
> > - as before
> >
> > 0004 wal_decoding: Add wal_level = logical and log data required for logical decoding
> > - splitof patch that contains the wal format changes including the
> >   addition of a new wal_level option
> >
> > 0005 wal_decoding: Add option to treat additional tables as catalog tables
> > - Option to treat user defined table as a catalog table which means it
> >   can be accessed during logical decoding from an output plugin
> >
> > 0006 wal_decoding: Introduce wal decoding via catalog timetravel
> > - The guts of changeset extraction, without a user interface
> >
> > 0007 wal_decoding: logical changeset extraction walsender interface
> > - splitof patch containing the walsender changes, which allow to receive
> >   the changeset data in a streaming fashion, supporting sync rep and
> >   such fancy things
> >
> > 0008 wal_decoding: Only peg the xmin horizon for catalog tables during logical decoding
> > - splitof optimization which reduces the pain 06 introduces by pegging
> >   the xmin horizon to the smallest of the logical decoding slots. Now
> >   it's pegged differently for data tables than from catalog tables
> >
> > 0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
> > - Example output plugin which is also used in tests
> >
> > 0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes
> > - renamed client for the walsender interface
> >
> > 0011 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding
> > - SQL SRF to get data from a decoding slot, also used as a vehicle for
> >   tests
> >
> > 0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5
>
> I'm encountering a make error:

Gah. Last-minute changes. Always the same... Updated patch attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/26/2013 02:47 PM, Steve Singer wrote:
>
>
> I've determined that when in this test the walsender seems to be 
> hitting this when it is decode the transactions that are behind the 
> slonik commands to add tables to replication (set add table, set add 
> sequence).  This is before the SUBSCRIBE SET is submitted.
>
> I've also noticed something else that is strange (but might be 
> unrelated).  If I stop my slon process and restart it I get messages 
> like:
>
> WARNING:  Starting logical replication from 0/a9321360
> ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00
>
> Where 0/A9321360 was sent in the last packet my slon received from the 
> walsender before the restart.
>
> If force it to restart replication from 0/A9320B00 I see datarows that 
> I appear to have already seen before the restart.
> I think this is happening when I process the data for 0/A9320B00 but 
> don't get the feedback message my slon was killed. Is this expected?
>
>

I've further narrowed this down to something in (or the combination of)
what the _disorder_replica.altertableaddTriggers(1) stored function does
(i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)).

Which is essentially
* Get an exclusive lock on sl_config_lock
* Get an exclusive lock on the user table in question
* create a trigger (the deny access trigger)
* create a truncate trigger
* create a deny truncate trigger

I am not yet able to replicate the error by issuing the same SQL 
commands from psql, but I must be missing something.

I can replicate this when just using the test_decoding plugin.








>
>>> Greetings,
>>>
>>> Andres Freund
>>>
>>
>>
>>
>
>
>




Re: logical changeset generation v6

From
Andres Freund
Date:
Hi Steve,

On 2013-09-27 17:06:59 -0400, Steve Singer wrote:
> >I've determined that when in this test the walsender seems to be hitting
> >this when it is decode the transactions that are behind the slonik
> >commands to add tables to replication (set add table, set add sequence).
> >This is before the SUBSCRIBE SET is submitted.
> >
> >I've also noticed something else that is strange (but might be unrelated).
> >If I stop my slon process and restart it I get messages like:
> >
> >WARNING:  Starting logical replication from 0/a9321360
> >ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00
> >
> >Where 0/A9321360 was sent in the last packet my slon received from the
> >walsender before the restart.

Uh, that looks like I fumbled some comparison. Let me check.

> I've further narrowed this down to something (or the combination of) what
> the  _disorder_replica.altertableaddTriggers(1);
> stored function does.  (or @SLONYNAMESPACE@.altertableaddTriggers(int);
> 
> Which is essentially
> * Get an exclusive lock on sl_config_lock
> * Get an exclusive lock on the user table in question
> * create a trigger (the deny access trigger)
> * create a truncate trigger
> * create a deny truncate trigger
> 
> I am not yet able to replicate the error by issuing the same SQL commands
> from psql, but I must be missing something.
> 
> I can replicate this when just using the test_decoding plugin.

Thanks. That should get me started with debugging. It's also possibly
fixed in the latest version - one bug fixed there might cause something
like this if the moon stands exactly right.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.1

From
Steve Singer
Date:
On 09/27/2013 11:44 AM, Andres Freund wrote:
> I'm encountering a make error:
> Gah. Lastminute changes. Always the same... Updated patch attached.
>
> Greetings,
>
> Andres Freund
>
>

I'm still encountering an error in the make.

make clean
...
make[3]: Entering directory `/usr/local/src/postgresql/src/bin/pg_basebackup'
rm -f pg_basebackup pg_receivexlog pg_recvlogical(X) \
        pg_basebackup.o pg_receivexlog.o pg_recvlogical.o \
        receivelog.o streamutil.o
/bin/sh: 1: Syntax error: "(" unexpected
make[3]: *** [clean] Error 2

I had to add quotes to the clean command to make it work.



>




Re: logical changeset generation v6

From
Steve Singer
Date:
On 09/27/2013 05:18 PM, Andres Freund wrote:
> Hi Steve,
>
> On 2013-09-27 17:06:59 -0400, Steve Singer wrote:
>>> I've determined that when in this test the walsender seems to be hitting
>>> this when it is decode the transactions that are behind the slonik
>>> commands to add tables to replication (set add table, set add sequence).
>>> This is before the SUBSCRIBE SET is submitted.
>>>
>>> I've also noticed something else that is strange (but might be unrelated).
>>> If I stop my slon process and restart it I get messages like:
>>>
>>> WARNING:  Starting logical replication from 0/a9321360
>>> ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00
>>>
>>> Where 0/A9321360 was sent in the last packet my slon received from the
>>> walsender before the restart.
> Uh, that looks like I fumbled some comparison. Let me check.
>
>> I've further narrowed this down to something (or the combination of) what
>> the  _disorder_replica.altertableaddTriggers(1);
>> stored function does.  (or @SLONYNAMESPACE@.altertableaddTriggers(int);
>>
>> Which is essentially
>> * Get an exclusive lock on sl_config_lock
>> * Get an exclusive lock on the user table in question
>> * create a trigger (the deny access trigger)
>> * create a truncate trigger
>> * create a deny truncate trigger
>>
>> I am not yet able to replicate the error by issuing the same SQL commands
>> from psql, but I must be missing something.
>>
>> I can replicate this when just using the test_decoding plugin.
> Thanks. That should get me started with debugging. Unless it's possibly
> fixed in the latest version, one bug fixed there might cause something
> like this if the moon stands exactly right?

The latest version has NOT fixed the problem.

Also, I was a bit inaccurate in my previous descriptions. To clarify:

1.   I am sometimes getting that 'unexpected duplicate' error
2.   The 'set add table', which triggers those functions that create and
configure triggers, is actually causing the walsender to hit the
following assertion:
#2  0x0000000000773d47 in ExceptionalCondition (
    conditionName=conditionName@entry=0x8cf400 "!(ent->cmin == change->tuplecid.cmin)",
    errorType=errorType@entry=0x7ab830 "FailedAssertion",
    fileName=fileName@entry=0x8cecc3 "reorderbuffer.c",
    lineNumber=lineNumber@entry=1162) at assert.c:54
#3  0x0000000000665480 in ReorderBufferBuildTupleCidHash (txn=0x1b6e610,
    rb=<optimized out>) at reorderbuffer.c:1162
#4  ReorderBufferCommit (rb=0x1b6e4f8, xid=<optimized out>,
    commit_lsn=3461001952, end_lsn=<optimized out>) at reorderbuffer.c:1285
#5  0x000000000065f0f7 in DecodeCommit (xid=<optimized out>,
    nsubxacts=<optimized out>, sub_xids=<optimized out>, ninval_msgs=16,
    msgs=0x1b637c0, buf=0x7fff54d01530, buf=0x7fff54d01530,
    ctx=0x1adb928, ctx=0x1adb928) at decode.c:477


I had added an assert(false) to the code where the 'unexpected duplicate'
error was logged to make spotting it easier, but yesterday I didn't
double-check whether I was hitting the assertion I added or this other
one.  I can't yet say if these are two unrelated issues or if I'd get to
the 'unexpected duplicate' message immediately after.




> Greetings,
>
> Andres Freund
>




Re: logical changeset generation v6.1

From
Alvaro Herrera
Date:
Steve Singer wrote:

> I'm still encountering an error in the make.
> 
> make clean
> .
> .make[3]: Entering directory
> `/usr/local/src/postgresql/src/bin/pg_basebackup'
> rm -f pg_basebackup pg_receivexlog pg_recvlogical(X) \
>         pg_basebackup.o pg_receivexlog.o pg_recvlogical.o \
>         receivelog.o streamutil.o
> /bin/sh: 1: Syntax error: "(" unexpected
> make[3]: *** [clean] Error 2
> 
> I had to add a quotes in to the clean commands to make it work

The proper fix is to add a $ to the pg_recvlogical(X) in "clean" -- should be $(X)

There's another bug in the Makefile: the install target is installing
recvlogical$(X) as receivellog$(X).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.1

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> Attached you can find an updated version of the series taking in some of
> the review comments

I don't know whether this is related to the previously-reported
build problems, but when I apply each patch in turn, with make -j4
world && make check-world for each step, I die during compile of
0004.

make[4]: Entering directory `/home/kgrittn/pg/master/src/backend/access/transam'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
-I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o xlog.o xlog.c -MMD -MP -MF .deps/xlog.Po
xlog.c:44:33: fatal error: replication/logical.h: No such file or directory
compilation terminated.
make[4]: *** [xlog.o] Error 1

I tried maintainer-clean and a new ./configure to see if that would
get me past it; no joy.  I haven't dug further, but if this is not
a known issue I can poke around.  If it is known -- how do I get
past it?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
Hi,

The series from Friday was a bit too buggy - obviously I was too
tired. So here's a new one:

* fix pg_recvlogical makefile (Thanks Steve)
* fix two commits not compiling properly without later changes (Thanks Kevin)
* keep track of commit timestamps
* fix bugs with option passing in test_logical_decoding
* actually parse option values in test_decoding instead of just using the
  option name
* don't use anonymous structs in unions. That's compiler specific (msvc
  and gcc) before C11 on which we can't rely. That unfortunately will
  break output plugins because ReorderBufferChange need to qualify
  old/new tuples now
* improve error handling/cleanup in test_logical_decoding
* some minor cleanups

Patches attached, git tree updated.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.1

From
Robert Haas
Date:
Review comments on 0004:

- In heap_insert and heap_multi_insert, please rewrite the following
comment for clarity: "add record for the buffer without actual content
thats removed if fpw is done for that buffer".
- In heap_delete, the assignment to need_tuple_data() need not
separately check RelationNeedsWAL(), as RelationIsLogicallyLogged()
does that.
- It seems that HeapSatisfiesHOTandKeyUpdate is now
HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
afraid that something unscalable is happening to this function.  On a
related note, any overhead added here costs broadly; I'm not sure if
there's enough to worry about.
- MarkCurrentTransactionIdLoggedIfAny has superfluous braces.
- AssignTransactionId changes "Mustn't" to "May not", which seems like
an entirely pointless change.
- You've removed a blank line just before IsSystemRelation; this is an
unnecessary whitespace change.
- Do none of the callers of IsSystemRelation() care about the fact
that you've considerably changed the semantics?
- RelationIsDoingTimetravel is still a crappy name.  How about
RelationRequiredForLogicalDecoding?  And maybe the reloption
treat_as_catalog_table can become required_for_logical_decoding.
- I don't understand the comment in xl_heap_new_cid to the effect that
the combocid isn't needed for decoding.  How not?
- xlogreader.h includes an additional header with no other changes.
Doesn't seem right.
- relcache.h has a cuddled curly brace.

Review comments on 0003:

I have no problem with caching the primary key in the relcache, or
with using that as the default key for logical decoding, but I'm
extremely uncomfortable with the fallback strategy when no primary key
exists.  Choosing any old unique index that happens to present itself
as the primary key feels wrong to me.  The choice of key is
user-visible.  If we say, update the row with a = 1 to
(a,b,c)=(2,2,2), that's different than saying update the row with b =
1 to (a,b,c)=(2,2,2).  Suppose the previous contents of the target
table are (a,b,c)=(1,2,3) and (a,b,c)=(2,1,4).  You get different
answers depending on which you choose.  I think multi-master
replication just isn't going to work unless the two sides agree on the
key, and I think you'll get strange conflicts unless that key is
chosen by the user according to their business logic.

In single-master replication, being able to pick the key is clearly
not essential for correctness, but it's still desirable, because if
the system picks the "wrong" key, the change stream will in the end
get the database to the right state, but it may do so by "turning one
record into a different one" from the user's perspective.

All in all, it seems to me that we shouldn't try to punt.  Maybe we
should have something that works like ALTER TABLE name CLUSTER ON
index_name to configure which index should be used for logical
replication.  Possibly this same syntax could be used as ALTER
MATERIALIZED VIEW to set the candidate key for that case.

What happens if new unique indexes are created or old ones dropped
while logical replication is running?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.1

From
Andres Freund
Date:
Hi,

On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
> - AssignTransactionId changes "Mustn't" to "May not", which seems like
> an entirely pointless change.

It was "Musn't" before ;). I am not sure why I changed it to "May not"
instead of "Mustn't".

> - Do none of the callers of IsSystemRelation() care about the fact
> that you've considerably changed the semantics?

Afaics no. I think the semantics are actually consistent until somebody
manually creates a relation in pg_catalog (using allow_...). And in that
case the new semantics actually seem more useful.

> - RelationIsDoingTimetravel is still a crappy name.  How about
> RelationRequiredForLogicalDecoding?  And maybe the reloption
> treat_as_catalog_table can become required_for_logical_decoding.

Fine with me.

> - I don't understand the comment in xl_heap_new_cid to the effect that
> the combocid isn't needed for decoding.  How not?

We don't use the combocid for anything - since we have the original
cmin/cmax, we can just use those and ignore the value of the combocid itself.

> - xlogreader.h includes an additional header with no other changes.
> Doesn't seem right.

Hm. I seem to remember having a reason for that, but for the heck can't
see it anymore...

> I have no problem with caching the primary key in the relcache, or
> with using that as the default key for logical decoding, but I'm
> extremely uncomfortable with the fallback strategy when no primary key
> exists.  Choosing any old unique index that happens to present itself
> as the primary key feels wrong to me.
> [stuff I don't disagree with]

People lobbied vigorously to allow candidate keys before. I personally
would never want to use anything but an actual primary key for
replication, but there are other use cases than replication.

I think it's going to be the domain of the replication solution to
enforce the presence of primary keys. I.e. they should (be able to) use
event triggers or somesuch to enforce it...

> All in all, it seems to me that we shouldn't try to punt.  Maybe we
> should have something that works like ALTER TABLE name CLUSTER ON
> index_name to configure which index should be used for logical
> replication.  Possibly this same syntax could be used as ALTER
> MATERIALIZED VIEW to set the candidate key for that case.

I'd be fine with that, but I am also not particularly interested in it
because I personally don't see much of a usecase.
For replication ISTM the only case where there would be no primary key
is a) initial load b) replacing the primary key by another index.

> What happens if new unique indexes are created or old ones dropped
> while logical replication is running?

Should just work, but I'll make sure the tests cover this.

The output plugin needs to lookup the current index used, and it will
use a consistent syscache state and thus will find the same index.
In bdr the output plugin simply includes the name of the index used in
the replication stream to make sure things are somewhat consistent.

Will fix or think about the rest.

Thanks,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.1

From
Robert Haas
Date:
On Tue, Oct 1, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I have no problem with caching the primary key in the relcache, or
>> with using that as the default key for logical decoding, but I'm
>> extremely uncomfortable with the fallback strategy when no primary key
>> exists.  Choosing any old unique index that happens to present itself
>> as the primary key feels wrong to me.
>> [stuff I don't disagree with]
>
> People lobbied vigorously to allow candidate keys before. I personally
> would never want to use anything but an actual primary key for
> replication, but there's other usecases than replication.

I like allowing candidate keys; I just don't like assuming that any
old one we select will be as good as any other.

>> All in all, it seems to me that we shouldn't try to punt.  Maybe we
>> should have something that works like ALTER TABLE name CLUSTER ON
>> index_name to configure which index should be used for logical
>> replication.  Possibly this same syntax could be used as ALTER
>> MATERIALIZED VIEW to set the candidate key for that case.
>
> I'd be fine with that, but I am also not particularly interested in it
> because I personally don't see much of a usecase.
> For replication ISTM the only case where there would be no primary key
> is a) initial load b) replacing the primary key by another index.

The latter is the case I'd be principally concerned about.  I once had
to change the columns that formed the key for a table being used in a
production web application; fortunately, it has traditionally not
mattered much whether a unique index is the primary key, so creating a
new unique index and dropping the old primary key was good enough.
But I would have wanted to control the point at which we changed our
notion of what the candidate key was, I think.

One other thought: you could just log the whole old tuple if there's
no key available.  That would let this work on tables that don't have
indexes.  Replaying the changes might be horribly complex and slow,
but extracting them would work.  If a replication plugin got <old
tuple, new tuple> with no information on keys, it could find *a* tuple
(not all tuples) that matches the old tuple exactly and update each
column to the value from new tuple.  From a correctness point of view,
there's no issue there; it's all about efficiency.  But the user can
solve that problem whenever they like by indexing the destination
table.  It need not even be a unique index, so long as it's reasonably
selective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
> - It seems that HeapSatisfiesHOTandKeyUpdate is now
> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
> afraid that something unscalable is happening to this function.  On a
> related node, any overhead added here costs broadly; I'm not sure if
> there's enough to worry about.

Ok, I had to think a bit, but now I remember why I think these changes
are not really a problem: Neither the addition of keys nor candidate keys
will add any additional comparisons since the columns compared for
candidate keys are a subset of the set of key columns which in turn are a
subset of the columns checked for HOT. Right?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.2

From
Steve Singer
Date:
On 09/30/2013 06:44 PM, Andres Freund wrote:
> Hi,
>
> The series from friday was a bit too buggy - obviously I was too
> tired. So here's a new one:

With this series I've also noticed
#2  0x00000000007741a7 in ExceptionalCondition (
    conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask & 0x1000))",
    errorType=errorType@entry=0x7acc70 "FailedAssertion",
    fileName=fileName@entry=0x91767e "tqual.c",
    lineNumber=lineNumber@entry=1608) at assert.c:54
54        abort();
#3  0x00000000007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (
    htup=0x10bfe48, snapshot=0x108b3d8, buffer=310) at tqual.c:1608
#4  0x000000000049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
    relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
    heapTuple=heapTuple@entry=0x10bfe48,
    all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
    first_call=1 '\001') at heapam.c:1756
#5  0x00000000004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8)
    at indexam.c:539
#6  0x00000000004a82a8 in index_getnext (scan=0x10bfdf8,
    direction=direction@entry=ForwardScanDirection) at indexam.c:622
#7  0x00000000004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48)
    at genam.c:343
#8  0x000000000076df40 in RelidByRelfilenode (reltablespace=0,
    relfilenode=529775) at relfilenodemap.c:214
#9  0x0000000000664ad7 in ReorderBufferCommit (rb=0x1082d98,
    xid=<optimized out>, commit_lsn=4638756800, end_lsn=<optimized out>,
    commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320
 

In addition to some of the other ones I've posted about.


> * fix pg_recvlogical makefile (Thanks Steve)
> * fix two commits not compiling properly without later changes (Thanks Kevin)
> * keep track of commit timestamps
> * fix bugs with option passing in test_logical_decoding
> * actually parse option values in test_decoding instead of just using the
>    option name
> * don't use anonymous structs in unions. That's compiler specific (msvc
>    and gcc) before C11 on which we can't rely. That unfortunately will
>    break output plugins because ReorderBufferChange need to qualify
>    old/new tuples now
> * improve error handling/cleanup in test_logical_decoding
> * some minor cleanups
>
> Patches attached, git tree updated.
>
> Greetings,
>
> Andres Freund
>
>
>




Re: logical changeset generation v6.1

From
Robert Haas
Date:
On Tue, Oct 1, 2013 at 1:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
>> - It seems that HeapSatisfiesHOTandKeyUpdate is now
>> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
>> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
>> afraid that something unscalable is happening to this function.  On a
>> related node, any overhead added here costs broadly; I'm not sure if
>> there's enough to worry about.
>
> Ok, I had to think a bit, but now I remember why I think these changes
> are not really problem: Neither the addition of keys nor candidate keys
> will add any additional comparisons since the columns compared for
> candidate keys are a subset of the set of key columns which in turn are a
> subset of the columns checked for HOT. Right?

TBH, my primary concern was with maintainability more than performance.

On performance, I think any time you add code it's going to cost
somehow.  However, it might not be enough to care about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-10-02 10:56:38 -0400, Robert Haas wrote:
> On Tue, Oct 1, 2013 at 1:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
> >> - It seems that HeapSatisfiesHOTandKeyUpdate is now
> >> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
> >> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
> >> afraid that something unscalable is happening to this function.  On a
> >> related node, any overhead added here costs broadly; I'm not sure if
> >> there's enough to worry about.
> >
> > Ok, I had to think a bit, but now I remember why I think these changes
> > are not really problem: Neither the addition of keys nor candidate keys
> > will add any additional comparisons since the columns compared for
> > candidate keys are a subset of the set of key columns which in turn are a
> > subset of the columns checked for HOT. Right?
> 
> TBH, my primary concern was with maintainability more than performance.
> 
> On performance, I think any time you add code it's going to cost
> somehow.  However, it might not be enough to care about.

The easy alternative seems to be to call such a function multiple times
- which I think is prohibitive from a performance POV. More radically we
could simply compute the overall set/bitmap of differing columns and
then use bms_is_subset() to determine whether any index columns/key/ckey
columns changed. But that will do comparisons we don't do today...
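
As a sketch of that more radical variant (the bitmapsets of attribute numbers
and how they are computed are assumed to come from the caller; bms_overlap()
is an existing bitmapset primitive and is equivalent here to the
bms_is_subset() test against the unchanged columns):

#include "postgres.h"
#include "nodes/bitmapset.h"

/*
 * Sketch only: the caller computes 'changed_attrs' (attribute numbers whose
 * values differ between old and new tuple) once; each per-column-set
 * question then becomes a cheap bitmap test.
 */
static void
check_modified_columns(Bitmapset *changed_attrs,
					   Bitmapset *hot_attrs,	/* all indexed columns */
					   Bitmapset *key_attrs,	/* "key" columns */
					   Bitmapset *ckey_attrs,	/* candidate key columns */
					   bool *satisfies_hot,
					   bool *key_unchanged,
					   bool *ckey_unchanged)
{
	*satisfies_hot = !bms_overlap(changed_attrs, hot_attrs);
	*key_unchanged = !bms_overlap(changed_attrs, key_attrs);
	*ckey_unchanged = !bms_overlap(changed_attrs, ckey_attrs);
}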

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.1

From
Robert Haas
Date:
On Wed, Oct 2, 2013 at 11:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-02 10:56:38 -0400, Robert Haas wrote:
>> On Tue, Oct 1, 2013 at 1:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
>> >> - It seems that HeapSatisfiesHOTandKeyUpdate is now
>> >> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
>> >> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
>> >> afraid that something unscalable is happening to this function.  On a
>> >> related node, any overhead added here costs broadly; I'm not sure if
>> >> there's enough to worry about.
>> >
>> > Ok, I had to think a bit, but now I remember why I think these changes
>> > are not really problem: Neither the addition of keys nor candidate keys
>> > will add any additional comparisons since the columns compared for
>> > candidate keys are a subset of the set of key columns which in turn are a
>> > subset of the columns checked for HOT. Right?
>>
>> TBH, my primary concern was with maintainability more than performance.
>>
>> On performance, I think any time you add code it's going to cost
>> somehow.  However, it might not be enough to care about.
>
> The easy alternative seems to be to call such a function multiple times
> - which I think is prohibitive from a performance POV. More radically we
> could simply compute the overall set/bitmap of differening columns and
> then use bms_is_subset() to determine whether any index columns/key/ckey
> columns changed. But that will do comparisons we don't do today...

Yeah, there may be no better alternative to doing things as you've
done them here.  It just looks grotty, so I was hoping we had a better
idea.  Maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-10-02 11:06:59 -0400, Robert Haas wrote:
> On Wed, Oct 2, 2013 at 11:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-02 10:56:38 -0400, Robert Haas wrote:
> >> On Tue, Oct 1, 2013 at 1:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
> >> >> - It seems that HeapSatisfiesHOTandKeyUpdate is now
> >> >> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
> >> >> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
> >> >> afraid that something unscalable is happening to this function.  On a
> >> >> related node, any overhead added here costs broadly; I'm not sure if
> >> >> there's enough to worry about.
> >> >
> >> > Ok, I had to think a bit, but now I remember why I think these changes
> >> > are not really problem: Neither the addition of keys nor candidate keys
> >> > will add any additional comparisons since the columns compared for
> >> > candidate keys are a subset of the set of key columns which in turn are a
> >> > subset of the columns checked for HOT. Right?
> >>
> >> TBH, my primary concern was with maintainability more than performance.
> >>
> >> On performance, I think any time you add code it's going to cost
> >> somehow.  However, it might not be enough to care about.
> >
> > The easy alternative seems to be to call such a function multiple times
> > - which I think is prohibitive from a performance POV. More radically we
> > could simply compute the overall set/bitmap of differening columns and
> > then use bms_is_subset() to determine whether any index columns/key/ckey
> > columns changed. But that will do comparisons we don't do today...
> 
> Yeah, there may be no better alternative to doing things as you've
> done them here.  It just looks grotty, so I was hoping we had a better
> idea.  Maybe not.

Imo the code now looks easier to understand - which is not saying much -
than in 9.3/HEAD...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-01 16:11:47 -0400, Steve Singer wrote:
> On 09/30/2013 06:44 PM, Andres Freund wrote:
> >Hi,
> >
> >The series from friday was a bit too buggy - obviously I was too
> >tired. So here's a new one:
> 
> With this series I've also noticed
> #2  0x00000000007741a7 in ExceptionalCondition (
>     conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask &
> 0x1000))", errorType=errorType@entry=0x7acc70 "FailedAssertion",
>     fileName=fileName@entry=0x91767e "tqual.c",
>     lineNumber=lineNumber@entry=1608) at assert.c:54
> 54        abort();
> 
> 
>  0x00000000007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (
>     htup=0x10bfe48, snapshot=0x108b3d8, buffer=310) at tqual.c:1608
> #4  0x000000000049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
>     relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
>     heapTuple=heapTuple@entry=0x10bfe48,
>     all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
>     first_call=1 '\001') at heapam.c:1756
> #5  0x00000000004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8)
>     at indexam.c:539
> #6  0x00000000004a82a8 in index_getnext (scan=0x10bfdf8,
>     direction=direction@entry=ForwardScanDirection) at indexam.c:622
> #7  0x00000000004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48)
>     at genam.c:343
> #8  0x000000000076df40 in RelidByRelfilenode (reltablespace=0,
>     relfilenode=529775) at relfilenodemap.c:214
> #9  0x0000000000664ad7 in ReorderBufferCommit (rb=0x1082d98,
>     xid=<optimized out>, commit_lsn=4638756800, end_lsn=<optimized out>,
>     commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320

Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog
tables?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.2

From
Steve Singer
Date:
On 10/03/2013 12:38 PM, Andres Freund wrote:
> Does your code use SELECT FOR UPDATE/SHARE on system or 
> treat_as_catalog tables? Greetings, Andres Freund 

Yes.
It declares sl_table and sl_sequence and sl_set as catalog.

It does a
SELECT ......
    from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S,
         "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN,
         "pg_catalog".pg_index PGX, "pg_catalog".pg_class PGXC
where ... for update

in the code being executed by the 'set add table'.

(We also do select for update commands in many other places during 
cluster configuration commands)




Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-03 13:03:07 -0400, Steve Singer wrote:
> On 10/03/2013 12:38 PM, Andres Freund wrote:
> >Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog
> >tables? Greetings, Andres Freund
> 
> Yes.
> It declares sl_table and sl_sequence and sl_set as catalog.
> 
> It does a
> SELECT ......
>     from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S,
>                 "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN,
>                 "pg_catalog".pg_index PGX, "pg_catalog".pg_class PGXC
> where ... for update
> 
> in the code being executed by the 'set add table'.
> 
> (We also do select for update commands in many other places during cluster
> configuration commands)

Ok, there were a couple of bugs because I thought mxacts wouldn't need
to be supported. So far your testcase doesn't crash the database
anymore - it spews some internal errors though, so I am not sure if it's
entirely fixed for you.

Thanks for testing and helping!

I've pushed the changes to the git tree, they aren't squashed yet and
there's some further outstanding stuff, so I won't repost the series yet.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.1

From
Andres Freund
Date:
Hi,

On 2013-10-01 10:07:19 -0400, Robert Haas wrote:
> - It seems that HeapSatisfiesHOTandKeyUpdate is now
> HeapSatisfiesHOTandKeyandCandidateKeyUpdate.  Considering I think this
> was merely HeapSatisfiesHOTUpdate a year ago, it's hard not to be
> afraid that something unscalable is happening to this function.  On a
> related node, any overhead added here costs broadly; I'm not sure if
> there's enough to worry about.

I haven't changed anything here - ISTM so far nobody had a better
suggestion.

> - RelationIsDoingTimetravel is still a crappy name.  How about
> RelationRequiredForLogicalDecoding?  And maybe the reloption
> treat_as_catalog_table can become required_for_logical_decoding.

Hm. I don't really like the name; "required" seems to imply that it's
necessary to turn this on to get data replicated in that relation. How
about "accessible_during_logical_decoding" or "user_catalog_table"? The
latter would allow us to use it to add checks for user relations used in
indexes which need a treatment similar to enums.

> All in all, it seems to me that we shouldn't try to punt.  Maybe we
> should have something that works like ALTER TABLE name CLUSTER ON
> index_name to configure which index should be used for logical
> replication.  Possibly this same syntax could be used as ALTER
> MATERIALIZED VIEW to set the candidate key for that case.

How about using the current logic by default but allow to tune it
additionally with an option like that?

So, attached is the new version:
Changes:
* Fix issues you noticed except the above
* Handle multixacts on system tables
* Logical slots now are checksummed and contain a version and length
* Improve logic for increasing the "restart lsn", the point where we
  start to read WAL to decode from next time round
* Wait for xids in snapbuild, during the initial build
* s/RelationIsDoingTimetravel/RelationRequiredForLogicalDecoding/
* test_logical_decoding: confirm reception of changes at the end
* prohibit rewriting schema changes for treat_as_catalog_table relations
* add tests for dropping/adding primary/candidate keys
* PROCESS_INTERRUPTS when reading WAL for the SQL SRF
* cleanup old serialized snapshots at check/restart points
* Add more isolationtester changes

Todo:
* rename treat_as_catalog_table, after agreeing on the new name
* rename remaining timetravel function names
* restrict SuspendDecodingSnapshots usage to RelationInitPhysicalAddr,
  that ought to be enough.
* add InLogicalDecoding() function.
* throw away older data when reading xl_running_xacts records, to deal
  with immediate shutdowns/crashes

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.2

From
Steve Singer
Date:
On 10/03/2013 04:00 PM, Andres Freund wrote:
> Ok, there were a couple of bugs because I thought mxacts wouldn't need 
> to be supported. So far your testcase doesn't crash the database 
> anymore - it spews some internal errors though, so I am not sure if 
> it's entirely fixed for you. Thanks for testing and helping! I've 
> pushed the changes to the git tree, they aren't squashed yet and 
> there's some further outstanding stuff, so I won't repost the series 
> yet. Greetings, Andres Freund 
When I run your updated version (from Friday, not what you posted today)
against a more recent version of my slony changes I can get the test
case to pass 2/3rds of the time.  The failures are due to an issue in
slon itself that I need to fix.

I see lots of
0LOG:  tx with subtxn 58836

but they seem harmless.

Thanks




Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-07 09:56:11 -0400, Steve Singer wrote:
> On 10/03/2013 04:00 PM, Andres Freund wrote:
> >Ok, there were a couple of bugs because I thought mxacts wouldn't need to
> >be supported. So far your testcase doesn't crash the database anymore - it
> >spews some internal errors though, so I am not sure if it's entirely fixed
> >for you. Thanks for testing and helping! I've pushed the changes to the
> >git tree, they aren't squashed yet and there's some further outstanding
> >stuff, so I won't repost the series yet. Greetings, Andres Freund
> When I run your updated version (from friday, not what you posted today)
> against a more recent version of my slony changes I can get the test case to
> pass 2/3 'rd of the time.  The failures are due to an issue in slon itself
> that I need to fix.

Cool.

> I see lots of
> 0LOG:  tx with subtxn 58836

Yes, those are completely harmless. And should, in fact, be removed. I
guess I should add the todo entry:
* make a pass over all elog/ereport calls and make sure they have the correct log level et al.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.1

From
Robert Haas
Date:
On Mon, Oct 7, 2013 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> - RelationIsDoingTimetravel is still a crappy name.  How about
>> RelationRequiredForLogicalDecoding?  And maybe the reloption
>> treat_as_catalog_table can become required_for_logical_decoding.
>
> Hm. I don't really like the name, required seems to imply that it's
> necessary to turn this on to get data replicated in that relation. How
> about "accessible_during_logical_decoding" or "user_catalog_table"? The
> latter would allow us to use it to add checks for user relations used in
> indexes which need a treatment similar to enums.

user_catalog_table is a pretty good description, but should we worry
about the fact that logical replication isn't mentioned in there
anywhere?

In what way do you feel that it's more clear to say *accessible
during* rather than *required for* logical decoding?

I was trying to make the naming consistent; i.e. if we have
RelationRequiredForLogicalDecoding then name the option to match.

>> All in all, it seems to me that we shouldn't try to punt.  Maybe we
>> should have something that works like ALTER TABLE name CLUSTER ON
>> index_name to configure which index should be used for logical
>> replication.  Possibly this same syntax could be used as ALTER
>> MATERIALIZED VIEW to set the candidate key for that case.
>
> How about using the current logic by default but allow to tune it
> additionally with an option like that?

I'm OK with defaulting to the primary key if there is one, but I think
that no other candidate key should be entertained unless the user
configures it.  I think the behavior we get without that will be just
too weird.  We could use the same logic you're proposing here for
CLUSTER, too, but we don't; that's because we've (IMHO, rightly)
decided that the choice of index is too important to be left to
chance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.1

From
Steve Singer
Date:
On 10/07/2013 09:32 AM, Andres Freund wrote:
> Todo:
> * rename treat_as_catalog_table, after agreeing on the new name
> * rename remaining timetravel function names
> * restrict SuspendDecodingSnapshots usage to RelationInitPhysicalAddr,
>    that ought to be enough.
> * add InLogicalDecoding() function.
> * throw away older data when reading xl_running_xacts records, to deal
>    with immediate shutdowns/crashes

What is your current plan for decoding sequence updates?  Is this 
something that you were going to hold off on supporting until a future 
version? (I know this was discussed a while ago but I don't remember 
where it stands now.)

From a Slony point of view this isn't a big deal; I can continue to 
capture sequence changes in sl_seqlog when I create each SYNC event and 
then just replicate the INSERT statements in sl_seqlog via logical 
decoding.  I can see why someone building a replication system not based 
on the concept of a SYNC would have a harder time with this.

I am guessing we would want to pass sequence operations to the plugins 
as we encounter the WAL for them out-of-band of any transaction.   This 
would mean that a set of operations like

begin;
insert into a (id) values(4);
insert into a (id) values(nextval('some_seq'));
commit;

would be replayed on the replicas as
select setval('some_seq', 100);
begin;
insert into a (id) values (4);
insert into a (id) values (100);
commit;





Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-10-08 15:02:39 -0400, Steve Singer wrote:
> On 10/07/2013 09:32 AM, Andres Freund wrote:
> >Todo:
> >* rename treat_as_catalog_table, after agreeing on the new name
> >* rename remaining timetravel function names
> >* restrict SuspendDecodingSnapshots usage to RelationInitPhysicalAddr,
> >   that ought to be enough.
> >* add InLogicalDecoding() function.
> >* throw away older data when reading xl_running_xacts records, to deal
> >   with immediate shutdowns/crashes
> 
> What is your current plan for decoding sequence updates?  Is this something
> that you were going to hold-off on supporting till a future version? ( know
> this was discussed a while ago but I don't remember where it stands now)

I don't plan to implement it as part of this - the optimizations in
sequences make it really unsuitable for that (non-transactional, allocated
in bulk, ...).
Simon had previously posted about "sequence AMs", and I have a prototype
patch that implements that concept (which needs considerable cleanup). I
plan to post about it whenever this is finished.

I think many replication solutions that care about sequences in a
nontrivial way will want to implement their own sequence logic anyway, so I
think that's not a bad path.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.1

From
Andres Freund
Date:
On 2013-10-08 12:20:22 -0400, Robert Haas wrote:
> On Mon, Oct 7, 2013 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> - RelationIsDoingTimetravel is still a crappy name.  How about
> >> RelationRequiredForLogicalDecoding?  And maybe the reloption
> >> treat_as_catalog_table can become required_for_logical_decoding.
> >
> > Hm. I don't really like the name, required seems to imply that it's
> > necessary to turn this on to get data replicated in that relation. How
> > about "accessible_during_logical_decoding" or "user_catalog_table"? The
> > latter would allow us to use it to add checks for user relations used in
> > indexes which need a treatment similar to enums.
> 
> user_catalog_table is a pretty good description, but should we worry
> about the fact that logical replication isn't mentioned in there
> anywhere?

I personally don't worry about it, although I see why somebody could.

> In what way do you feel that it's more clear to say *accessible
> during* rather than *required for* logical decoding?

Because "required for" can easily be understood that you need to set it
if you want a table's changes to be replicated. Which is not the case...

> I was trying to make the naming consistent; i.e. if we have
> RelationRequiredForLogicalDecoding then name the option to match.

Maybe this should be RelationAccessibleInLogicalDecoding() then - that
seems like a better description anyway?

> >> All in all, it seems to me that we shouldn't try to punt.  Maybe we
> >> should have something that works like ALTER TABLE name CLUSTER ON
> >> index_name to configure which index should be used for logical
> >> replication.  Possibly this same syntax could be used as ALTER
> >> MATERIALIZED VIEW to set the candidate key for that case.
> >
> > How about using the current logic by default but allow to tune it
> > additionally with an option like that?
> 
> I'm OK with defaulting to the primary key if there is one, but I think
> that no other candidate key should be entertained unless the user
> configures it.  I think the behavior we get without that will be just
> too weird.  We could use the same logic you're proposing here for
> CLUSTER, too, but we don't; that's because we've (IMHO, rightly)
> decided that the choice of index is too important to be left to
> chance.

I don't understand why this would be a good path. If you DELETE/UPDATE
and you don't have a primary key, the current behaviour gives you
something that definitely identifies the row. It might not be the best
thing, but it sure is better than nothing. E.g. for auditing it's
probably quite sufficient to just use any of the candidate keys if
there (temporarily) is no primary key.
If you implement a replication solution and don't want that behaviour
there, you are free to guard against it there - which is a good thing
to do.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Sep 30, 2013 at 6:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The series from friday was a bit too buggy - obviously I was too
> tired. So here's a new one:
>
> * fix pg_recvlogical makefile (Thanks Steve)
> * fix two commits not compiling properly without later changes (Thanks Kevin)
> * keep track of commit timestamps
> * fix bugs with option passing in test_logical_decoding
> * actually parse option values in test_decoding instead of just using the
>   option name
> * don't use anonymous structs in unions. That's compiler specific (msvc
>   and gcc) before C11 on which we can't rely. That unfortunately will
>   break output plugins because ReorderBufferChange need to qualify
>   old/new tuples now
> * improve error handling/cleanup in test_logical_decoding
> * some minor cleanups
>
> Patches attached, git tree updated.

I spent some time looking at the sample plugin (patch 9/12).  Here are
some review comments:

- I think that the decoding plugin interface should work more like the
foreign data wrapper interface.  Instead of using pg_dlsym to look up
fixed names, I think there should be a struct of function pointers
that gets filled in and registered somehow.

- pg_decode_init() only warns when it encounters an unknown option.
An error seems more appropriate.

- Still wondering how we'll use this from a bgworker.

- The output format doesn't look very machine-parseable.   I really
think we ought to provide something that is.  Maybe a CSV-like format,
or maybe something else, but I don't see why someone who wants to do
change logging should be forced to write and install C code.  If
something like Bucardo can run on an unmodified system and extract
change-sets this way without needing a .so file, that's going to be a
huge win for usability.

Other than that, I don't have too many concerns about the plugin
interface.  I think it provides useful flexibility and it generally
seems well-designed.  I hope in the future we'll be able to decode
transactions on the fly instead of waiting until commit time, but I've
resigned myself to the fact that we may not get that in version one.

More generally on this patch set, if I'm going to be committing any of
this, I'd prefer to start with what is currently patches 3 and 4, once
we reach agreement on those.

Are we hoping to get any of this committed for this CF?  If so, let's
make a plan to get that done; time is short.  If not, let's update the
CF app accordingly.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
Hi Robert,

On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
> I spent some time looking at the sample plugin (patch 9/12).  Here are
> some review comments:
> 
> - I think that the decoding plugin interface should work more like the
> foreign data wrapper interface.  Instead of using pg_dlsym to look up
> fixed names, I think there should be a struct of function pointers
> that gets filled in and registered somehow.

You mean something like CREATE OUTPUT PLUGIN registering a function with
an INTERNAL return value returning a filled struct? I thought about
that, but it seemed more complex. Happy to change it though if it's
preferred.

> - pg_decode_init() only warns when it encounters an unknown option.
> An error seems more appropriate.

Fine with me. I think I just made it a warning because I wanted to
experiment with options.

> - Still wondering how we'll use this from a bgworker.

Simplified code to consume data:

LogicalDecodingReAcquireSlot(NameStr(*name));
ctx = CreateLogicalDecodingContext(MyLogicalDecodingSlot,
                                   false /* not initial call */,
                                   MyLogicalDecodingSlot->confirmed_flush,
                                   options,
                                   logical_read_local_xlog_page,
                                   LogicalOutputPrepareWrite,
                                   LogicalOutputWrite);
...
while (true)
{
    XLogRecord *record;
    char       *errm = NULL;

    record = XLogReadRecord(ctx->reader, startptr, &errm);
    ...
    DecodeRecordIntoReorderBuffer(ctx, &buf);
}

/* at the end, or better after every commit or such */
LogicalConfirmReceivedLocation(/* whatever you consumed */);

LogicalDecodingReleaseSlot();


> - The output format doesn't look very machine-parseable.   I really
> think we ought to provide something that is.  Maybe a CSV-like format,
> or maybe something else, but I don't see why someone who wants to do
> change logging should be forced to write and install C code.  If
> something like Bucardo can run on an unmodified system and extract
> change-sets this way without needing a .so file, that's going to be a
> huge win for usability.

We can change the current format but I really see little to no chance of
agreeing on a replication format that's serviceable to several solutions
short term. Once we've gained some experience - maybe even this cycle -
that might be different.

> More generally on this patch set, if I'm going to be committing any of
> this, I'd prefer to start with what is currently patches 3 and 4, once
> we reach agreement on those.

Sounds like a reasonable start.

> Are we hoping to get any of this committed for this CF?  If so, let's
> make a plan to get that done; time is short.  If not, let's update the
> CF app accordingly.

I'd really like to do so. I am travelling atm, but I will be back
tomorrow evening and will push an updated patch this weekend. The issue
I know of in the latest patches at
http://www.postgresql.org/message-id/20131007133232.GA15202@awork2.anarazel.de
is renaming from http://www.postgresql.org/message-id/20131008194758.GB3718183@alap2.anarazel.de

Do you know of anything else in the patches you're referring to?

Thanks,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi Robert,
>
> On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
>> I spent some time looking at the sample plugin (patch 9/12).  Here are
>> some review comments:
>>
>> - I think that the decoding plugin interface should work more like the
>> foreign data wrapper interface.  Instead of using pg_dlsym to look up
>> fixed names, I think there should be a struct of function pointers
>> that gets filled in and registered somehow.
>
> You mean something like CREATE OUTPUT PLUGIN registering a function with
> an INTERNAL return value returning a filled struct? I thought about
> that, but it seemed more complex. Happy to change it though if it's
> preferred.

I don't see any need for SQL syntax.  I was just thinking that the
_PG_init function could fill in a structure and then call
RegisterLogicalReplicationOutputPlugin(&mystruct).
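
To illustrate, a minimal sketch of that idea (RegisterLogicalReplicationOutputPlugin()
and the callback struct don't exist anywhere yet, and the callback names are made up
to match the sample plugin):

/* all names below are hypothetical - just illustrating the registration idea */
static LogicalOutputPluginCallbacks my_callbacks;

void
_PG_init(void)
{
    /* fill in the callbacks once, then hand the struct to the core code */
    my_callbacks.init_cb   = pg_decode_init;
    my_callbacks.change_cb = pg_decode_change;
    my_callbacks.commit_cb = pg_decode_commit;

    RegisterLogicalReplicationOutputPlugin(&my_callbacks);
}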

>> - Still wondering how we'll use this from a bgworker.
>
> Simplified code to consume data:

Cool.  As long as that use case is supported, I'm happy; I just want
to make sure we're not presuming that there must be an external
client.

>> - The output format doesn't look very machine-parseable.   I really
>> think we ought to provide something that is.  Maybe a CSV-like format,
>> or maybe something else, but I don't see why someone who wants to do
>> change logging should be forced to write and install C code.  If
>> something like Bucardo can run on an unmodified system and extract
>> change-sets this way without needing a .so file, that's going to be a
>> huge win for usability.
>
> We can change the current format but I really see little to no chance of
> agreeing on a replication format that's serviceable to several solutions
> short term. Once we've gained some experience - maybe even this cycle -
> that might be different.

I don't see why you're so pessimistic about that.  I know you haven't
worked it out yet, but what makes this harder than sitting down and
designing something?

>> More generally on this patch set, if I'm going to be committing any of
>> this, I'd prefer to start with what is currently patches 3 and 4, once
>> we reach agreement on those.
>
> Sounds like a reasonable start.

Perhaps you could reshuffle the order of the series, if it's not too much work.

>> Are we hoping to get any of this committed for this CF?  If so, let's
>> make a plan to get that done; time is short.  If not, let's update the
>> CF app accordingly.
>
> I'd really like to do so. I am travelling atm, but I will be back
> tomorrow evening and will push an updated patch this weekend. The issue
> I know of in the latest patches at
> http://www.postgresql.org/message-id/20131007133232.GA15202@awork2.anarazel.de
> is renaming from http://www.postgresql.org/message-id/20131008194758.GB3718183@alap2.anarazel.de

I'm a bit nervous about the way the combo CID logging works.  I would have
thought that you would emit one record per combo CID, but what you're
apparently doing is emitting one record per heap tuple that uses a
combo CID.  For some reason that feels like an abuse (and maybe kinda
inefficient, too).

Either way, I also wonder what happens if a (logical?) checkpoint
occurs between the combo CID record and the heap record to which it
refers, or how you prevent that from happening.  What if the combo CID
record is written and the transaction aborts before writing the heap
record (maybe without writing an abort record to WAL)?

What are the performance implications of this additional WAL logging?
What's the worst case?  What's the typical case?  Does it have a
noticeable overhead when wal_level < logical?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
Hi,

On 2013-10-11 09:08:43 -0400, Robert Haas wrote:
> On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
> >> I spent some time looking at the sample plugin (patch 9/12).  Here are
> >> some review comments:
> >>
> >> - I think that the decoding plugin interface should work more like the
> >> foreign data wrapper interface.  Instead of using pg_dlsym to look up
> >> fixed names, I think there should be a struct of function pointers
> >> that gets filled in and registered somehow.
> >
> > You mean something like CREATE OUTPUT PLUGIN registering a function with
> > an INTERNAL return value returning a filled struct? I thought about
> > that, but it seemed more complex. Happy to change it though if it's
> > preferred.
> 
> I don't see any need for SQL syntax.  I was just thinking that the
> _PG_init function could fill in a structure and then call
> RegisterLogicalReplicationOutputPlugin(&mystruct).

Hm. We can do that, but what'd be the advantage of that? The current
model will correctly handle things like a 'shared_preload_libraries'ed
output plugin, because its _PG_init() will not register it. With the
handling done in _PG_init() there would be two.
Being able to use the same .so for output plugin handling and some other
replication solution specific stuff is imo useful.

> >> - Still wondering how we'll use this from a bgworker.
> >
> > Simplified code to consume data:
> 
> Cool.  As long as that use case is supported, I'm happy; I just want
> to make sure we're not presuming that there must be an external
> client.

The included testcases are written using the SQL SRF interface, which in
turn is a use case that doesn't use walsenders and such, so I hope we
won't break it accidentally ;)

> >> - The output format doesn't look very machine-parseable.   I really
> >> think we ought to provide something that is.  Maybe a CSV-like format,
> >> or maybe something else, but I don't see why someone who wants to do
> >> change logging should be forced to write and install C code.  If
> >> something like Bucardo can run on an unmodified system and extract
> >> change-sets this way without needing a .so file, that's going to be a
> >> huge win for usability.
> >
> > We can change the current format but I really see little to no chance of
> > agreeing on a replication format that's serviceable to several solutions
> > short term. Once we've gained some experience - maybe even this cycle -
> > that might be different.
> 
> I don't see why you're so pessimistic about that.  I know you haven't
> worked it out yet, but what makes this harder than sitting down and
> designing something?

Because every replication solution has different requirements for the
format and they will want to filter the output stream with regard to their
own configuration.
E.g. bucardo will want to include the transaction timestamp for conflict
resolution and such.

> >> More generally on this patch set, if I'm going to be committing any of
> >> this, I'd prefer to start with what is currently patches 3 and 4, once
> >> we reach agreement on those.
> >
> > Sounds like a reasonable start.
> 
> Perhaps you could reshuffle the order of the series, if it's not too much work.

Sure, that's no problem. Do I understand correctly that you'd like
wal_decoding: Add information about a tables primary key to struct RelationData
wal_decoding: Add wal_level = logical and log data required for logical decoding

earlier?

> > I'd really like to do so. I am travelling atm, but I will be back
> > tomorrow evening and will push an updated patch this weekend. The issue
> > I know of in the latest patches at
> > http://www.postgresql.org/message-id/20131007133232.GA15202@awork2.anarazel.de
> > is renaming from http://www.postgresql.org/message-id/20131008194758.GB3718183@alap2.anarazel.de
> 
> I'm a bit nervous about the way the combo CID logging.  I would have
> thought that you would emit one record per combo CID, but what you're
> apparently doing is emitting one record per heap tuple that uses a
> combo CID.

I thought that and implemented it in the beginning. Unfortunately it's not
enough :(. That's probably the issue that took me the longest to understand
in this patch series...

Combocids can only fix the case where a transaction actually has created
a combocid:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)

So, if we're decoding data that needs to look up those rows in TX1 or TX2
we need access to cmin and cmax both times, but neither transaction will
have created a multixact. That can only be an issue in transactions with
catalog modifications.

A slightly more complex variant also requires this if combocids are
involved:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
2) TX1: SAVEPOINT foo;
3) TX1-2: UPDATE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = 55, cmax = 56, combo=123)
                      new at 0/1: (xmin = 2, xmax=Invalid, cmin = 57, cmax = Invalid)
 
4) TX1-2: ROLLBACK TO foo;
5) TX3: DELETE id = 1 at 0/1: (xmin = 1, xmax=3, cmin = Invalid, cmax = 1)

If you're decoding data that's been inserted after 1) you still need to
see cmin = 55. At 5) you need to see cmin = Invalid.

So just remembering the correct value for each tuple that's needed for a
specific transaction seems like the simplest way here.

> Either way, I also wonder what happens if a (logical?) checkpoint
> occurs between the combo CID record and the heap record to which it
> refers, or how you prevent that from happening.

Logical checkpoints contain the so-called 'restart_decoding' LSN, which
is defined to be the LSN at which we can restart reading WAL and are
guaranteed to be able to decode all transactions that haven't been
confirmed as received.
Normal checkpoints shouldn't play any role here.

> What if the combo CID record is written and the transaction aborts
> before writing the heap record (maybe without writing an abort record
> to WAL)?

In the currently implemented model, where we log (relfilenode, ctid,
cmin, cmax), we only ever need access to those rows when decoding data
changes from *within* the catalog-modifying toplevel transaction. Never
in other toplevel transactions.
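
For clarity, what gets logged per affected catalog tuple is roughly of
this shape (an illustrative sketch only - the name and layout are not the
exact WAL record from the patch):

/* illustrative sketch only, not the patch's actual record */
typedef struct xl_tuplecid_sketch
{
    TransactionId   top_xid;    /* toplevel xid modifying the catalog */
    RelFileNode     node;       /* relfilenode of the catalog relation */
    ItemPointerData tid;        /* ctid of the affected tuple */
    CommandId       cmin;
    CommandId       cmax;
    CommandId       combocid;   /* if one was assigned */
} xl_tuplecid_sketch;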

> What are the performance implications of this additional WAL logging?
> What's the worst case?

I haven't been able to notice any difference above jitter for anything
that touches actual relations. The overhead there is far bigger than
that single XLogInsert().
Maybe there's something that doesn't interact with much of the system
where the effort is noticeable and which is actually relevant for
performance? I couldn't really think of anything.

> noticeable overhead when wal_level < logical?

Couldn't measure anything either, which is not surprising given that I
couldn't measure the overhead in the first place.

I've done some parallel INSERT/DELETE pgbenching around the
wal_level=logical and I couldn't measure any overhead with it
disabled. With wal_level = logical, UPDATEs and DELETEs do get a bit
slower, but that's to be expected.

It'd probably not hurt to redo those benchmarks to make sure...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Andres Freund
Date:
Hi,

Attached you can find version 6.4 of the patchset:

* reordered so the patches Robert wants to apply first are first
* renamed treat_as_catalog_table to user_catalog_table
* renamed RelationRequiredForLogicalDecoding to RelationIsAccessibleInLogicalDecoding
* moved two hunks to better fitting patches

I am working on the longer TODOs from the last version now, but they
don't affect the first patches.

Greetings,

Andres Freund

PS: git rebase -i -x /path/to/testscript is cool

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't see any need for SQL syntax.  I was just thinking that the
>> _PG_init function could fill in a structure and then call
>> RegisterLogicalReplicationOutputPlugin(&mystruct).
>
> Hm. We can do that, but what'd be the advantage of that? The current
> model will correctly handle things like a'shared_preload_libraries'ed
> output plugin, because its _PG_init() will not register it. With the
> handling done in _PG_init() there would be two.
> Being able to use the same .so for output plugin handling and some other
> replication solution specific stuff is imo useful.

Well, I just think relying on specific symbol names in the .so file is
kind of unfortunate.  It means that, for example, you can't have
multiple output plugins provided by a single .so.  And in general I
think it's something that we've tried to minimize.

>> I don't see why you're so pessimistic about that.  I know you haven't
>> worked it out yet, but what makes this harder than sitting down and
>> designing something?
>
> Because every replication solution has different requirements for the
> format and they will want filter the output stream with regard to their
> own configuration.
> E.g. bucardo will want to include the transaction timestamp for conflict
> resolution and such.

But there's only so much information available here.  Why not just
have a format that logs it all?

> Sure, that's no problem. Do I understand correctly that you'd like
> wal_decoding: Add information about a tables primary key to struct RelationData
> wal_decoding: Add wal_level = logical and log data required for logical decoding
>
> earlier?

Yes.

>> > I'd really like to do so. I am travelling atm, but I will be back
>> > tomorrow evening and will push an updated patch this weekend. The issue
>> > I know of in the latest patches at
>> > http://www.postgresql.org/message-id/20131007133232.GA15202@awork2.anarazel.de
>> > is renaming from http://www.postgresql.org/message-id/20131008194758.GB3718183@alap2.anarazel.de
>>
>> I'm a bit nervous about the way the combo CID logging.  I would have
>> thought that you would emit one record per combo CID, but what you're
>> apparently doing is emitting one record per heap tuple that uses a
>> combo CID.
>
> I thought and implemented that in the beginning. Unfortunately it's not
> enough :(. That's probably the issue that took me longest to understand
> in this patchseries...
>
> Combocids can only fix the case where a transaction actually has create
> a combocid:
>
> 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
> 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)
>
> So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> we both times need access to cmin and cmax, but neither transaction will
> have created a multixact. That can only be an issue in transaction with
> catalog modifications.

Oh, yuck.  So that means you have to write an extra WAL record for
EVERY heap insert, update, or delete to a catalog table?  OUCH.

> Couldn't measure anything either, which is not surprising that I
> couldn't measure the overhead in the first place.
>
> I've done some parallel INSERT/DELETE pgbenching around the
> wal_level=logical and I couldn't measure any overhead with it
> disabled. With wal_level = logical, UPDATEs and DELETEs do get a bit
> slower, but that's to be expected.
>
> It'd probably not hurt to redo those benchmarks to make sure...

Yes, I think it would be good to characterize it more precisely than
"a bit", so people know what to expect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
> On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't see any need for SQL syntax.  I was just thinking that the
> >> _PG_init function could fill in a structure and then call
> >> RegisterLogicalReplicationOutputPlugin(&mystruct).
> >
> > Hm. We can do that, but what'd be the advantage of that? The current
> > model will correctly handle things like a'shared_preload_libraries'ed
> > output plugin, because its _PG_init() will not register it. With the
> > handling done in _PG_init() there would be two.
> > Being able to use the same .so for output plugin handling and some other
> > replication solution specific stuff is imo useful.
> 
> Well, I just think relying on specific symbol names in the .so file is
> kind of unfortunate.  It means that, for example, you can't have
> multiple output plugins provided by a single .so.  And in general I
> think it's something that we've tried to minimize.

But that's not really different when you rely on _PG_init doing its
thing, right?

> >> I don't see why you're so pessimistic about that.  I know you haven't
> >> worked it out yet, but what makes this harder than sitting down and
> >> designing something?
> >
> > Because every replication solution has different requirements for the
> > format and they will want filter the output stream with regard to their
> > own configuration.
> > E.g. bucardo will want to include the transaction timestamp for conflict
> > resolution and such.
> 
> But there's only so much information available here.  Why not just
> have a format that logs it all?

Because we do not know what "all" is? Also, how would we generically
handle things like replication sets, which all of the existing replication
solutions have?

> > Sure, that's no problem. Do I understand correctly that you'd like
> > wal_decoding: Add information about a tables primary key to struct RelationData
> > wal_decoding: Add wal_level = logical and log data required for logical decoding
> >
> > earlier?
> 
> Yes.

That's done. Hope the new order makes sense.

> > So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> > we both times need access to cmin and cmax, but neither transaction will
> > have created a multixact. That can only be an issue in transaction with
> > catalog modifications.

> Oh, yuck.  So that means you have to write an extra WAL record for
> EVERY heap insert, update, or delete to a catalog table?  OUCH.

Yes. We could integrate it into the main record without too many
problems, but it didn't seem like an important optimization and it would
have higher chances of slowing down wal_level < logical.

> > It'd probably not hurt to redo those benchmarks to make sure...
> 
> Yes, I think it would be good to characterize it more precisely than
> "a bit", so people know what to expect.

A "bit" was below the 3% range for loops of adding columns.

So, any tests you'd like to see?
* loop around CREATE TABLE/DROP TABLE
* loop around ALTER TABLE ... ADD COLUMN
* loop around CREATE FUNCTION/DROP FUNCTION

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-14 15:51:14 +0200, Andres Freund wrote:
> > > It'd probably not hurt to redo those benchmarks to make sure...
> >
> > Yes, I think it would be good to characterize it more precisely than
> > "a bit", so people know what to expect.
>
> A "bit" was below the 3% range for loops of adding columns.
>
> So, any tests you'd like to see?
> * loop around CREATE TABLE/DROP TABLE
> * loop around ALTER TABLE ... ADD COLUMN
> * loop around CREATE FUNCTION/DROP FUNCTION

So, see the attached benchmark script. I've always done both a disk-bound
and a memory-bound (using eatmydata, preventing fsyncs) run.

* unpatched run, wal_level = hot_standby, eatmydata
* unpatched run, wal_level = hot_standby

* patched run, wal_level = hot_standby, eatmydata
* patched run, wal_level = hot_standby

* patched run, wal_level = logical, eatmydata
* patched run, wal_level = logical

Based on those results, there's no difference above noise for
wal_level=hot_standby, with or without the patch. With wal_level=logical
there's a measurable increase in wal traffic (~12-17%), but no
performance decrease above noise.

From my POV that's ok, those are really crazy catalog workloads.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, I just think relying on specific symbol names in the .so file is
>> kind of unfortunate.  It means that, for example, you can't have
>> multiple output plugins provided by a single .so.  And in general I
>> think it's something that we've tried to minimize.
>
> But that's not really different when you rely on _PG_init doing it's
> thing, right?

Sure, that's true.  But in general I think magic symbol names aren't a
particularly good design.

>> But there's only so much information available here.  Why not just
>> have a format that logs it all?
>
> Because we do not know what "all" is? Also, how would we handle
> replication sets and such that all of the existing replication solutions
> have generically?

I don't see how you can fail to know what "all" is.  There's only a
certain set of facts available.  I mean you could log irrelevant crap
like a random number that you just picked or the sum of all numeric
values in the column, but nobody's likely to want that.  What people
are going to want is the operation performed (insert, update, or
delete), all the values in the new tuple, the key values from the old
tuple, the transaction ID, and maybe some meta-information about the
transaction (such as the commit timestamp).  What I'd probably do is
emit the data in CSV format, with the first column of each line being
a single character indicating what sort of row this is: H means a
header row, defining the format of subsequent rows
(H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
a new header row is emitted only when the column list changes); I, U,
or D means an insert, update, or delete, with column 2 being the
transaction ID, column 3 being the table name, and the remaining
columns matching the last header row for emitted for that table, T
means meta-information about a transaction, whatever we have (e.g.
T,txn_id,commit_time).  There's probably some further tweaking of that
that could be done, and I might be overlooking some salient details,
like maybe we want to indicate the column types as well as their
names, but the range of things that someone can want to do here is not
unlimited.  The point, for me anyway, is that someone can write a
crappy Perl script to apply changes from a file like this in a day.
My contention is that there are a lot of people who will want to do
just that, for one reason or another.  The plugin interface has
awesome power and flexibility, and really high-performance replication
solutions will really benefit from that.  But regular people don't
want to write C code; they just want to write a crappy Perl script.
And I think we can facilitate that without too much work.
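
For what it's worth, a stream in the format sketched above might look
something like this (purely illustrative - table, columns, and values
are made up):

H,accounts,id,balance,id
I,1001,accounts,1,100.00,
U,1001,accounts,1,150.00,1
D,1002,accounts,,,1
T,1001,2013-10-14 12:00:01
T,1002,2013-10-14 12:00:05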

>> Oh, yuck.  So that means you have to write an extra WAL record for
>> EVERY heap insert, update, or delete to a catalog table?  OUCH.
>
> Yes. We could integrate it into the main record without too many
> problems, but it didn't seem like an important optimization and it would
> have higher chances of slowing down wal_level < logical.

Hmm.  I don't know whether that's an important optimization or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> So, see the attatched benchmark skript. I've always done using a disk
> bound and a memory bound (using eatmydata, preventing fsyncs) run.
>
> * unpatched run, wal_level = hot_standby, eatmydata
> * unpatched run, wal_level = hot_standby
>
> * patched run, wal_level = hot_standby, eatmydata
> * patched run, wal_level = hot_standby
>
> * patched run, wal_level = logical, eatmydata
> * patched run, wal_level = logical
>
> Based on those results, there's no difference above noise for
> wal_level=hot_standby, with or without the patch. With wal_level=logical
> there's a measurable increase in wal traffic (~12-17%), but no
> performance decrease above noise.
>
> From my POV that's ok, those are really crazy catalog workloads.

Any increase in WAL traffic will translate into a performance hit once
the I/O channel becomes saturated, but I agree those numbers don't
sound terrible for that fairly brutal test case. Actually, I was more
concerned about the hit on non-catalog workloads.  pgbench isn't a
good test because the key column is so narrow; but suppose we have a
table like (a text, b integer, c text) where (a, c) is the primary key
and those strings are typically pretty long - say just short enough
that we can still index the column.  It'd be worth testing both
workloads where the primary key doesn't change (so the only overhead
is figuring out that we need not log it) and those where it does
(where we're double-logging most of the tuple).  I assume the latter
has to produce a significant hit to WAL volume, and I don't think
there's much we can do about that; but the former had better be nearly
free.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 08:42:20 -0400, Robert Haas wrote:
> On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Well, I just think relying on specific symbol names in the .so file is
> >> kind of unfortunate.  It means that, for example, you can't have
> >> multiple output plugins provided by a single .so.  And in general I
> >> think it's something that we've tried to minimize.
> >
> > But that's not really different when you rely on _PG_init doing it's
> > thing, right?
> 
> Sure, that's true.  But in general I think magic symbol names aren't a
> particularly good design.

It allows you to use the shared library both as a normal extension loaded
via shared_preload_libraries or ad hoc and as an output plugin, which seems
like a sensible goal.
We could have a single _PG_init_output_plugin() symbol that fills in
such a struct which would then not conflict with using the .so
independently. If you prefer that I'll change things around.

We can't do something like 'output_plugin_in_progress' before calling
_PG_init() because _PG_init() won't be called again if the shared object
is already loaded...

> >> But there's only so much information available here.  Why not just
> >> have a format that logs it all?
> >
> > Because we do not know what "all" is? Also, how would we handle
> > replication sets and such that all of the existing replication solutions
> > have generically?
> 
> I don't see how you can fail to know what "all" is.  There's only a
> certain set of facts available.  I mean you could log irrelevant crap
> like a random number that you just picked or the sum of all numeric
> values in the column, but nobody's likely to want that.  What people
> are going to want is the operation performed (insert, update, or
> delete), all the values in the new tuple, the key values from the old
> tuple, the transaction ID, and maybe some meta-information about the
> transaction (such as the commit timestamp).

Some will want all column names included because that makes replication
into different schemas/databases easier; others won't, because it makes
replicating the data more complicated and expensive.
Lots will want the primary key as a separate set of columns even for
inserts, others not.
There are also the datatypes of the values and the NULL representation to consider.

> What I'd probably do is
> emit the data in CSV format, with the first column of each line being
> a single character indicating what sort of row this is: H means a
> header row, defining the format of subsequent rows
> (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
> a new header row is emitted only when the column list changes); I, U,
> or D means an insert, update, or delete, with column 2 being the
> transaction ID, column 3 being the table name, and the remaining
> columns matching the last header row for emitted for that table, T
> means meta-information about a transaction, whatever we have (e.g.
> T,txn_id,commit_time).

There's two issues I have with this:
a) CSV seems like a bad format for this. If a transaction inserts into
multiple tables the number of columns will constantly change. Many CSV
parsers don't deal with that all too gracefully. E.g. you can't even
load the data into another postgres database as an audit log.

If we go for CSV I think we should put the entire primary key as one
column (containing all the columns) and the entire row as another.

We also don't have any nice facilities for actually writing CSV - so
we'll need to start extracting escaping code from COPY. In the end all
that will make the output plugin very hard to use as an example because
the code will get more complicated.

b) Emitting new row descriptors every time the schema changes will
require keeping track of the schema. I think that won't be trivial. It
also makes consumption of the data more complicated in comparison to
including the description with every row.

Both are even more true once we extend the format to support streaming
of transactions while they are performed.

> But regular people don't want to write C code; they just want to write
> a crappy Perl script.  And I think we can facilitate that without too
> much work.

I think the generic output plugin should be a separate one from the
example one (which is the one included in the patchset).

> >> Oh, yuck.  So that means you have to write an extra WAL record for
> >> EVERY heap insert, update, or delete to a catalog table?  OUCH.
> >
> > Yes. We could integrate it into the main record without too many
> > problems, but it didn't seem like an important optimization and it would
> > have higher chances of slowing down wal_level < logical.
> 
> Hmm.  I don't know whether that's an important optimization or not.

Based on the benchmark I'd say no. If we discover we need to go there we
can do so later. I don't forsee this to be really problematic.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 08:49:26 -0400, Robert Haas wrote:
> On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > So, see the attatched benchmark skript. I've always done using a disk
> > bound and a memory bound (using eatmydata, preventing fsyncs) run.
> >
> > * unpatched run, wal_level = hot_standby, eatmydata
> > * unpatched run, wal_level = hot_standby
> >
> > * patched run, wal_level = hot_standby, eatmydata
> > * patched run, wal_level = hot_standby
> >
> > * patched run, wal_level = logical, eatmydata
> > * patched run, wal_level = logical
> >
> > Based on those results, there's no difference above noise for
> > wal_level=hot_standby, with or without the patch. With wal_level=logical
> > there's a measurable increase in wal traffic (~12-17%), but no
> > performance decrease above noise.
> >
> > From my POV that's ok, those are really crazy catalog workloads.
> 
> Any increase in WAL traffic will translate into a performance hit once
> the I/O channel becomes saturated, but I agree those numbers don't
> sound terrible for that faily-brutal test case.

Well, the parallel workloads were fsync-saturated although probably not
throughput-saturated; that's why I added them. But yes, it's not the same as a
throughput-saturated IO channel.
Probably the worst case real-world workload is one that uses lots and
lots of ON COMMIT DROP temporary tables.

> Actually, I was more concerned about the hit on non-catalog workloads.  pgbench isn't a
> good test because the key column is so narrow; but suppose we have a
> table like (a text, b integer, c text) where (a, c) is the primary key
> and those strings are typically pretty long - say just short enough
> that we can still index the column.  It'd be worth testing both
> workloads where the primary key doesn't change (so the only overhead
> is figuring out that we need not log it) and those where it does
> (where we're double-logging most of the tuple).  I assume the latter
> has to produce a significant hit to WAL volume, and I don't think
> there's much we can do about that; but the former had better be nearly
> free.

Ah, ok. Then I misunderstood you.

Is there a specific overhead you are "afraid" of in the
pkey-doesn't-change scenario? The changed wal logging (buffer in a
separate rdata entry) or the check whether the primary key has changed?

The only way I have been able to measure differences in that scenario
was to load a table with a low fillfactor and wide tuples, checkpoint,
and then update lots of rows. On wal_level=logical that will result in
full-page-images and tuple data being logged which can be noticeable if
you have really large tuples, even if the pkey doesn't change.

We could optimize that by not actually logging the tuple data in that
case but just include the tid so we could extract things from the Bkp
block ourselves. But that will complicate the code and doesn't yet seem
warranted.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
> If we go for CSV I think we should put the entire primary key as one
> column (containing all the columns) and the entire row another.

What about columns like:
* action B|I|U|D|C

* xid
* timestamp

* tablename

* key name
* key column names
* key column types

* new key column values

* column names
* column types
* column values

* candidate_key_changed?
* old key column values

And have output plugin options
* include-column-types
* include-column-names
* include-primary-key

If something isn't included it's simply left out.
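
To make that concrete, with all options enabled a single change might come
out roughly like this (purely illustrative; the {a;b} grouping is just a
placeholder for whatever multi-value encoding we settle on below):

U,1001,2013-10-15 15:17:58,public.accounts,accounts_pkey,{id},{int4},{1},{id;balance},{int4;numeric},{1;150.00},f,{}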

What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It allows you to use the shared libary both as a normal extension loaded
> via shared_preload_library or adhoc and as an output plugin which seems
> like a sensible goal.
> We could have a single _PG_init_output_plugin() symbol that fills in
> such a struct which would then not conflict with using the .so
> independently. If you prefer that I'll change things around.

I think part of the problem may be that you're using the library name
to identify the output plugin.  I'm not excited about that design.
For functions, you give the function a name and that is a pointer to
where to actually find the function, which may be a 2-tuple
<library-name, function-name>, or perhaps just a 1-tuple
<builtin-function-name>, or maybe the whole text of a PL/pgsql
procedure that should be compiled.

Perhaps this ought to work similarly.  Create a function in pg_proc
which returns the structure containing the function pointers.  Then,
when that output plugin is selected, it'll automatically trigger
loading the correct shared library if that's needed; and the shared
library name may (but need not) match the output plugin name.
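
To sketch what I mean (everything here is hypothetical - the struct, the
field names, and the notion of an "output plugin handler" function are
just patterned on how FDW handlers work):

/* hypothetical: a pg_proc-registered handler that returns the callbacks */
PG_FUNCTION_INFO_V1(my_output_plugin_handler);

Datum
my_output_plugin_handler(PG_FUNCTION_ARGS)
{
    static LogicalOutputPluginCallbacks callbacks;

    callbacks.init_cb   = pg_decode_init;
    callbacks.change_cb = pg_decode_change;
    callbacks.commit_cb = pg_decode_commit;

    PG_RETURN_POINTER(&callbacks);
}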

>> What I'd probably do is
>> emit the data in CSV format, with the first column of each line being
>> a single character indicating what sort of row this is: H means a
>> header row, defining the format of subsequent rows
>> (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
>> a new header row is emitted only when the column list changes); I, U,
>> or D means an insert, update, or delete, with column 2 being the
>> transaction ID, column 3 being the table name, and the remaining
>> columns matching the last header row for emitted for that table, T
>> means meta-information about a transaction, whatever we have (e.g.
>> T,txn_id,commit_time).
>
> There's two issues I have with this:
> a) CSV seems like a bad format for this. If a transaction inserts into
> multiple tables the number of columns will constantly change. Many CSV
> parsers don't deal with that all too gracefully. E.g. you can't even
> load the data into another postgres database as an audit log.

We can pick some other separator.  I don't think ragged CSV is a big
problem; I'm actually more worried about having an easy way to handle
embedded commas and newlines and so on.  But I'd be fine with
tab-separated data or something too, if you think that's better.  What
I want is something that someone can parse with a script that can be
written in a reasonable amount of time in their favorite scripting
language.  I predict that if we provide something like this we'll
vastly expand the number of users who can make use of this new
functionality.

User: So, what's new in PostgreSQL 9.4?
Hacker: Well, now we have logical replication!
User: Why is that cool?
Hacker: Well, streaming replication is awesome for HA, but it has
significant limitations.  And trigger-based systems are very mature,
but the overhead is high and their lack of core integration makes them
hard to use.  With this technology, you can build systems that will
replicate individual tables or even parts of tables, multi-master
systems, and lots of other cool stuff.
User: Wow, that sounds great.  How do I use it?
Hacker: Well, first you write an output plugin in C using a special API.
User: Hey, do you know whether the MongoDB guys came to this conference?

Let's try that again.

User: Wow, that sounds great.  How do I use it?
Hacker: Well, currently, the output gets dumped as a series of text
files that are designed to be parsed using a scripting language.  We
have sample parsers written in Perl and Python that you can use as-is
or hack up to meet your needs.

Now, some users are still going to head for the hills.  But at least
from where I sit it sounds a hell of a lot better than the first
answer.  We're not going to solve all of the tooling problems around
this technology in one release, for sure.  But as far as 95% of our
users are concerned, a C API might as well not exist at all.  People
WILL try to machine parse the output of whatever demo plugins we
provide; so I think we should try hard to provide at least one such
plugin that is designed to make that as easy as possible.

> If we go for CSV I think we should put the entire primary key as one
> column (containing all the columns) and the entire row another.
>
> We also don't have any nice facilities for actually writing CSV - so
> we'll need to start extracting escaping code from COPY. In the end all
> that will make the output plugin very hard to use as an example because
> the code will get more complicated.
>
> b) Emitting new row descriptors everytime the schema changes will
> require keeping track of the schema. I think that won't be trivial. It
> also makes consumption of the data more complicated in comparison to
> including the description with every row.
>
> Both are even more true once we extend the format to support streaming
> of transactions while they are performed.

All fair points, but IMHO this is exactly why we need to provide a
well-written output plugin, not leave it to users to solve these
problems.

>> But regular people don't want to write C code; they just want to write
>> a crappy Perl script.  And I think we can facilitate that without too
>> much work.
>
> I think the generic output plugin should be a separate one from the
> example one (which is the one included in the patchset).

That's OK with me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Hannu Krosing
Date:
On 10/15/2013 01:42 PM, Robert Haas wrote:
> On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Well, I just think relying on specific symbol names in the .so file is
>>> kind of unfortunate.  It means that, for example, you can't have
>>> multiple output plugins provided by a single .so.  And in general I
>>> think it's something that we've tried to minimize.
>> But that's not really different when you rely on _PG_init doing it's
>> thing, right?
> Sure, that's true.  But in general I think magic symbol names aren't a
> particularly good design.
>
>>> But there's only so much information available here.  Why not just
>>> have a format that logs it all?
>> Because we do not know what "all" is? Also, how would we handle
>> replication sets and such that all of the existing replication solutions
>> have generically?
> I don't see how you can fail to know what "all" is.  
We instinctively know what "all" is - as in the famous case of the buddhist
ordering a hamburger - "Make me All wit Everything" :) - but the requirements
of different replication systems vary wildly.

> ...
> What people
> are going to want is the operation performed (insert, update, or
> delete), all the values in the new tuple, the key values from the old
> tuple, 
For multi-master / conflict resolution you may also want all old
values to make sure that they have not changed on target.

The difference in WAL volume can be really significant, especially
in the case of DELETE, where there are no new column values at all.

For some forms of conflict resolution we may even want to know
the database user who initiated the operation, and possibly even
some session variables like "very_important=yes".

> ... The point, for me anyway, is that someone can write a
> crappy Perl script to apply changes from a file like this in a day.
> My contention is that there are a lot of people who will want to do
> just that, for one reason or another.  The plugin interface has
> awesome power and flexibility, and really high-performance replication
> solutions will really benefit from that.  But regular people don't
> want to write C code; they just want to write a crappy Perl script.
> And I think we can facilitate that without too much work.
Just provide a to-csv or to-json plugin and the crappy Perl guys
are happy.




Re: logical changeset generation v6.2

From
Hannu Krosing
Date:
On 10/15/2013 02:47 PM, Andres Freund wrote:
> On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
>> If we go for CSV I think we should put the entire primary key as one
>> column (containing all the columns) and the entire row another.
just use JSON :)
>> What about columns like:
>> * action B|I|U|D|C
>>
>> * xid
>> * timestamp
>>
>> * tablename
>>
>> * key name
>> * key column names
>> * key column types
>>
>> * new key column values
>>
>> * column names
>> * column types
>> * column values
>>
>> * candidate_key_changed?
>> * old key column values
>>
>> And have output plugin options
>> * include-column-types
>> * include-column-names
>> * include-primary-key
>>
>> If something isn't included it's simply left out.
>>
>> What still need to be determined is:
>> * how do we separate and escape multiple values in one CSV column
>> * how do we represent NULLs

Or borrow whatever is possible from pg_dump, as it has
needed to solve most of the same problems already, and
consistency is good in general.
>
> Greetings,
>
> Andres Freund
>




Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
>> If we go for CSV I think we should put the entire primary key as one
>> column (containing all the columns) and the entire row another.
>
> What about columns like:
> * action B|I|U|D|C

BEGIN and COMMIT?

> * xid
> * timestamp
>
> * tablename
>
> * key name
> * key column names
> * key column types
>
> * new key column values
>
> * column names
> * column types
> * column values
>
> * candidate_key_changed?
> * old key column values

Repeating the column names for every row strikes me as a nonstarter.
If the plugin interface isn't rich enough to provide a convenient way
to avoid that, then it needs to be fixed so that it is, because it
will be a common requirement.  Sure, some people may want JSON or XML
output that reiterates the labels every time, but for a lot of people
that's going to greatly increase the size of the output and be
undesirable for that reason.

> What still need to be determined is:
> * how do we separate and escape multiple values in one CSV column
> * how do we represent NULLs

I consider the escaping a key design decision.  Ideally, it should be
something that's easy to reverse from a scripting language; ideally
also, it should be something similar to how we handle COPY.  These
goals may be in conflict; we'll have to pick something.

I'm not sure that having multiple values in one column is a good plan,
because now you need multiple levels of parsing to unpack the row.
I'd rather just have a flat column list with a key somewhere
explaining how to interpret the data.  But I'm prepared to give in on
that point so long as we can demonstrate that the format can be easily
parsed.
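
As a rough illustration of what I mean by easily parsed - the tab-separated
layout, the field order, and the COPY-text-style escaping below are nothing
but assumptions for the sake of example - a consumer could be as small as:

import sys

# COPY-text-style unescaping for one field; "\N" denotes NULL.
ESCAPES = {'n': '\n', 't': '\t', 'r': '\r', 'b': '\b',
           'f': '\f', 'v': '\v', '\\': '\\'}

def unescape(field):
    if field == r'\N':
        return None
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\' and i + 1 < len(field):
            out.append(ESCAPES.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

def parse_change(line):
    # Hypothetical flat layout: action, xid, table, then column values.
    action, xid, table, *values = line.rstrip('\n').split('\t')
    return action, int(xid), table, [unescape(v) for v in values]

for line in sys.stdin:
    print(parse_change(line))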

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 10:09 AM, Hannu Krosing <hannu@krosing.net> wrote:
>> I don't see how you can fail to know what "all" is.
> We instinctively know what "all" is - as in the famous case of buddhist
> ordering a
> hamburger - "Make me All wit Everything" :) - but the requirements of
> different  replications systems vary wildly.

That's true to some degree, but let's not exaggerate the degree to
which it is true.

> For multi-master / conflict resolution you may also want all old
> values to make sure that they have not changed on target.

The patch as proposed doesn't make that information available.  If you
want that to be an option, now would be the right time to argue for
it.

> for some forms of conflict resolution we may even want to know
> the database user who initiated the operation. and possibly even
> some session variables like "very_important=yes".

Well, if you have requirements like logging very_important=yes, then
you're definitely into the territory where you need your own output
plugin.  I have no problem telling people who want that sort of thing
that they've got to go write C code.  What I'm trying to do, as Larry
Wall once said, is to make simple things simple and hard things
possible.  The output plugin interface accomplishes the latter, but,
by itself, not the former.

>> And I think we can facilitate that without too much work.
> just provide a to-csv or to-json plugin and the crappy perl guys
> are happy.

Yep, that's exactly what I'm advocating for.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 10:20:55 -0400, Robert Haas wrote:
> > For multi-master / conflict resolution you may also want all old
> > values to make sure that they have not changed on target.
> 
> The patch as proposed doesn't make that information available.  If you
> want that to be an option, now would be the right time to argue for
> it.

I don't think you necessarily want it for most MM solutions, but I agree
it will be useful for some scenarios.

I think the ReorderBufferChange struct needs a better way to distinguish
between old-key and old-tuple now, but I'd rather implement the
facility for logging the full old tuple in a separate patch. The
patchset is big enough as is, let's not tack on more features.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 10:09:05 -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > It allows you to use the shared libary both as a normal extension loaded
> > via shared_preload_library or adhoc and as an output plugin which seems
> > like a sensible goal.
> > We could have a single _PG_init_output_plugin() symbol that fills in
> > such a struct which would then not conflict with using the .so
> > independently. If you prefer that I'll change things around.
> 
> I think part of the problem may be that you're using the library name
> to identify the output plugin.  I'm not excited about that design.
> For functions, you give the function a name and that is a pointer to
> where to actually find the function, which may be a 2-tuple
> <library-name, function-name>, or perhaps just a 1-tuple
> <builtin-function-name>, or maybe the whole text of a PL/pgsql
> procedure that should be compiled.

That means you allow trivial remote code execution since you could try
to load system() or something else that's available in every shared
object. Now you can argue that that's OK since we have special checks
for replication connections, but I'd rather not go there.

> Perhaps this ought to work similarly.  Create a function in pg_proc
> which returns the structure containing the function pointers.  Then,
> when that output plugin is selected, it'll automatically trigger
> loading the correct shared library if that's needed; and the shared
> library name may (but need not) match the output plugin name.

I'd like to avoid relying on inserting stuff into pg_proc because that
makes it harder to extract WAL from a HS standby. Requiring that to be
configured on the primary in order to extract data on the standby seems
confusing to me.

But perhaps that's the correct solution :/

> Now, some users are still going to head for the hills.  But at least
> from where I sit it sounds a hell of a lot better than the first
> answer.  We're not going to solve all of the tooling problems around
> this technology in one release, for sure.  But as far as 95% of our
> users are concerned, a C API might as well not exist at all.  People
> WILL try to machine parse the output of whatever demo plugins we
> provide; so I think we should try hard to provide at least one such
> plugin that is designed to make that as easy as possible.

Well, just providing the C API + an example in a first step didn't work
out too badly for FDWs. I am pretty sure that once released there will
soon be extensions for it on PGXN or whatever for special use cases.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I think part of the problem may be that you're using the library name
>> to identify the output plugin.  I'm not excited about that design.
>> For functions, you give the function a name and that is a pointer to
>> where to actually find the function, which may be a 2-tuple
>> <library-name, function-name>, or perhaps just a 1-tuple
>> <builtin-function-name>, or maybe the whole text of a PL/pgsql
>> procedure that should be compiled.
>
> That means you allow trivial remote code execution since you could try
> to load system() or something else that's available in every shared
> object. Now you can argue that that's OK since we have special checks
> for replication connections, but I'd rather not go there.

Well, obviously you can't let somebody load any library they want.
But that's pretty much true anyway; LOAD had better be confined to
superusers unless there is something (like a pg_proc entry) that
provides "prior authorization" for that specific load.

>> Perhaps this ought to work similarly.  Create a function in pg_proc
>> which returns the structure containing the function pointers.  Then,
>> when that output plugin is selected, it'll automatically trigger
>> loading the correct shared library if that's needed; and the shared
>> library name may (but need not) match the output plugin name.
>
> I'd like to avoid relying on inserting stuff into pg_proc because that
> makes it harder to extract WAL from a HS standby. Requiring to configure
> that on the primary to extract data on the standby seems confusing to
> me.
>
> But perhaps that's the correct solution :/

That's a reasonable concern.  I don't have another idea at the moment,
unless we want to allow replication connections to issue LOAD
commands.  Then you can LOAD the library, so that the plug-in is
registered under the well-known name you expect it to have, and then
use that name to start replication.

>> Now, some users are still going to head for the hills.  But at least
>> from where I sit it sounds a hell of a lot better than the first
>> answer.  We're not going to solve all of the tooling problems around
>> this technology in one release, for sure.  But as far as 95% of our
>> users are concerned, a C API might as well not exist at all.  People
>> WILL try to machine parse the output of whatever demo plugins we
>> provide; so I think we should try hard to provide at least one such
>> plugin that is designed to make that as easy as possible.
>
> Well, just providing the C API + an example in a first step didn't work
> out too badly for FDWs. I am pretty sure that once released there will
> soon be extensions for it on PGXN or whatever for special usecases.

I suspect so, too.  But I also think that if that's the only thing
available in the first release, a lot of users will get a poor initial
impression.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 10:15:14 -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
> >> If we go for CSV I think we should put the entire primary key as one
> >> column (containing all the columns) and the entire row another.
> >
> > What about columns like:
> > * action B|I|U|D|C
> 
> BEGIN and COMMIT?

That's B and C, yes. You'd rather not have them? When would you replay
the commit without an explicit message telling you to?

> Repeating the column names for every row strikes me as a nonstarter.
> [...]
> Sure, some people may want JSON or XML
> output that reiterates the labels every time, but for a lot of people
> that's going to greatly increase the size of the output and be
> undesirable for that reason.

But I argue that most simpler users - which are exactly the ones a
generic output plugin is aimed at - will want all column names since it
makes replay far easier.

> If the plugin interface isn't rich enough to provide a convenient way
> to avoid that, then it needs to be fixed so that it is, because it
> will be a common requirement.

Oh, it surely is possible to avoid repeating it. The output plugin
interface simply gives you a relcache entry that contains everything
necessary.
The output plugin would need to keep track of whether it has output data
for a specific relation and it would need to check whether the table
definition has changed, but I don't see how we could avoid that?
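
To sketch that bookkeeping (Python only for brevity; the real thing lives
in the C output plugin, and the emit()/column names here are invented):

seen = {}  # relation oid -> tuple of column names last emitted

def maybe_emit_descriptor(rel_oid, columns, emit):
    # Emit a row descriptor only when the column list differs from
    # whatever we last emitted for this relation.
    if seen.get(rel_oid) != tuple(columns):
        seen[rel_oid] = tuple(columns)
        emit({'type': 'descriptor', 'relation': rel_oid,
              'columns': list(columns)})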

> > What still need to be determined is:
> > * how do we separate and escape multiple values in one CSV column
> > * how do we represent NULLs
> 
> I consider the escaping a key design decision.  Ideally, it should be
> something that's easy to reverse from a scripting language; ideally
> also, it should be something similar to how we handle COPY.  These
> goals may be in conflict; we'll have to pick something.

Note that parsing COPYs is a major PITA from most languages...

Perhaps we should make the default output json instead? With every
action terminated by a nullbyte?
That's probably easier to parse from various scripting languages than
anything else.
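
Something along these lines would be all a consumer needs (the framing is
just the NUL-terminated-JSON idea from above; the keys inside the documents
are invented for the example):

import io
import json

def read_changes(stream):
    # Yield one decoded change per NUL-terminated JSON document.
    buf = b''
    while True:
        chunk = stream.read(4096)
        if not chunk:
            break
        buf += chunk
        while b'\x00' in buf:
            doc, _, buf = buf.partition(b'\x00')
            yield json.loads(doc.decode('utf-8'))

# Stand-in for the real change stream:
fake = io.BytesIO(b'{"action":"I","table":"foo","values":{"id":1}}\x00'
                  b'{"action":"C","xid":705}\x00')
for change in read_changes(fake):
    print(change['action'], change)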

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 10:34:53 -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I think part of the problem may be that you're using the library name
> >> to identify the output plugin.  I'm not excited about that design.
> >> For functions, you give the function a name and that is a pointer to
> >> where to actually find the function, which may be a 2-tuple
> >> <library-name, function-name>, or perhaps just a 1-tuple
> >> <builtin-function-name>, or maybe the whole text of a PL/pgsql
> >> procedure that should be compiled.
> >
> > That means you allow trivial remote code execution since you could try
> > to load system() or something else that's available in every shared
> > object. Now you can argue that that's OK since we have special checks
> > for replication connections, but I'd rather not go there.
> 
> Well, obviously you can't let somebody load any library they want.
> But that's pretty much true anyway; LOAD had better be confined to
> superusers unless there is something (like a pg_proc entry) that
> provides "prior authorization" for that specific load.

Currently you can create users that have permissions for replication but
which are not superusers. I think we should strive to provide that
capability for changeset extraction as well.

> >> Perhaps this ought to work similarly.  Create a function in pg_proc
> >> which returns the structure containing the function pointers.  Then,
> >> when that output plugin is selected, it'll automatically trigger
> >> loading the correct shared library if that's needed; and the shared
> >> library name may (but need not) match the output plugin name.
> >
> > I'd like to avoid relying on inserting stuff into pg_proc because that
> > makes it harder to extract WAL from a HS standby. Requiring to configure
> > that on the primary to extract data on the standby seems confusing to
> > me.
> >
> > But perhaps that's the correct solution :/
> 
> That's a reasonable concern.  I don't have another idea at the moment,
> unless we want to allow replication connections to issue LOAD
> commands.  Then you can LOAD the library, so that the plug-in is
> registered under the well-known name you expect it to have, and then
> use that name to start replication.

But what's the advantage of that over the current situation or one where
PG_load_output_plugin() is called? The current and related
implementations only allow you to load libraries from some designated
postgres directories and don't allow you to call arbitrary
functions in there.

Would you be content with a symbol "PG_load_output_plugin" being called
that fills out the actual callbacks?

> >> Now, some users are still going to head for the hills.  But at least
> >> from where I sit it sounds a hell of a lot better than the first
> >> answer.  We're not going to solve all of the tooling problems around
> >> this technology in one release, for sure.  But as far as 95% of our
> >> users are concerned, a C API might as well not exist at all.  People
> >> WILL try to machine parse the output of whatever demo plugins we
> >> provide; so I think we should try hard to provide at least one such
> >> plugin that is designed to make that as easy as possible.
> >
> > Well, just providing the C API + an example in a first step didn't work
> > out too badly for FDWs. I am pretty sure that once released there will
> > soon be extensions for it on PGXN or whatever for special usecases.
> 
> I suspect so, too.  But I also think that if that's the only thing
> available in the first release, a lot of users will get a poor initial
> impression.

I think lots of people will expect a builtin logical replication
solution :/. Which seems a tad unlikely to arrive in 9.4.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 10:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > What about columns like:
>> > * action B|I|U|D|C
>>
>> BEGIN and COMMIT?
>
> That's B and C, yes. You'd rather not have them? When would you replay
> the commit without an explicit message telling you to?

No, BEGIN and COMMIT sounds good, actually.  Just wanted to make sure
I understood.

>> Repeating the column names for every row strikes me as a nonstarter.
>> [...]
>> Sure, some people may want JSON or XML
>> output that reiterates the labels every time, but for a lot of people
>> that's going to greatly increase the size of the output and be
>> undesirable for that reason.
>
> But I argue that most simpler users - which are exactly the ones a
> generic output plugin is aimed at - will want all column names since it
> makes replay far easier.

Meh, maybe.

>> If the plugin interface isn't rich enough to provide a convenient way
>> to avoid that, then it needs to be fixed so that it is, because it
>> will be a common requirement.
>
> Oh, it surely is possibly to avoid repeating it. The output plugin
> interface simply gives you a relcache entry, that contains everything
> necessary.
> The output plugin would need to keep track of whether it has output data
> for a specific relation and it would need to check whether the table
> definition has changed, but I don't see how we could avoid that?

Well, it might be nice if there were a callback for, hey, schema has
changed!  Seems like a lot of plugins will want to know that for one
reason or another, and rechecking for every tuple sounds expensive.

>> > What still need to be determined is:
>> > * how do we separate and escape multiple values in one CSV column
>> > * how do we represent NULLs
>>
>> I consider the escaping a key design decision.  Ideally, it should be
>> something that's easy to reverse from a scripting language; ideally
>> also, it should be something similar to how we handle COPY.  These
>> goals may be in conflict; we'll have to pick something.
>
> Note that parsing COPYs is a major PITA from most languages...
>
> Perhaps we should make the default output json instead? With every
> action terminated by a nullbyte?
> That's probably easier to parse from various scripting languages than
> anything else.

I could go for that.  It's not quite as compact as I might hope, but
JSON does seem to make people awfully happy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 10:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > That means you allow trivial remote code execution since you could try
>> > to load system() or something else that's available in every shared
>> > object. Now you can argue that that's OK since we have special checks
>> > for replication connections, but I'd rather not go there.
>>
>> Well, obviously you can't let somebody load any library they want.
>> But that's pretty much true anyway; LOAD had better be confined to
>> superusers unless there is something (like a pg_proc entry) that
>> provides "prior authorization" for that specific load.
>
> Currently you can create users that have permissions for replication but
> which are not superusers. I think we should strive to providing that
> capability for changeset extraction as well.

I agree.

>> >> Perhaps this ought to work similarly.  Create a function in pg_proc
>> >> which returns the structure containing the function pointers.  Then,
>> >> when that output plugin is selected, it'll automatically trigger
>> >> loading the correct shared library if that's needed; and the shared
>> >> library name may (but need not) match the output plugin name.
>> >
>> > I'd like to avoid relying on inserting stuff into pg_proc because that
>> > makes it harder to extract WAL from a HS standby. Requiring to configure
>> > that on the primary to extract data on the standby seems confusing to
>> > me.
>> >
>> > But perhaps that's the correct solution :/
>>
>> That's a reasonable concern.  I don't have another idea at the moment,
>> unless we want to allow replication connections to issue LOAD
>> commands.  Then you can LOAD the library, so that the plug-in is
>> registered under the well-known name you expect it to have, and then
>> use that name to start replication.
>
> But what's the advantage of that over the current situation or one where
> PG_load_output_plugin() is called? The current and related
> implementations allow you to only load libraries in some designated
> postgres directories and it doesn't allow you to call any arbitrary
> functions in there.

Well, I've already said why I don't like conflating the library name
and the plugin name.  It rules out core plugins and libraries that
provide multiple plugins.  I don't have anything to add to that.

> Would you be content with a symbol "PG_load_output_plugin" being called
> that fills out the actual callbacks?

Well, it doesn't fix the muddling of library names with output plugin
names, but I suppose I'd find it a modest improvement.

>> > Well, just providing the C API + an example in a first step didn't work
>> > out too badly for FDWs. I am pretty sure that once released there will
>> > soon be extensions for it on PGXN or whatever for special usecases.
>>
>> I suspect so, too.  But I also think that if that's the only thing
>> available in the first release, a lot of users will get a poor initial
>> impression.
>
> I think lots of people will expect a builtin logical replication
> solution :/. Which seems a tad unlikely to arrive in 9.4.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
"ktm@rice.edu"
Date:
On Tue, Oct 15, 2013 at 11:02:39AM -0400, Robert Haas wrote:
> >> goals may be in conflict; we'll have to pick something.
> >
> > Note that parsing COPYs is a major PITA from most languages...
> >
> > Perhaps we should make the default output json instead? With every
> > action terminated by a nullbyte?
> > That's probably easier to parse from various scripting languages than
> > anything else.
> 
> I could go for that.  It's not quite as compact as I might hope, but
> JSON does seem to make people awfully happy.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 

Feeding such a JSON stream into a compression algorithm like lz4 or
snappy should result in a pretty compact stream. The latest lz4 updates
also have the ability to use a pre-existing dictionary, which would
really help remove the redundant pieces.
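
For what it's worth, a rough sketch of the effect, with zlib's
preset-dictionary support standing in for lz4's (the JSON keys are
invented; the point is just that seeding the dictionary with the
recurring key and table names shrinks each small document a lot):

import zlib

dictionary = b'{"action":"","table":"public.foo","columns":{"id":,"name":""}}'

changes = [
    b'{"action":"I","table":"public.foo","columns":{"id":1,"name":"a"}}',
    b'{"action":"U","table":"public.foo","columns":{"id":1,"name":"b"}}',
]

comp = zlib.compressobj(zdict=dictionary)
compressed = b''.join(comp.compress(c) for c in changes) + comp.flush()

decomp = zlib.decompressobj(zdict=dictionary)
print(len(b''.join(changes)), '->', len(compressed))
assert decomp.decompress(compressed) == b''.join(changes)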

Regards,
Ken



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-15 11:02:39 -0400, Robert Haas wrote:
> >> If the plugin interface isn't rich enough to provide a convenient way
> >> to avoid that, then it needs to be fixed so that it is, because it
> >> will be a common requirement.
> >
> > Oh, it surely is possibly to avoid repeating it. The output plugin
> > interface simply gives you a relcache entry, that contains everything
> > necessary.
> > The output plugin would need to keep track of whether it has output data
> > for a specific relation and it would need to check whether the table
> > definition has changed, but I don't see how we could avoid that?
> 
> Well, it might be nice if there were a callback for, hey, schema has
> changed!  Seems like a lot of plugins will want to know that for one
> reason or another, and rechecking for every tuple sounds expensive.

I don't really see how we could provide that in any useful manner. We
could provide a callback that is called whenever another transaction has
changed the schema, but there's nothing that can easily be done about
schema changes by the replayed transaction itself. And those are the only
ones where meaningful schema changes can happen, since the locks the
source transaction has held will prevent most other schema changes.

As much as I hate such code, I guess checking (and possibly storing) the
ctid||xmin of the pg_class row is the easiest thing we could do :(.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Josh Berkus
Date:
On 10/15/2013 07:56 AM, Andres Freund wrote:
>>> Well, just providing the C API + an example in a first step didn't work
>>> > > out too badly for FDWs. I am pretty sure that once released there will
>>> > > soon be extensions for it on PGXN or whatever for special usecases.
>> > 
>> > I suspect so, too.  But I also think that if that's the only thing
>> > available in the first release, a lot of users will get a poor initial
>> > impression.
> I think lots of people will expect a builtin logical replication
> solution :/. Which seems a tad unlikely to arrive in 9.4.

Well, last I checked the Slony team is hard at work on building
something which will be based on logical changesets.  So there will
likely be at least one tool available shortly after 9.4 is released.

A good and flexible API is, IMHO, more important than having any
finished solution.  The whole reason why logical replication was outside
core PG for so long is that replication systems have differing and
mutually incompatible goals.  A good API can support all of those goals;
a user-level tool, no matter how good, can't.

And, frankly, once the API is built, how hard will it be to write a
script which does the simplest replication approach (replay all
statements on slave)?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: logical changeset generation v6.2

From
David Fetter
Date:
On Tue, Oct 15, 2013 at 10:09:05AM -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> User: So, what's new in PostgreSQL 9.4?
> Hacker: Well, now we have logical replication!
> User: Why is that cool?
> Hacker: Well, streaming replication is awesome for HA, but it has
> significant limitations.  And trigger-based systems are very mature,
> but the overhead is high and their lack of core integration makes them
> hard to use.  With this technology, you can build systems that will
> replicate individual tables or even parts of tables, multi-master
> systems, and lots of other cool stuff.
> User: Wow, that sounds great.  How do I use it?
> Hacker: Well, first you write an output plugin in C using a special API.
> User: Hey, do you know whether the MongoDB guys came to this conference?
> 
> Let's try that again.
> 
> User: Wow, that sounds great.  How do I use it?
> Hacker: Well, currently, the output gets dumped as a series of text
> files that are designed to be parsed using a scripting language.  We
> have sample parsers written in Perl and Python that you can use as-is
> or hack up to meet your needs.

My version:

Hacker: the output gets dumped as a series of JSON files.  We have
docs for this rev of the format and examples of consumers in Perl and
Python you can use as-is or hack up to meet your needs.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: logical changeset generation v6.2

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 7:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Let's try that again.
>
> User: Wow, that sounds great.  How do I use it?
> Hacker: Well, currently, the output gets dumped as a series of text
> files that are designed to be parsed using a scripting language.  We
> have sample parsers written in Perl and Python that you can use as-is
> or hack up to meet your needs.

Have you heard of multicorn? Plugin authors can write a wrapper that
spits out JSON or whatever other thing they like, which can be
consumed by non C-hackers.

> Now, some users are still going to head for the hills.  But at least
> from where I sit it sounds a hell of a lot better than the first
> answer.  We're not going to solve all of the tooling problems around
> this technology in one release, for sure.  But as far as 95% of our
> users are concerned, a C API might as well not exist at all.  People
> WILL try to machine parse the output of whatever demo plugins we
> provide; so I think we should try hard to provide at least one such
> plugin that is designed to make that as easy as possible.

I agree that this is important, but I wouldn't like to weigh it too heavily.

-- 
Peter Geoghegan



Re: logical changeset generation v6.4

From
Robert Haas
Date:
On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Attached you can find version 6.4 of the patchset:

So I'm still unhappy with the arbitrary logic in what's now patch 1
for choosing the candidate key.  On another thread, someone mentioned
that they might want the entire old tuple, and that got me thinking:
there's no particular reason why the user has to want exactly the
columns that exist in some unique, immediate, non-partial index (what
a name).  So I have two proposals:

1. Instead of allowing the user to choose the index to be used, or
picking it for them, how about if we let them choose the old-tuple
columns they want logged?  This could be a per-column option.  If the
primary key can be assumed known and unchanging, then the answer might
be that the user wants *no* old-tuple columns logged.  Contrariwise
someone might want everything logged, or anything in the middle.

2. If that seems too complicated, how about just logging the whole old
tuple for version 1?

I'm basically fine with the rest of what's in the first two patches,
but we need to sort out some kind of consensus on this issue.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.4

From
Merlin Moncure
Date:
On Fri, Oct 18, 2013 at 7:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Attached you can find version 6.4 of the patchset:
>
> So I'm still unhappy with the arbitrary logic in what's now patch 1
> for choosing the candidate key.  On another thread, someone mentioned
> that they might want the entire old tuple, and that got me thinking:
> there's no particular reason why the user has to want exactly the
> columns that exist in some unique, immediate, non-partial index (what
> a name).  So I have two proposals:

Aside: what's an immediate index?  Is this speaking to the constraint?
(immediate vs deferred?)

merlin



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
> > I thought and implemented that in the beginning. Unfortunately it's not
> > enough :(. That's probably the issue that took me longest to understand
> > in this patchseries...
> >
> > Combocids can only fix the case where a transaction actually has create
> > a combocid:
> >
> > 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
> > 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)
> >
> > So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> > we both times need access to cmin and cmax, but neither transaction will
> > have created a multixact. That can only be an issue in transaction with
> > catalog modifications.
> 
> Oh, yuck.  So that means you have to write an extra WAL record for
> EVERY heap insert, update, or delete to a catalog table?  OUCH.

So. As it turns out that solution isn't sufficient in the face of VACUUM
FULL and mixed DML/DDL transaction that have not yet been decoded.

To reiterate, as published it works like this:
For every modification of a catalog tuple (insert, multi_insert, update,
delete) that has influence over visibility, issue a record that contains:
* filenode
* ctid
* (cmin, cmax)

When doing a visibility check on a catalog row during decoding of a mixed
DML/DDL transaction, look up (cmin, cmax) for that row, since we don't
store both on the tuple itself.

That mostly works great.

The problematic scenario is decoding a transaction that has done mixed
DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
FULL obviously changes the filenode and the ctid of a tuple, so we
cannot successfully do a lookup based on what we logged before.

I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait till all changestreams
   have decoded up to the current point.
3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
   if there are live decoding slots around, instead delegate that
   responsibility to the slot management.
4) Store both (cmin, cmax) for catalog tuples.

I basically think only 1) and 4) are realistic. And 1) sucks.

I've developed a prototype for 4) and except currently being incredibly
ugly, it seems to be the most promising approach by far. My trick to
store both cmin and cmax is to store cmax in t_hoff managed space when
wal_level = logical.
That even works when changing wal_level from < logical to logical,
because we only ever need to store both cmin and cmax for transactions
that have decodeable content - which they cannot yet have before
wal_level = logical.

This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen
  HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space
  to t_hoff. I've added TupleDesc->tdhaswidecid for it.
* Fiddling with existing checks for HEAP_MOVED{,OFF,IN} to check for
  both set at the same time.
* Changing the WAL logging to (optionally?) transport the current
  CommandId instead of always resetting it to InvalidCommandId.

The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic, everything is stored in the row itself.
  No hash table or anything like that needs to be built.
* No need to log (relfilenode, cmin, cmax) separately from the heap
  changes themselves anymore.

In the end, the costs are that individual catalog rows are 4 bytes
bigger iff wal_level = logical. That seems acceptable.

Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it
  conditional. We have plenty of flag space to signal whether it's
  present, but it's just 4 bytes.

Comments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
> On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Attached you can find version 6.4 of the patchset:
> 
> So I'm still unhappy with the arbitrary logic in what's now patch 1
> for choosing the candidate key.  On another thread, someone mentioned
> that they might want the entire old tuple, and that got me thinking:
> there's no particular reason why the user has to want exactly the
> columns that exist in some unique, immediate, non-partial index (what
> a name).  So I have two proposals:

> 1. Instead of allowing the user to choose the index to be used, or
> picking it for them, how about if we let them choose the old-tuple
> columns they want logged?  This could be a per-column option.  If the
> primary key can be assumed known and unchanging, then the answer might
> be that the user wants *no* old-tuple columns logged.  Contrariwise
> someone might want everything logged, or anything in the middle.

I definitely can see the use case for logging anything or nothing;
arbitrary column selection seems too complicated for now.

> 2. If that seems too complicated, how about just logging the whole old
> tuple for version 1?

I think that'd make the patch much less useful because it bloats WAL
unnecessarily for the primary user (replication) of it. I'd rather go
for primary keys only if that proves to be the contentious point.

How about modifying the selection to go from:
* all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
* index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
* [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
* primary key
* candidate key with the smallest oid

Including the candidate key will help people using changeset extraction
for auditing that do not have a primary key. That really isn't an
infrequent use case.

I've chosen REPLICA IDENTITY, NOTHING and FULL because those are all
existing keywords, and afaics shouldn't generate any conflicts. On a
green field we would probably name them differently, but ...

Comments?

Greetings,

Andres Freund

PS: candidate key implies a key which is: immediate (aka not deferred),
unique, non-partial and only contains NOT NULL columns.

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I know of the following solutions:
> 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
> 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
>    have decoded up to the current point.
> 3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
>    if there are life decoding slots around, instead delegate that
>    responsibility to the slot management.
> 4) Store both (cmin, cmax) for catalog tuples.
>
> I bascially think only 1) and 4) are realistic. And 1) sucks.
>
> I've developed a prototype for 4) and except currently being incredibly
> ugly, it seems to be the most promising approach by far. My trick to
> store both cmin and cmax is to store cmax in t_hoff managed space when
> wal_level = logical.

In my opinion, (4) is too ugly to consider.  I think that if we start
playing games like this, we're opening up the doors to lots of subtle
bugs and future architectural pain that will be with us for many, many
years to come.  I believe we will bitterly regret any foray into this
area.

It has long seemed to me to be a shame that we don't have some system
for allowing old relfilenodes to stick around until they are no longer
in use.  If we had that, we might be able to allow utilities like
CLUSTER or VACUUM FULL to permit concurrent read access to the table.
I realize that what people really want is to let those things run
while allowing concurrent *write* access to the table, but a bird in
the hand is worth two in the bush.  What we're really talking about
here is applying MVCC to filesystem actions: instead of removing the
old relfilenode(s) immediately, we do it when they're no longer
referenced by anyone, just as we don't remove old tuples immediately,
but rather when they are no longer referenced by anyone.  The details
are tricky, though: we can allow write access to the *new* heap just
as soon as the rewrite is finished, but anyone who is still looking at
the *old* heap can't ever upgrade their AccessShareLock to anything
higher, or hilarity will ensue.  Also, if they lock some *other*
relation and AcceptInvalidationMessages(), their relcache entry for
the rewritten relation will get rebuilt, and that's bound to work out
poorly.  The net-net here is that I think (3) is an attractive
solution, but I don't know that we can make it work in a reasonable
amount of time.

I don't think I understand exactly what you have in mind for (2); can
you elaborate?  I have always thought that having a
WaitForDecodingToCatchUp() primitive was a good way of handling
changes that were otherwise too difficult to track our way through.  I
am not sure you're doing that at all right now, which in some sense I
guess is fine, but I haven't really understood your aversion to this
solution.  There are some locking issues to be worked out here, but
the problems don't seem altogether intractable.

(1) is basically deciding not to fix the problem.  I don't think
that's acceptable.

I don't have another idea right at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.4

From
Robert Haas
Date:
On Fri, Oct 18, 2013 at 2:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> How about modifying the selection to go from:
> * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
> * primary key
> * candidate key with the smallest oid
>
> Including the candidate key will help people using changeset extration
> for auditing that do not have primary key. That really isn't an
> infrequent usecase.
>
> I've chosen REPLICA IDENTITY; NOTHIN; FULL; because those are all
> existing keywords, and afaics shouldn't generate any conflicts. On a
> green field we probably name them differently, but ...

I'm really pretty much dead set against the "candidate key with the
smallest OID" proposal.  I think that's just plain old bad idea.  It's
just magical behavior which will result in users being surprised and
unhappy.  I don't think there's really a problem with saying, hey, if
you configure changeset extraction and you don't configure a replica
identity, then you don't get any columns from the old tuple.  If you
don't like that, change the configuration.  It's always nice to spare
users unnecessary configuration, of course, but trying to make things
simpler than they really are tends to hurt more than it helps.

On the naming, I find REPLICA IDENTITY to be pretty good.  We already
have places where we're using the REPLICA keyword to indicate core
support intended to dovetail with external replication solutions, and
this seems to fit that paradigm nicely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-21 09:40:13 -0400, Robert Haas wrote:
> On Fri, Oct 18, 2013 at 2:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > How about modifying the selection to go from:
> > * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> > * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> > * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
> > * primary key
> > * candidate key with the smallest oid
> >
> > Including the candidate key will help people using changeset extration
> > for auditing that do not have primary key. That really isn't an
> > infrequent usecase.
> >
> > I've chosen REPLICA IDENTITY; NOTHIN; FULL; because those are all
> > existing keywords, and afaics shouldn't generate any conflicts. On a
> > green field we probably name them differently, but ...
> 
> I'm really pretty much dead set against the "candidate key with the
> smallest OID" proposal.  I think that's just plain old bad idea.  It's
> just magical behavior which will result in users being surprised and
> unhappy.  I don't think there's really a problem with saying, hey, if
> you configure changeset extraction and you don't configure a replica
> identity, then you don't get any columns from the old tuple.

I have a hard time understanding why you dislike it so much. Think of a
big schema where you want to add auditing via changeset
extraction. Because of problems with reindexing the primary key you've
just used candidate keys so far. Why should you go through each of a
couple of hundred tables and explicitly choose an index when you just
want an identifier of changed rows?
By nature of it being a candidate key it is *guaranteed* to uniquely
identify a row. And you can make the output plugin give you the used
columns/the index name without a problem.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-21 09:32:12 -0400, Robert Haas wrote:
> On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I know of the following solutions:
> > 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
> > 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
> >    have decoded up to the current point.
> > 3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
> >    if there are life decoding slots around, instead delegate that
> >    responsibility to the slot management.
> > 4) Store both (cmin, cmax) for catalog tuples.
> >
> > I bascially think only 1) and 4) are realistic. And 1) sucks.
> >
> > I've developed a prototype for 4) and except currently being incredibly
> > ugly, it seems to be the most promising approach by far. My trick to
> > store both cmin and cmax is to store cmax in t_hoff managed space when
> > wal_level = logical.
> 
> In my opinion, (4) is too ugly to consider.  I think that if we start
> playing games like this, we're opening up the doors to lots of subtle
> bugs and future architectural pain that will be with us for many, many
> years to come.  I believe we will bitterly regret any foray into this
> area.

Hm. After looking at the required code - which you obviously cannot have
yet - it's not actually too bad. Will post a patch implementing it later.

I don't really buy the architectural argument since originally cmin/cmax
*were* both stored. It's not something we're just inventing now. We just
optimized that away but now have discovered that's not always a good
idea and thus don't always use the optimization.

The actual decoding code shrinks by about 200 lines using this logic,
which is a hint that it's not a bad idea.

> It has long seemed to me to be a shame that we don't have some system
> for allowing old relfilenodes to stick around until they are no longer
> in use.  If we had that, we might be able to allow utilities like
> CLUSTER or VACUUM FULL to permit concurrent read access to the table.
> I realize that what people really want is to let those things run
> while allowing concurrent *write* access to the table, but a bird in
> the hand is worth two in the bush.  What we're really talking about
> here is applying MVCC to filesystem actions: instead of removing the
> old relfilenode(s) immediately, we do it when they're no longer
> referenced by anyone, just as we don't remove old tuples immediately,
> but rather when they are no longer referenced by anyone.  The details
> are tricky, though: we can allow write access to the *new* heap just
> as soon as the rewrite is finished, but anyone who is still looking at
> the *old* heap can't ever upgrade their AccessShareLock to anything
> higher, or hilarity will ensue.  Also, if they lock some *other*
> relation and AcceptInvalidationMessages(), their relcache entry for
> the rewritten relation will get rebuilt, and that's bound to work out
> poorly.  The net-net here is that I think (3) is an attractive
> solution, but I don't know that we can make it work in a reasonable
> amount of time.

I've looked at it before, and I honestly don't have a real clue how to
do it robustly.

> I don't think I understand exactly what you have in mind for (2); can
> you elaborate?  I have always thought that having a
> WaitForDecodingToCatchUp() primitive was a good way of handling
> changes that were otherwise too difficult to track our way through.  I
> am not sure you're doing that at all right now, which in some sense I
> guess is fine, but I haven't really understood your aversion to this
> solution.  There are some locking issues to be worked out here, but
> the problems don't seem altogether intractable.

So, what we need to do for rewriting catalog tables would be:
1) lock table against writes
2) wait for all in-progress xacts to finish, they could have modified
   the table in question (we don't keep locks on system tables)
3) acquire xlog insert pointer
4) wait for all logical decoding actions to read past that pointer
5) upgrade the lock to an access exclusive one
6) perform vacuum full as usual

The lock upgrade hazards in here are the reason I am averse to the
solution. And I don't see how we can avoid them, since in order for
decoding to catch up it has to be able to read from the
catalog... Otherwise it's easy enough to implement.

> (1) is basically deciding not to fix the problem.  I don't think
> that's acceptable.

I'd like to argue against this, but unfortunately I agree.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Hannu Krosing
Date:
On 10/18/2013 08:50 PM, Andres Freund wrote:
> On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
...
>> 2. If that seems too complicated, how about just logging the whole old
>> tuple for version 1?
> I think that'd make the patch much less useful because it bloats WAL
> unnecessarily for the primary user (replication) of it. I'd rather go
> for primary keys only if that proves to be the contentious point.
>
> How about modifying the selection to go from:
> * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
> * primary key
> * candidate key with the smallest oid
>
> Including the candidate key will help people using changeset extration
> for auditing that do not have primary key. That really isn't an
> infrequent usecase.
As I understand it, for a table with *no* unique index
the "candidate key" is the full tuple, so if we get an UPDATE for
it then this should be replicated as
"UPDATE the first row matching (NOT DISTINCT FROM) all columns",
which on the replay side will be equivalent to
DECLARE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT OF ...

I know that this will slow down replication, as you can not use direct
index updates internally - at least not easily - but need to let PostgreSQL
actually plan this, but such a single-row update is no faster on the origin
either.

Of course when it is a full-table update on a table with no
indexes, then doing the same one tuple at a time is really slow.



-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-21 16:40:43 +0200, Hannu Krosing wrote:
> On 10/18/2013 08:50 PM, Andres Freund wrote:
> > On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
> ...
> >> 2. If that seems too complicated, how about just logging the whole old
> >> tuple for version 1?
> > I think that'd make the patch much less useful because it bloats WAL
> > unnecessarily for the primary user (replication) of it. I'd rather go
> > for primary keys only if that proves to be the contentious point.
> >
> > How about modifying the selection to go from:
> > * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> > * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> > * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
> > * primary key
> > * candidate key with the smallest oid
> >
> > Including the candidate key will help people using changeset extration
> > for auditing that do not have primary key. That really isn't an
> > infrequent usecase.

> As I understand it for a table with *no* unique index,
> the "candidate key" is the full tuple, so if we get an UPDATE for
> it then this should be replicated as
> "UPDATE first row matching (NOT DISTINCT FROM) all columns"
> which on replay side will be equivalent to
> CREATE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT...'

No, it's not a candidate key since it's not uniquely identifying a
row. You can play tricks as you describe, but that still doesn't make
the whole row a candidate key.

But anyway, I suggest allowing logging of all columns above...

> I know that this will slow down replication, as you can not use direct
> index updates internally - at least not easily - but need to let postgreSQL
> actually plan this, but such single row update is no faster on origin
> either.

That's not actually true. Consider somebody doing something like:
UPDATE big_table_without_indexes SET column = ...;
On the source side that's essentially O(n). If you replicate on a
row-by-row basis it will be O(n^2) on the replay side.
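
To make that concrete (hypothetical column names; each replayed
statement has to seqscan the table because there is no index to use):

-- one statement, one sequential pass on the origin:
UPDATE big_table_without_indexes SET col = col + 1;
-- replayed row by row on the replay side, n statements, each O(n):
UPDATE big_table_without_indexes SET col = 2
    WHERE (id, col) IS NOT DISTINCT FROM (1, 1);
UPDATE big_table_without_indexes SET col = 3
    WHERE (id, col) IS NOT DISTINCT FROM (2, 2);
-- ... and so on, hence O(n^2) overall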

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Robert Haas
Date:
On Mon, Oct 21, 2013 at 9:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I have a hard time to understand why you dislike it so much. Think of a
> big schema where you want to add auditing via changeset
> extraction. Because of problems with reindexing primary key you've just
> used candidate keys so far. Why should you go through each of a couple
> of hundred tables and explicitly choose an index when you just want an
> identifier of changed rows?
> By nature of it being a candidate key it is *guaranteed* to uniquely
> identify a row? And you can make the output plugin give you the used
> columns/the indexname without a problem.

Sure, well, if a particular user wants to choose candidate keys
essentially at random from among the unique indexes present, there's
nothing to prevent them from writing a script to do that.  But
assuming that one unique index is just as good as another is just
wrong.  If you pick a "candidate key" that doesn't actually represent
the users' notion of row identity, then your audit log will be
thoroughly useless, even if it does uniquely identify the rows
involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-21 11:14:37 -0400, Robert Haas wrote:
> On Mon, Oct 21, 2013 at 9:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I have a hard time to understand why you dislike it so much. Think of a
> > big schema where you want to add auditing via changeset
> > extraction. Because of problems with reindexing primary key you've just
> > used candidate keys so far. Why should you go through each of a couple
> > of hundred tables and explicitly choose an index when you just want an
> > identifier of changed rows?
> > By nature of it being a candidate key it is *guaranteed* to uniquely
> > identify a row? And you can make the output plugin give you the used
> > columns/the indexname without a problem.
> 
> Sure, well, if a particular user wants to choose candidate keys
> essentially at random from among the unique indexes present, there's
> nothing to prevent them from writing a script to do that.  But
> assuming that one unique index is just as good as another is just
> wrong.  If you pick a "candidate key" that doesn't actually represent
> the users' notion of row identity, then your audit log will be
> thoroughly useless, even if it does uniquely identify the rows
> involved.

Why? If the columns are specified in the log, by definition the values
will be sufficient to identify a row. Even if a "nicer" key might exist.

Since I seemingly can't convince you, I'll modify things that way for
now as it can easily be changed later, but I still don't see the
problem.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Hannu Krosing
Date:
On 10/21/2013 05:06 PM, Andres Freund wrote:
> On 2013-10-21 16:40:43 +0200, Hannu Krosing wrote:
>> On 10/18/2013 08:50 PM, Andres Freund wrote:
>>> On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
>> ...
>>>> 2. If that seems too complicated, how about just logging the whole old
>>>> tuple for version 1?
>>> I think that'd make the patch much less useful because it bloats WAL
>>> unnecessarily for the primary user (replication) of it. I'd rather go
>>> for primary keys only if that proves to be the contentious point.
>>>
>>> How about modifying the selection to go from:
>>> * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
>>> * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
>>> * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
>>> * primary key
>>> * candidate key with the smallest oid
>>>
>>> Including the candidate key will help people using changeset extraction
>>> for auditing that do not have primary key. That really isn't an
>>> infrequent usecase.
>> As I understand it for a table with *no* unique index,
>> the "candidate key" is the full tuple, so if we get an UPDATE for
>> it then this should be replicated as
>> "UPDATE first row matching (NOT DISTINCT FROM) all columns"
>> which on replay side will be equivalent to
>> CREATE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT...'
> No, it's not a candidate key since it's not uniquely identifying a
> row. You can play tricks as you describe, but that still doesn't make
> the whole row a candidate key.
>
> But anyway,  I suggest allowing for logging all columns above...
Is this the "all columns" option?

How about modifying the selection to go from:
* all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;

for some reason I thought it to be an option to either log or not log the PK columns ...

>
>> I know that this will slow down replication, as you can not use direct
>> index updates internally - at least not easily - but need to let postgreSQL
>> actually plan this, but such single row update is no faster on origin
>> either.
> That's not actually true. Consider somebody doing something like:
> UPDATE big_table_without_indexes SET column = ...;
> On the source side that's essentialy O(n). If you replicate on a
> row-by-row basis it will be O(n^2) on the replay side.
Probably more like O(n^2 / 2), but yes, this is what I meant with the
sentence after that ;)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-21 16:15:58 +0200, Andres Freund wrote:
> On 2013-10-21 09:32:12 -0400, Robert Haas wrote:
> > On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > I know of the following solutions:
> > > 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
> > > 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
> > >    have decoded up to the current point.
> > > 3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
> > >    if there are live decoding slots around, instead delegate that
> > >    responsibility to the slot management.
> > > 4) Store both (cmin, cmax) for catalog tuples.
> > >
> > > I basically think only 1) and 4) are realistic. And 1) sucks.
> > >
> > > I've developed a prototype for 4) and except currently being incredibly
> > > ugly, it seems to be the most promising approach by far. My trick to
> > > store both cmin and cmax is to store cmax in t_hoff managed space when
> > > wal_level = logical.
> >
> > In my opinion, (4) is too ugly to consider.  I think that if we start
> > playing games like this, we're opening up the doors to lots of subtle
> > bugs and future architectural pain that will be with us for many, many
> > years to come.  I believe we will bitterly regret any foray into this
> > area.
>
> Hm. After looking at the required code - which you obviously cannot have
> yet - it's not actually too bad. Will post a patch implementing it later.
>
> I don't really buy the architectural argument since originally cmin/cmax
> *were* both stored. It's not something we're just inventing now. We just
> optimized that away but now have discovered that's not always a good
> idea and thus don't always use the optimization.
>
> The actual decoding code shrinks by about 200 lines using this logic
> which is a hint that it's not a bad idea.

So, here's a preliminary patch to see how this would look. It'd be great
if you'd comment if you still think it's a complete no-go.

If it were for real, it'd need to be split and some minor things would
need to get adjusted, but I think it's easier to review it seeing both
sides at once.

Greetings,

Andres Freund

PS: The patch is on top of a new git push, but for review that shouldn't
matter.

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
> How about modifying the selection to go from:
> * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)

Current draft is:
ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;

which leaves the door open for

ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';

Does anybody have a strong feeling about requiring support for CREATE
TABLE for this?
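
For illustration, usage of the draft syntax would look like this (table
and index names made up):

ALTER TABLE orders REPLICA IDENTITY DEFAULT;
ALTER TABLE orders REPLICA IDENTITY FULL;
ALTER TABLE orders REPLICA IDENTITY NOTHING;
ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_code_key;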

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > In my opinion, (4) is too ugly to consider.  I think that if we start
>> > playing games like this, we're opening up the doors to lots of subtle
>> > bugs and future architectural pain that will be with us for many, many
>> > years to come.  I believe we will bitterly regret any foray into this
>> > area.
>>
>> Hm. After looking at the required code - which you obviously cannot have
>> yet - it's not actually too bad. Will post a patch implementing it later.
>>
>> I don't really buy the architectural argument since originally cmin/cmax
>> *were* both stored. It's not something we're just inventing now. We just
>> optimized that away but now have discovered that's not always a good
>> idea and thus don't always use the optimization.
>>
>> The actual decoding code shrinks by about 200 lines using this logic
>> which is a hint that it's not a bad idea.
>
> So, here's a preliminary patch to see how this would look. It'd be great
> of you comment if you still think it's a complete no-go.
>
> If it were for real, it'd need to be split and some minor things would
> need to get adjusted, but I think it's easier to review it seing both
> sides at once.

I think it's a complete no-go.  Consider, e.g., the comment for
MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
update the comment and the value, I'm not inspired by the idea of
subtracting 32 from that number; even if it weren't already too small,
it would break pg_upgrade for any users who are on the edge.  Things
aren't looking too good for anything that uses HeapTupleFields,
either; consider rewrite_heap_tuple().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-21 14:22:17 -0400, Robert Haas wrote:
> On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > In my opinion, (4) is too ugly to consider.  I think that if we start
> >> > playing games like this, we're opening up the doors to lots of subtle
> >> > bugs and future architectural pain that will be with us for many, many
> >> > years to come.  I believe we will bitterly regret any foray into this
> >> > area.
> >>
> >> Hm. After looking at the required code - which you obviously cannot have
> >> yet - it's not actually too bad. Will post a patch implementing it later.
> >>
> >> I don't really buy the architectural argument since originally cmin/cmax
> >> *were* both stored. It's not something we're just inventing now. We just
> >> optimized that away but now have discovered that's not always a good
> >> idea and thus don't always use the optimization.
> >>
> >> The actual decoding code shrinks by about 200 lines using this logic
> >> which is a hint that it's not a bad idea.
> >
> > So, here's a preliminary patch to see how this would look. It'd be great
> > of you comment if you still think it's a complete no-go.
> >
> > If it were for real, it'd need to be split and some minor things would
> > need to get adjusted, but I think it's easier to review it seing both
> > sides at once.
> 
> I think it's a complete no-go.  Consider, e.g., the comment for
> MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
> update the comment and the value, I'm not inspired by the idea of
> subtracting 32 from that number; even if it weren't already too small,
> it would break pg_upgrade for any users who are on the edge.

Well, we only need to support it for (user_)catalog tables. So
pg_upgrade isn't a problem. And I don't really see a problem restricting
the number of columns for such tables.

> Things
> aren't looking too good for anything that uses HeapTupleFields,
> either; consider rewrite_heap_tuple().

Well, that currently works, by copying cmax. Since rewriting triggered
the change, I am pretty sure I've actually tested & hit that path...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I think it's a complete no-go.  Consider, e.g., the comment for
>> MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
>> update the comment and the value, I'm not inspired by the idea of
>> subtracting 32 from that number; even if it weren't already too small,
>> it would break pg_upgrade for any users who are on the edge.
>
> Well, we only need to support it for (user_)catalog tables. So
> pg_upgrade isn't a problem. And I don't really see a problem restricting
> the number of columns for such tables.

Inch by inch, this patch set is trying to make catalog tables more and
more different from regular tables.  I think that's a direction we're
going to regret.  I can almost believe that no great harm will come to
us from giving the two different xmin horizons, as previously
discussed, though I'm still not 100% happy about that.  Can't both
have something be a catalog table AND replicate it?  Ick, but OK.

But changing the on disk format strikes me as crossing some sort of
bright line.  That means that you're going to have two different code
paths in a lot of important cases, one for catalog tables and one for
non-catalog tables, and the one that's only taken for catalog tables
will be rather lightly traveled.  And then you've got user catalog
tables, and the possibility that something that wasn't formerly a user
catalog table might become one, or vice versa.  Even if you can flush
out every bug that exists today, this is a recipe for future bugs.

>> Things
>> aren't looking too good for anything that uses HeapTupleFields,
>> either; consider rewrite_heap_tuple().
>
> Well, that currently works, by copying cmax. Since rewriting triggered
> the change, I am pretty sure I've actually tested & hit that path...

No offense, but that's a laughable statement.  If that path works,
it's mostly if not entirely by accident.  You've fundamentally changed
the heap tuple format, and that code doesn't know about it, even
though it's deeply in bed with assumptions about the old format.  I
think this is a pretty clear indication as to what's wrong with this
approach: a lot of stuff will not care, but the stuff that does care
will be hard to find, and future incremental modifications either to
that code or to the hidden data before the t_hoff pointer could break
stuff that formerly worked.  We rejected early drafts of sepgsql RLS
cold because they changed the tuple format, and I don't see why we
shouldn't do exactly the same thing here.

But just suppose for a minute that we'd accepted that proposal and
then took this one, too.  And let's suppose we also accept the next
proposal that, like that one and this one, jams something more into
the heap tuple header.  At that point you'll have potentially as many
as 8 different maximum-number-of-attributes values for tuples, though
maybe not quite that many in practice if not all of those features can
be used together.  The macros that are needed to extract the various
values from the heap tuple will be nightmarishly complex, and we'll
have eaten up all (or more than all) of our remaining bit-space in the
infomask.  Maybe all of that sounds all right to you, but to me it
sounds like a big mess.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-21 14:50:54 -0400, Robert Haas wrote:
> On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I think it's a complete no-go.  Consider, e.g., the comment for
> >> MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
> >> update the comment and the value, I'm not inspired by the idea of
> >> subtracting 32 from that number; even if it weren't already too small,
> >> it would break pg_upgrade for any users who are on the edge.
> >
> > Well, we only need to support it for (user_)catalog tables. So
> > pg_upgrade isn't a problem. And I don't really see a problem restricting
> > the number of columns for such tables.
> 
> Inch by inch, this patch set is trying to make catalog tables more and
> more different from regular tables.  I think that's a direction we're
> going to regret.

I can understand that.

> Can't both have something be a catalog table AND replicate it?  Ick,
> but OK.

User catalog tables are replicated normally.

If we want, we can replicate system catalog tables as well; it's
about a day's worth of work. I just don't see what the use case would
be, and I am afraid it'd be used as a poor man's system table trigger.

> But changing the on disk format strikes me as crossing some sort of
> bright line.  That means that you're going to have two different code
> paths in a lot of important cases, one for catalog tables and one for
> non-catalog tables, and the one that's only taken for catalog tables
> will be rather lightly traveled.

But there really isn't that much difference in the code paths, is there?
If you look at it, minus the mechanical changes around bitmasks, it's
really mostly just a couple of added
HeapTupleHeaderSetCmin/HeapTupleHeaderSetCmax calls plus WAL-logging the
current cid.

> And then you've got user catalog tables, and the possibility that
> something that wasn't formerly a user catalog table might become one,
> or vice versa.  Even if you can flush

That's not really a problem - we only need cmin/cmax for catalog tables
when decoding DML that happened in the same transaction as DDL. The wide
cids could easily be removed when we freeze tuples or such.
Even something like:
BEGIN;
CREATE TABLE catalog_t(...);
INSERT INTO catalog_t ...;
UPDATE catalog_t ...
ALTER TABLE catalog_t SET (user_catalog_table = true);
INSERT INTO decoded_table (..);
INSERT INTO decoded_table (..);
UPDATE catalog_t SET ...;
INSERT INTO decoded_table (..);
COMMIT;
will work correctly.

> Even if you can flush out every bug that exists today, this is a
> recipe for future bugs.

I don't really foresee many new codepaths that care about the visibility
bits in tuple headers themselves. When was the last one of those
added/changed? I'd bet it was 9.0 with the vacuum full/cluster merge,
and three lines to be added are surely the smallest problem of that.

> >> Things
> >> aren't looking too good for anything that uses HeapTupleFields,
> >> either; consider rewrite_heap_tuple().
> >
> > Well, that currently works, by copying cmax. Since rewriting triggered
> > the change, I am pretty sure I've actually tested & hit that path...
> 
> No offense, but that's a laughable statement.  If that path works,
> it's mostly if not entirely by accident.  You've fundamentally changed
> the heap tuple format, and that code doesn't know about it, even
> though it's deeply in bed with assumptions about the old format.

Hm? rewrite_heap_tuple() grew code to handle that case, it's not an
accident that it works.
if (HeapTupleHeaderHasWideCid(old_tuple->t_data) &&
    HeapTupleHeaderHasWideCid(new_tuple->t_data))
{
    HeapTupleHeaderSetCmin(new_tuple->t_data,
                           HeapTupleHeaderGetCmin(old_tuple->t_data));
    HeapTupleHeaderSetCmax(new_tuple->t_data,
                           HeapTupleHeaderGetCmax(old_tuple->t_data),
                           false);
}


Note that code handling HeapTupleHeaders outside of heapam.c, tqual.c et
al. shouldn't need to care about the changes at all.
And all the code that needs to care *already* has special-cased code
around visibility. It's relatively easy to find by grepping for
HEAP_XACT_MASK. We've so far accepted that several places need to change
if we change the visibility rules. And I don't see how we easily could
get away from that.
Patches like e.g. lsn-ranges freezing will require *gobs* more
widespread changes. And have much higher chances of breaking stuff
imo. And it's still worthwhile to do them.

> I
> think this is a pretty clear indication as to what's wrong with this
> approach: a lot of stuff will not care, but the stuff that does care
> will be hard to find, and future incremental modifications either to
> that code or to the hidden data before the t_hoff pointer could break
> stuff that formerly worked.  We rejected early drafts of sepgsql RLS
> cold because they changed the tuple format, and I don't see why we
> shouldn't do exactly the same thing here.

Ok, I have a hard time arguing against that. I either skipped or forgot
that discussion.
I quickly searched and looked at the patch at
497D7055.9090806@ak.jp.nec.com. From a quick look that would have grown
much more invasive - and lots more places would have had to care. For
the proposed patch it really is only
heap_(insert|multi_insert|update|delete|rewrite_tuple) that need to
care. We could remove the tupledesc changes if we don't care about some
added malloc/memcpy'ing during DDL; the code to handle tuples without
that space is already there.

> But just suppose for a minute that we'd accepted that proposal and
> then took this one, too.  And let's suppose we also accept the next
> proposal that, like that one and this one, jams something more into
> the heap tuple header.  At that point you'll have potentially as many
> as 8 different maximum-number-of-attributes values for tuples, though
> maybe not quite that many in practice if not all of those features can
> be used together.

Well, I can understand that argument. Unfortunately. But I still claim
that this is scaling back an optimization that we've previously applied
more widely.

Note also that we actually have quite some slop in
MaxTupleAttributeNumber and MaxHeapAttributeNumber (which is the one
that matters here). They e.g. haven't been adjusted when 8.3 decided
*not* to store both cmin and cmax anymore and they *already* had slop
before. So I am not convinced that we actually need to change our
current limits.

> The macros that are needed to extract the various
> values from the heap tuple will be nightmarishly complex, and we'll
> have eaten up all (or more than all) of our remaining bit-space in the
> infomask.

Why would the macros for extracting values from heap tuples need to
change? Those shouldn't be affected by what I proposed?


I think this approach would have lower maintenance overhead in
comparison to the previous solution of Handling CommandIds because it's
actually much simpler. But I am definitely ready to try other ideas.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
Hi,


On 2013-10-21 21:36:02 +0200, Andres Freund wrote:
> I think this approach would have lower maintenance overhead in
> comparison to the previous solution of Handling CommandIds because it's
> actually much simpler.

Btw, I think the new approach would allow for *easier* modifications
about future code caring about visibility. Since the physical location
doesn't matter anymore it'd be much more friendly towards things like
an online compacting VACUUM and such.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Andres Freund
Date:
On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
> On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
> > How about modifying the selection to go from:
> > * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
> > * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
> > * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
>
> Current draft is:
> ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
> ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;
>
> which leaves the door open for
>
> ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';
>
> Does anybody have a strong feeling about requiring support for CREATE
> TABLE for this?

Attached is a patch on top of master implementing this syntax. It's not
wired up to the changeset extraction patch yet as I am not sure whether
others agree about the storage.

pg_class grew a 'relreplident' char, storing:
* 'd' default
* 'n' nothing
* 'f' full
* 'i' index with indisreplident set, or default if index has been
      dropped
pg_index grew a 'indisreplident' bool indicating it is set as the
replica identity for a replident = 'i' relation.
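
For reference, a hypothetical query against the proposed columns to see
what a table's replica identity currently is (table name made up):

SELECT c.relname, c.relreplident, i.indexrelid::regclass AS replident_index
FROM pg_class c
LEFT JOIN pg_index i ON i.indrelid = c.oid AND i.indisreplident
WHERE c.relname = 'some_table';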

Both changes shouldn't change the width of the affected relations, they
should reuse existing padding.

Does somebody prefer a different storage?

pg_dump support, psql completion, regression tests and minimal docs
included.

I am not 100% clear what the best way to handle
ALTER TABLE some_table REPLICA IDENTITY USING INDEX someindex;
DROP INDEX someindex;
is. Currently that's supposed to have the same effect as having
relreplident = 'd'.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> So. As it turns out that solution isn't sufficient in the face of VACUUM
> FULL and mixed DML/DDL transaction that have not yet been decoded.
>
> To reiterate, as published it works like:
> For every modification of catalog tuple (insert, multi_insert, update,
> delete) that has influence over visibility issue a record that contains:
> * filenode
> * ctid
> * (cmin, cmax)
>
> When doing a visibility check on a catalog row during decoding of mixed
> DML/DDL transaction lookup (cmin, cmax) for that row since we don't
> store both for the tuple.
>
> That mostly works great.
>
> The problematic scenario is decoding a transaction that has done mixed
> DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
> FULL obviously changes the filenode and the ctid of a tuple, so we
> cannot successfully do a lookup based on what we logged before.

So I have a new idea for handling this problem, which seems obvious in
retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
CTID -> new CTID mappings?  This would only need to be done for
catalog tables, and maybe could be skipped for tuples whose XIDs are
old enough that we know those transactions must already be decoded.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.4

From
Robert Haas
Date:
On Tue, Oct 22, 2013 at 10:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
>> On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
>> > How about modifying the selection to go from:
>> > * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
>> > * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
>> > * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
>>
>> Current draft is:
>> ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
>> ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;
>>
>> which leaves the door open for
>>
>> ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';
>>
>> Does anybody have a strong feeling about requiring support for CREATE
>> TABLE for this?
>
> Attached is a patch ontop of master implementing this syntax. It's not
> wired up to the changeset extraction patch yet as I am not sure whether
> others agree about the storage.
>
> pg_class grew a 'relreplident' char, storing:
> * 'd' default
> * 'n' nothing
> * 'f' full
> * 'i' index with indisreplident set, or default if index has been
>       dropped
> pg_index grew a 'indisreplident' bool indicating it is set as the
> replica identity for a replident = 'i' relation.
>
> Both changes shouldn't change the width of the affected relations, they
> should reuse existing padding.
>
> Does somebody prefer a different storage?

I had imagined that the storage might consist simply of a pg_attribute
boolean.  So full would turn them all on, null would turn them all off,
etc.  But that does make it hard to implement the "whatever the pkey
happens to be right now" behavior, so maybe your idea is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
> On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > So. As it turns out that solution isn't sufficient in the face of VACUUM
> > FULL and mixed DML/DDL transaction that have not yet been decoded.
> >
> > To reiterate, as published it works like:
> > For every modification of catalog tuple (insert, multi_insert, update,
> > delete) that has influence over visibility issue a record that contains:
> > * filenode
> > * ctid
> > * (cmin, cmax)
> >
> > When doing a visibility check on a catalog row during decoding of mixed
> > DML/DDL transaction lookup (cmin, cmax) for that row since we don't
> > store both for the tuple.
> >
> > That mostly works great.
> >
> > The problematic scenario is decoding a transaction that has done mixed
> > DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
> > FULL obviously changes the filenode and the ctid of a tuple, so we
> > cannot successfully do a lookup based on what we logged before.
> 
> So I have a new idea for handling this problem, which seems obvious in
> retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
> CTID -> new CTID mappings?  This would only need to be done for
> catalog tables, and maybe could be skipped for tuples whose XIDs are
> old enough that we know those transactions must already be decoded.

Ah. If it only were so simple ;). That was my first idea, and after I'd
bragged in a 2ndQ internal chat that I'd found a simple idea I
obviously had to realize it doesn't work.

Consider:
INIT_LOGICAL_REPLICATION;
CREATE TABLE foo(...);
BEGIN;
INSERT INTO foo;
ALTER TABLE foo ...;
INSERT INTO foo;
COMMIT TX 3;
VACUUM FULL pg_class;
START_LOGICAL_REPLICATION;

When we decode tx 3 we haven't yet read the mapping from the vacuum
freeze. That scenario can happen either because decoding was stopped for
a moment, or because decoding couldn't keep up (slow connection,
whatever).

There also can be nasty variations where the VACUUM FULL happens while a
transaction is writing data since we don't hold locks on system
relations for very long.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 22, 2013 at 11:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
>> On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > So. As it turns out that solution isn't sufficient in the face of VACUUM
>> > FULL and mixed DML/DDL transaction that have not yet been decoded.
>> >
>> > To reiterate, as published it works like:
>> > For every modification of catalog tuple (insert, multi_insert, update,
>> > delete) that has influence over visibility issue a record that contains:
>> > * filenode
>> > * ctid
>> > * (cmin, cmax)
>> >
>> > When doing a visibility check on a catalog row during decoding of mixed
>> > DML/DDL transaction lookup (cmin, cmax) for that row since we don't
>> > store both for the tuple.
>> >
>> > That mostly works great.
>> >
>> > The problematic scenario is decoding a transaction that has done mixed
>> > DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
>> > FULL obviously changes the filenode and the ctid of a tuple, so we
>> > cannot successfully do a lookup based on what we logged before.
>>
>> So I have a new idea for handling this problem, which seems obvious in
>> retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
>> CTID -> new CTID mappings?  This would only need to be done for
>> catalog tables, and maybe could be skipped for tuples whose XIDs are
>> old enough that we know those transactions must already be decoded.
>
> Ah. If it only were so simple ;). That was my first idea, and after I'd
> bragged in an 2ndq internal chat that I'd found a simple idea I
> obviously had to realize it doesn't work.
>
> Consider:
> INIT_LOGICAL_REPLICATION;
> CREATE TABLE foo(...);
> BEGIN;
> INSERT INTO foo;
> ALTER TABLE foo ...;
> INSERT INTO foo;
> COMMIT TX 3;
> VACUUM FULL pg_class;
> START_LOGICAL_REPLICATION;
>
> When we decode tx 3 we haven't yet read the mapping from the vacuum
> freeze. That scenario can happen either because decoding was stopped for
> a moment, or because decoding couldn't keep up (slow connection,
> whatever).

That strikes me as a flaw in the implementation rather than the idea.
You're presupposing a patch where the necessary information is
available in WAL yet you don't make use of it at the proper time.  It
seems to me that you have to think of the CTID map as tied to a
relfilenode; if you try to use one relfilenode's map with a different
relfilenode, it's obviously not going to work.  So don't do that.

/me looks innocent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
> 4) Store both (cmin, cmax) for catalog tuples.

BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.

Greetings,

Andres Freund



Re: logical changeset generation v6.2

From
Heikki Linnakangas
Date:
On 22.10.2013 19:12, Andres Freund wrote:
> On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
>> 4) Store both (cmin, cmax) for catalog tuples.
>
> BTW: That would have the nice side-effect of delivering the basis of
> what you need to do parallel sort in a transaction that previously has
> performed DDL.
>
> Currently you cannot do anything in parallel after DDL, even if you only
> scan the table in one backend, because operators et al. have to do
> catalog lookups which you can't do consistently since cmin/cmax aren't
> available in both.

Parallel workers will need cmin/cmax for user tables too, to know which 
tuples are visible to the snapshot.

- Heikki



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 19:12, Andres Freund wrote:
> >On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
> >>4) Store both (cmin, cmax) for catalog tuples.
> >
> >BTW: That would have the nice side-effect of delivering the basis of
> >what you need to do parallel sort in a transaction that previously has
> >performed DDL.
> >
> >Currently you cannot do anything in parallel after DDL, even if you only
> >scan the table in one backend, because operators et al. have to do
> >catalog lookups which you can't do consistently since cmin/cmax aren't
> >available in both.
> 
> Parallel workers will need cmin/cmax for user tables too, to know which
> tuples are visible to the snapshot.

The existing proposals were mostly about just parallelizing the sort and
similar operations, right? In such scenarios you really need it only for
the catalog.

But we could easily generalize it for user data too. We should even be
able to only use "wide cids" when a backend needs it, since
inherently it's only needed within a single transaction.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Heikki Linnakangas
Date:
On 22.10.2013 19:23, Andres Freund wrote:
> On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
>> On 22.10.2013 19:12, Andres Freund wrote:
>>> On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
>>>> 4) Store both (cmin, cmax) for catalog tuples.
>>>
>>> BTW: That would have the nice side-effect of delivering the basis of
>>> what you need to do parallel sort in a transaction that previously has
>>> performed DDL.
>>>
>>> Currently you cannot do anything in parallel after DDL, even if you only
>>> scan the table in one backend, because operators et al. have to do
>>> catalog lookups which you can't do consistently since cmin/cmax aren't
>>> available in both.
>>
>> Parallel workers will need cmin/cmax for user tables too, to know which
>> tuples are visible to the snapshot.
>
> The existing proposals were mostly about just parallelizing the sort and
> similar operations, right? In such scenarios you really need it only for
> the catalog.
>
> But we could easily generalize it for user data too. We should even be
> able to only use "wide cids" when a backend needs it, since
> inherently it's only needed within a single transaction.

Or just hand over a copy of the combocid map to the worker, along with 
the snapshot. Seems a lot simpler than this wide cids business..

- Heikki



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 22, 2013 at 12:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Or just hand over a copy of the combocid map to the worker, along with the
> snapshot. Seems a lot simpler than this wide cids business..

Yes, that's what Noah and I talked about doing.  Or possibly even
making the map into a hash table in dynamic shared memory, so that new
combo CIDs could be allocated by any backend in the parallel group.
But that seems hard, so for starters I think we'll only parallelize
read-only operations and just hand over a copy of the map.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-22 19:25:31 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 19:23, Andres Freund wrote:
> >On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
> >>On 22.10.2013 19:12, Andres Freund wrote:
> >>>On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
> >>>>4) Store both (cmin, cmax) for catalog tuples.
> >>>
> >>>BTW: That would have the nice side-effect of delivering the basis of
> >>>what you need to do parallel sort in a transaction that previously has
> >>>performed DDL.
> >>>
> >>>Currently you cannot do anything in parallel after DDL, even if you only
> >>>scan the table in one backend, because operators et al. have to do
> >>>catalog lookups which you can't do consistently since cmin/cmax aren't
> >>>available in both.
> >>
> >>Parallel workers will need cmin/cmax for user tables too, to know which
> >>tuples are visible to the snapshot.
> >
> >The existing proposals were mostly about just parallelizing the sort and
> >similar operations, right? In such scenarios you really need it only for
> >the catalog.
> >
> >But we could easily generalize it for user data too. We should even be
> >able to only use "wide cids" when a backend needs it, since
> >inherently it's only needed within a single transaction.
> 
> Or just hand over a copy of the combocid map to the worker, along with the
> snapshot. Seems a lot simpler than this wide cids business..

That's not sufficient if you want to continue writing in the primary
backend, though, which isn't an uninteresting thing.

I am not saying that parallel XXX is a sufficient reason for this, just
that it might be a co-benefit.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-22 11:59:35 -0400, Robert Haas wrote:
> >> So I have a new idea for handling this problem, which seems obvious in
> >> retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
> >> CTID -> new CTID mappings?  This would only need to be done for
> >> catalog tables, and maybe could be skipped for tuples whose XIDs are
> >> old enough that we know those transactions must already be decoded.
> >
> > Ah. If it only were so simple ;). That was my first idea, and after I'd
> > bragged in an 2ndq internal chat that I'd found a simple idea I
> > obviously had to realize it doesn't work.
> >
> > Consider:
> > INIT_LOGICAL_REPLICATION;
> > CREATE TABLE foo(...);
> > BEGIN;
> > INSERT INTO foo;
> > ALTER TABLE foo ...;
> > INSERT INTO foo;
> > COMMIT TX 3;
> > VACUUM FULL pg_class;
> > START_LOGICAL_REPLICATION;
> >
> > When we decode tx 3 we haven't yet read the mapping from the vacuum
> > freeze. That scenario can happen either because decoding was stopped for
> > a moment, or because decoding couldn't keep up (slow connection,
> > whatever).

> It seems to me that you have to think of the CTID map as tied to a
> relfilenode; if you try to use one relfilenode's map with a different
> relfilenode, it's obviously not going to work.  So don't do that.

It has to be tied to relfilenode (+ctid) *and* transaction
unfortunately.

> That strikes me as a flaw in the implementation rather than the idea.
> You're presupposing a patch where the necessary information is
> available in WAL yet you don't make use of it at the proper time.

The problem is that the mapping would be somewhere *ahead* of the
transaction/WAL we're currently decoding. We'd need to read ahead till
we find the correct one.
But I think I mainly misunderstood what you proposed. That mapping could
be written beside the relfilenode, instead of into the WAL. Then my
imagined problem doesn't exist anymore.

We only would need to write out mappings for tuples modified since the
xmin horizon, so it wouldn't even be *too* bad for bigger relations.

This won't easily work for two+ rewrites because we'd need to apply all
mappings in order and thus would have to keep a history of intermediate
nodes/mappings. But it'd be perfectly doable to simply wait till
decoders are caught up.

I still "feel" that simply storing both cmin, cmax is cleaner, but if
that's not acceptable, I can certainly live with something like this.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> It seems to me that you have to think of the CTID map as tied to a
>> relfilenode; if you try to use one relfilenode's map with a different
>> relfilenode, it's obviously not going to work.  So don't do that.
>
> It has to be tied to relfilenode (+ctid) *and* transaction
> unfortunately.

I agree that it does, but it doesn't seem particularly unfortunate to me.

>> That strikes me as a flaw in the implementation rather than the idea.
>> You're presupposing a patch where the necessary information is
>> available in WAL yet you don't make use of it at the proper time.
>
> The problem is that the mapping would be somewhere *ahead* from the
> transaction/WAL we're currently decoding. We'd need to read ahead till
> we find the correct one.

Yes, I think that's what you need to do.

> But I think I mainly misunderstood what you proposed. That mapping could
> be written besides relfilenode, instead of into the WAL. Then my
> imagined problem doesn't exist anymore.

That's pretty ugly.  I think it should be written into WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> That strikes me as a flaw in the implementation rather than the idea.
> >> You're presupposing a patch where the necessary information is
> >> available in WAL yet you don't make use of it at the proper time.
> >
> > The problem is that the mapping would be somewhere *ahead* from the
> > transaction/WAL we're currently decoding. We'd need to read ahead till
> > we find the correct one.
> 
> Yes, I think that's what you need to do.

My problem with that is that the rewrite can be gigabytes into the future.

When reading forward we could either just continue reading data into the
reorderbuffer and delay replaying all future commits till we found the
currently needed remap. That might have quite the additional
storage/memory cost, but runtime complexity should be the same as normal
decoding.
Or we could individually read ahead for every transaction. But doing so
for every transaction will get rather expensive (roughly O(amount_of_wal^2)).

I think that'd be pretty similar to just disallowing VACUUM
FREEZE/CLUSTER on catalog relations since effectively it'd be too
expensive to use.

> > But I think I mainly misunderstood what you proposed. That mapping could
> > be written besides relfilenode, instead of into the WAL. Then my
> > imagined problem doesn't exist anymore.
> 
> That's pretty ugly.  I think it should be written into WAL.

It basically has O(1) access, that's why I was thinking about it.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
>> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> That strikes me as a flaw in the implementation rather than the idea.
>> >> You're presupposing a patch where the necessary information is
>> >> available in WAL yet you don't make use of it at the proper time.
>> >
>> > The problem is that the mapping would be somewhere *ahead* from the
>> > transaction/WAL we're currently decoding. We'd need to read ahead till
>> > we find the correct one.
>>
>> Yes, I think that's what you need to do.
>
> My problem with that is that rewrite can be gigabytes into the future.
>
> When reading forward we could either just continue reading data into the
> reorderbuffer, but delay replaying all future commits till we found the
> currently needed remap. That might have quite the additional
> storage/memory cost, but runtime complexity should be the same as normal
> decoding.
> Or we could individually read ahead for every transaction. But doing so
> for every transaction will get rather expensive (rougly O(amount_of_wal^2)).

[ Sorry it's taken me a bit of time to get back to this; other tasks
intervened, and I also just needed some time to let it settle in my
brain. ]

If you read ahead looking for a set of ctid translations from
relfilenode A to relfilenode B, and along the way you happen to
encounter a set of translations from relfilenode C to relfilenode D,
you could stash that set of translations away somewhere, so that if
the next transaction you process needs that set of mappings, it's
already computed.  With that approach, you'd never have to pre-read
the same set of WAL files more than once.

But, as I think about it more, that's not very different from your
idea of stashing the translations someplace other than WAL in the
first place.  I mean, if the read-ahead thread generates a series of
files in pg_somethingorother that contain those maps, you could have
just written the maps to that directory in the first place.  So on
further review I think we could adopt that approach.

However, I'm leery about the idea of using a relation fork for this.
I'm not sure whether that's what you had it mind, but it gives me the
willies.  First, it adds distributed overhead to the system, as
previously discussed; and second, I think the accounting may be kind
of tricky, especially in the face of multiple rewrites.  I'd be more
inclined to find a separate place to store the mappings.  Note that,
AFAICS, there's no real need for the mapping file to be
block-structured, and I believe they'll be written first (with no
readers) and subsequently only read (with no further writes) and
eventually deleted.

One possible objection to this is that it would preclude decoding on a
standby, which seems like a likely enough thing to want to do.  So
maybe it's best to WAL-log the changes to the mapping file so that the
standby can reconstruct it if needed.

> I think that'd be pretty similar to just disallowing VACUUM
> FREEZE/CLUSTER on catalog relations since effectively it'd be to
> expensive to use.

This seems unduly pessimistic to me; unless the catalogs are really
darn big, this is a mostly theoretical problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-24 10:59:21 -0400, Robert Haas wrote:
> On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
> >> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> >> That strikes me as a flaw in the implementation rather than the idea.
> >> >> You're presupposing a patch where the necessary information is
> >> >> available in WAL yet you don't make use of it at the proper time.
> >> >
> >> > The problem is that the mapping would be somewhere *ahead* from the
> >> > transaction/WAL we're currently decoding. We'd need to read ahead till
> >> > we find the correct one.
> >>
> >> Yes, I think that's what you need to do.
> >
> > My problem with that is that rewrite can be gigabytes into the future.
> >
> > When reading forward we could either just continue reading data into the
> > reorderbuffer, but delay replaying all future commits till we found the
> > currently needed remap. That might have quite the additional
> > storage/memory cost, but runtime complexity should be the same as normal
> > decoding.
> > Or we could individually read ahead for every transaction. But doing so
> > for every transaction will get rather expensive (rougly O(amount_of_wal^2)).
> 
> [ Sorry it's taken me a bit of time to get back to this; other tasks
> intervened, and I also just needed some time to let it settle in my
> brain. ]

No worries. I've had enough things to work on ;)

> If you read ahead looking for a set of ctid translations from
> relfilenode A to relfilenode B, and along the way you happen to
> encounter a set of translations from relfilenode C to relfilenode D,
> you could stash that set of translations away somewhere, so that if
> the next transaction you process needs that set of mappings, it's
> already computed.  With that approach, you'd never have to pre-read
> the same set of WAL files more than once.

> But, as I think about it more, that's not very different from your
> idea of stashing the translations someplace other than WAL in the
> first place.  I mean, if the read-ahead thread generates a series of
> files in pg_somethingorother that contain those maps, you could have
> just written the maps to that directory in the first place.  So on
> further review I think we could adopt that approach.

Yea, that basically was my reasoning, only expressed much more nicely ;)

> However, I'm leery about the idea of using a relation fork for this.
> I'm not sure whether that's what you had it mind, but it gives me the
> willies.  First, it adds distributed overhead to the system, as
> previously discussed; and second, I think the accounting may be kind
> of tricky, especially in the face of multiple rewrites.  I'd be more
> inclined to find a separate place to store the mappings.  Note that,
> AFAICS, there's no real need for the mapping file to be
> block-structured, and I believe they'll be written first (with no
> readers) and subsequently only read (with no further writes) and
> eventually deleted.

I was thinking of storing it alongside other data used during logical
decoding and letting decoding's cleanup clean up that data as well. All
the information for that should be there.

There's one snag I can currently see, namely that we actually need to
prevent a formerly dropped relfilenode from getting reused. Not
entirely sure what the best way to do that is.

> One possible objection to this is that it would preclude decoding on a
> standby, which seems like a likely enough thing to want to do.  So
> maybe it's best to WAL-log the changes to the mapping file so that the
> standby can reconstruct it if needed.

The mapping file can probably be one big WAL record, so it should be
easy enough to do.

For a moment I thought there's a problem with decoding on the standby
having to read ahead of the current location to find the newer mapping,
but that's actually not required since we're protected by the AEL lock
during rewrites on the standby as well.

> > I think that'd be pretty similar to just disallowing VACUUM
> > FREEZE/CLUSTER on catalog relations since effectively it'd be to
> > expensive to use.
> 
> This seems unduly pessimistic to me; unless the catalogs are really
> darn big, this is a mostly theoretical problem.

Well, it's not the size of the relation, but the amount of concurrent
WAL that's being generated that matters. But anyway, if we do it like
you described above that shouldn't be a problem.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-21 16:15:58 +0200, Andres Freund wrote:
> > I don't think I understand exactly what you have in mind for (2); can
> > you elaborate?  I have always thought that having a
> > WaitForDecodingToCatchUp() primitive was a good way of handling
> > changes that were otherwise too difficult to track our way through.  I
> > am not sure you're doing that at all right now, which in some sense I
> > guess is fine, but I haven't really understood your aversion to this
> > solution.  There are some locking issues to be worked out here, but
> > the problems don't seem altogether intractable.
> 
> So, what we need to do for rewriting catalog tables would be:
> 1) lock table against writes
> 2) wait for all in-progress xacts to finish, they could have modified
>    the table in question (we don't keep locks on system tables)
> 3) acquire xlog insert pointer
> 4) wait for all logical decoding actions to read past that pointer
> 5) upgrade the lock to an access exclusive one
> 6) perform vacuum full as usual
> 
> The lock upgrade hazards in here are the reason I am adverse to the
> solution. And I don't see how we can avoid them, since in order for
> decoding to catchup it has to be able to read from the
> catalog... Otherwise it's easy enough to implement.

So, I thought about this for some more and I think I've a partial
solution to the problem.

The worst thing about deadlocks that occur in the above is that they
could be the VACUUM FULL waiting for the "restart LSN"[1] of a decoding
slot to progress, but the restart LSN cannot progress because the slot
is waiting for a xid/transaction to end which is being blocked by the
lock upgrade from VACUUM FULL. Such conflicts are not visible to the
deadlock detector, which obviously is bad.
I've prototyped this (~25 lines) and this happens pretty frequently. But
it turns out that we can actually fix this by exporting (to shared
memory) the oldest in-progress xid of a decoding slot. Then the waiting
code can do a XactLockTableWait() for that xid...
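
As a rough sketch, the waiting side could then look something like this
(slot->effective_xmin is just a made-up name for whatever shared memory
field ends up exporting the slot's oldest in-progress xid):

    /* wait for the decoding slot's oldest in-progress transaction to end */
    TransactionId slot_xid = slot->effective_xmin;      /* read from shmem */

    if (TransactionIdIsValid(slot_xid) &&
        TransactionIdIsInProgress(slot_xid))
        XactLockTableWait(slot_xid);    /* goes through the lock manager,
                                         * so the deadlock detector sees it */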

I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
that's already the case for VACUUM FULLs of system tables, although less
likely. And it will be detected/handled.
There's one more snag though, we currently allow CLUSTER system_table;
in an existing transaction. I think that'd have to be disallowed.

What do you think?

Greetings,

Andres Freund

[1] The "restart LSN" is the point from where we need to be able read
WAL to replay all changes the receiving side hasn't acked yet.

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Andres Freund
Date:
Hi,

On 2013-10-22 16:07:16 +0200, Andres Freund wrote:
> On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
> > Current draft is:
> > ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
> > ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;
> > 
> > which leaves the door open for
> > 
> > ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';
> > 
> > Does anybody have a strong feeling about requiring support for CREATE
> > TABLE for this?
> 
> Attached is a patch on top of master implementing this syntax. It's not
> wired up to the changeset extraction patch yet as I am not sure whether
> others agree about the storage.

So, I am currently wondering about how to store the "old" tuple, based
on this. Currently it is stored using the TupleDesc of the index the old
tuple is based on. But if we want to allow transporting the entire tuple
that obviously cannot be the only option.
One option would be to change the stored format based on what's
configured, using the relation's TupleDesc if FULL is used. But I think
always using the heap relation's desc is better.
The not-logged columns would then just be represented as NULLs. That
will make old primary keys bigger if the relation has a high number of
columns and the key small, but I don't think it matters enough.
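
To illustrate, forming that "old key" tuple could look roughly like the
following sketch (key_attrs, oldtup and key_tuple are made-up names; the
key columns are assumed to already be available as a bitmapset, e.g. from
RelationGetIndexAttrBitmap()):

    /* build the "old" tuple using the heap relation's descriptor */
    TupleDesc   desc = RelationGetDescr(relation);
    Datum      *values = palloc(desc->natts * sizeof(Datum));
    bool       *nulls = palloc(desc->natts * sizeof(bool));
    int         attno;

    for (attno = 0; attno < desc->natts; attno++)
    {
        if (bms_is_member(attno + 1 - FirstLowInvalidHeapAttributeNumber,
                          key_attrs))
            values[attno] = heap_getattr(oldtup, attno + 1, desc,
                                         &nulls[attno]);
        else
            nulls[attno] = true;        /* not-logged column, stored as NULL */
    }

    key_tuple = heap_form_tuple(desc, values, nulls);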

Opinions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.4

From
Robert Haas
Date:
On Fri, Oct 25, 2013 at 10:58 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> So, I am currently wondering about how to store the "old" tuple, based
> on this. Currently it is stored using the TupleDesc of the index the old
> tuple is based on. But if we want to allow transporting the entire tuple
> that obviously cannot be the only option.
> One option would be to change the stored format based on what's
> configured, using the relation's TupleDesc if FULL is used. But I think
> always using the heap relation's desc is better.

I heartily agree.

> The not-logged columns would then just be represented as NULLs. That
> will make old primary keys bigger if the relation has a high number of
> columns and the key small, but I don't think it matters enough.

Even if it does matter, the cure seems likely to be worse than the disease.

My only other comment is that if NONE is selected, we ought to omit
the old tuple altogether, not store one that is all-nulls.  But I bet
you had that in mind anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Fri, Oct 25, 2013 at 7:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> However, I'm leery about the idea of using a relation fork for this.
>> I'm not sure whether that's what you had it mind, but it gives me the
>> willies.  First, it adds distributed overhead to the system, as
>> previously discussed; and second, I think the accounting may be kind
>> of tricky, especially in the face of multiple rewrites.  I'd be more
>> inclined to find a separate place to store the mappings.  Note that,
>> AFAICS, there's no real need for the mapping file to be
>> block-structured, and I believe they'll be written first (with no
>> readers) and subsequently only read (with no further writes) and
>> eventually deleted.
>
> I was thinking of storing it alongside other data used during logical
> decoding and letting decoding's cleanup remove that data as well. All the
> information for that should be there.

That seems OK.

> There's one snag I can currently see, namely that we actually need to
> prevent a formerly dropped relfilenode from being reused. Not
> entirely sure what the best way to do that is.

I'm not sure in detail, but it seems to me that this is all part of the
same picture.  If you're tracking changed relfilenodes, you'd better
track dropped ones as well.  Completely aside from this issue, what
keeps a relation from being dropped before we've decoded all of the
changes made to its data before the point at which it was dropped?  (I
hope the answer isn't "nothing".)

>> One possible objection to this is that it would preclude decoding on a
>> standby, which seems like a likely enough thing to want to do.  So
>> maybe it's best to WAL-log the changes to the mapping file so that the
>> standby can reconstruct it if needed.
>
> The mapping file probably can be one big wal record, so it should be
> easy enough to do.

It might be better to batch it, because if you rewrite a big relation,
and the record is really big, everyone else will be frozen out of
inserting WAL for as long as that colossal record is being written and
synced.  If it's inserted in reasonably-sized chunks, the rest of the
system won't be starved as badly.
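
E.g., purely as a sketch (the chunk size and the record/rmgr ids are
placeholders, not anything the patch defines):

    /* log a rewrite mapping in modest chunks rather than one huge record */
    #define MAPPING_CHUNK   (8 * 1024)          /* made-up chunk size */

    static void
    log_mapping_file(char *data, Size len)
    {
        Size        off = 0;

        while (off < len)
        {
            XLogRecData rdata;
            Size        thislen = Min(MAPPING_CHUNK, len - off);

            rdata.data = data + off;
            rdata.len = thislen;
            rdata.buffer = InvalidBuffer;
            rdata.buffer_std = false;
            rdata.next = NULL;

            /* a real record would also carry the mapping file's identity
             * and the offset of this chunk */
            XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_REWRITE, &rdata);
            off += thislen;
        }
    }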

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> So, I thought about this for some more and I think I've a partial
> solution to the problem.
>
> The worst thing about deadlocks that occur in the above is that they
> could be the VACUUM FULL waiting for the "restart LSN"[1] of a decoding
> slot to progress, but the restart LSN cannot progress because the slot
> is waiting for a xid/transaction to end which is being blocked by the
> lock upgrade from VACUUM FULL. Such conflicts are not visible to the
> deadlock detector, which obviously is bad.
> I've prototyped this (~25 lines) and this happens pretty frequently. But
> it turns out that we can actually fix this by exporting (to shared
> memory) the oldest in-progress xid of a decoding slot. Then the waiting
> code can do a XactLockTableWait() for that xid...
>
> I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
> that's already the case for VACUUM FULLs of system tables, although less
> likely. And it will be detected/handled.
> There's one more snag though, we currently allow CLUSTER system_table;
> in an existing transaction. I think that'd have to be disallowed.

It wouldn't bother me too much to restrict CLUSTER system_table by
PreventTransactionChain() at wal_level = logical, but obviously it
would be nicer if we *didn't* have to do that.

In general, I don't think waiting on an XID is sufficient because a
process can acquire a heavyweight lock without having an XID.  Perhaps
use the VXID instead?

One thought I had about waiting for decoding to catch up is that you
might do it before acquiring the lock.  Of course, you then have a
problem if you get behind again before acquiring the lock.  It's
tempting to adopt the solution we used for RangeVarGetRelidExtended,
namely: wait for catchup without the lock, acquire the lock, see
whether we're still caught up if so great else release lock and loop.
But there's probably too much starvation risk to get away with that.

On the whole, I'm leaning toward thinking that the other solution
(recording the old-to-new CTID mappings generated by CLUSTER to the
extent that they are needed) is probably more elegant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-28 12:04:01 -0400, Robert Haas wrote:
> On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> > I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
> > that's already the case for VACUUM FULLs of system tables, although less
> > likely. And it will be detected/handled.
> > There's one more snag though, we currently allow CLUSTER system_table;
> > in an existing transaction. I think that'd have to be disallowed.
> 
> It wouldn't bother me too much to restrict CLUSTER system_table by
> PreventTransactionChain() at wal_level = logical, but obviously it
> would be nicer if we *didn't* have to do that.
> 
> In general, I don't think waiting on an XID is sufficient because a
> process can acquire a heavyweight lock without having an XID.  Perhaps
> use the VXID instead?

But decoding doesn't care about transactions that haven't "used" an XID
yet (since that means they haven't modified the catalog), so that
shouldn't be problematic.

> One thought I had about waiting for decoding to catch up is that you
> might do it before acquiring the lock.  Of course, you then have a
> problem if you get behind again before acquiring the lock.  It's
> tempting to adopt the solution we used for RangeVarGetRelidExtended,
> namely: wait for catchup without the lock, acquire the lock, see
> whether we're still caught up if so great else release lock and loop.
> But there's probably too much starvation risk to get away with that.

I think we'd pretty much always starve in that case. It'd be different
if we could detect that there weren't any writes to the table
in between. I can see doing that using a locking hack like autovac uses,
but brr, that'd be ugly.

> On the whole, I'm leaning toward thinking that the other solution
> (recording the old-to-new CTID mappings generated by CLUSTER to the
> extent that they are needed) is probably more elegant.

I personally still think that the "wide cmin/cmax" solution is *much*
more elegant, simpler and actually can be used for other things than
logical decoding.
Since you don't seem to agree I am going to write a prototype using such
a mapping to see how it will look though.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Mon, Oct 28, 2013 at 12:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> In general, I don't think waiting on an XID is sufficient because a
>> process can acquire a heavyweight lock without having an XID.  Perhaps
>> use the VXID instead?
>
> But decoding doesn't care about transactions that haven't "used" an XID
> yet (since that means they haven't modified the catalog), so that
> shouldn't be problematic.

Hmm, maybe.  But what if the deadlock has more members?  e.g. A is
blocking decoding by holding AEL w/no XID, and B is blocking A by
doing VF on a rel A needs, and decoding is blocking B.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
> > There's one snag I can currently see, namely that we actually need to
> > prevent a formerly dropped relfilenode from being reused. Not
> > entirely sure what the best way to do that is.
> 
> I'm not sure in detail, but it seems to me that this all part of the
> same picture.  If you're tracking changed relfilenodes, you'd better
> track dropped ones as well.

What I am thinking about is the way GetNewRelFileNode() checks for
preexisting relfilenodes. It uses SnapshotDirty to scan for existing
relfilenodes for a newly created oid. Which means already dropped
relations could be reused.
I guess it could be as simple as using SatisfiesAny (or even better a
wrapper around SatisfiesVacuum that knows about recently dead tuples).

> Completely aside from this issue, what
> keeps a relation from being dropped before we've decoded all of the
> changes made to its data before the point at which it was dropped?  (I
> hope the answer isn't "nothing".)

Nothing. But there's no need to prevent it, it'll still be in the
catalog and we don't ever access a non-catalog relation's data during
decoding.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
>> > There's one snag I can currently see, namely that we actually need to
>> > prevent a formerly dropped relfilenode from being reused. Not
>> > entirely sure what the best way to do that is.
>>
>> I'm not sure in detail, but it seems to me that this is all part of the
>> same picture.  If you're tracking changed relfilenodes, you'd better
>> track dropped ones as well.
>
> What I am thinking about is the way GetNewRelFileNode() checks for
> preexisting relfilenodes. It uses SnapshotDirty to scan for existing
> relfilenodes for a newly created oid. Which means already dropped
> relations could be reused.
> I guess it could be as simple as using SatisfiesAny (or even better a
> wrapper around SatisfiesVacuum that knows about recently dead tuples).

I think modifying GetNewRelFileNode() is attacking the problem from
the wrong end.  The point is that when a table is dropped, that fact
can be communicated to the same machinery that's been tracking
the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
in relfilenode 12345 are now in relfilenode 67890 in these new
positions", it can say "hey, the tuples that were in relfilenode 12345
are now GONE".

>> Completely aside from this issue, what
>> keeps a relation from being dropped before we've decoded all of the
>> changes made to its data before the point at which it was dropped?  (I
>> hope the answer isn't "nothing".)
>
> Nothing. But there's no need to prevent it, it'll still be in the
> catalog and we don't ever access a non-catalog relation's data during
> decoding.

Oh, right.  But what about a drop of a user-catalog table?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.2

From
Andres Freund
Date:
On 2013-10-29 11:28:44 -0400, Robert Haas wrote:
> On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
> >> > There's one snag I can currently see, namely that we actually need to
> >> > prevent a formerly dropped relfilenode from being reused. Not
> >> > entirely sure what the best way to do that is.
> >>
> >> I'm not sure in detail, but it seems to me that this is all part of the
> >> same picture.  If you're tracking changed relfilenodes, you'd better
> >> track dropped ones as well.
> >
> > What I am thinking about is the way GetNewRelFileNode() checks for
> > preexisting relfilenodes. It uses SnapshotDirty to scan for existing
> > relfilenodes for a newly created oid. Which means already dropped
> > relations could be reused.
> > I guess it could be as simple as using SatisfiesAny (or even better a
> > wrapper around SatisfiesVacuum that knows about recently dead tuples).
>
> I think modifying GetNewRelFileNode() is attacking the problem from
> the wrong end.  The point is that when a table is dropped, that fact
> can be communicated to the same machinery that's been tracking
> the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
> in relfilenode 12345 are now in relfilenode 67890 in these new
> positions", it can say "hey, the tuples that were in relfilenode 12345
> are now GONE".

Unfortunately I don't understand what you're suggesting. What I am
worried about is something like:

<- decoding is here
VACUUM FULL pg_class; -- rewrites filenode 1 to 2
VACUUM FULL pg_class; -- rewrites filenode 2 to 3
VACUUM FULL pg_class; -- rewrites filenode 3 to 1
<- now decode up to here

In this case there are two possible (cmin,cmax) values for a specific
tuple. One from the original filenode 1 and one for the one generated
from 3.
Now that will only happen if there's an oid wraparound which hopefully
shouldn't happen very often, but I'd like to not rely on that.

> >> Completely aside from this issue, what
> >> keeps a relation from being dropped before we've decoded all of the
> >> changes made to its data before the point at which it was dropped?  (I
> >> hope the answer isn't "nothing".)
> >
> > Nothing. But there's no need to prevent it, it'll still be in the
> > catalog and we don't ever access a non-catalog relation's data during
> > decoding.
>
> Oh, right.  But what about a drop of a user-catalog table?

Currently nothing prevents that. I am not sure it's worth worrying about
it, do you think we should?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.2

From
Robert Haas
Date:
On Tue, Oct 29, 2013 at 11:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I think modifying GetNewRelFileNode() is attacking the problem from
>> the wrong end.  The point is that when a table is dropped, that fact
>> can be communicated to the same machinery that's been tracking
>> the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
>> in relfilenode 12345 are now in relfilenode 67890 in these new
>> positions", it can say "hey, the tuples that were in relfilenode 12345
>> are now GONE".
>
> Unfortunately I don't understand what you're suggesting. What I am
> worried about is something like:
>
> <- decoding is here
> VACUUM FULL pg_class; -- rewrites filenode 1 to 2
> VACUUM FULL pg_class; -- rewrites filenode 2 to 3
> VACUUM FULL pg_class; -- rewrites filenode 3 to 1
> <- now decode up to here
>
> In this case there are two possible (cmin,cmax) values for a specific
> tuple. One from the original filenode 1 and one for the one generated
> from 3.
> Now that will only happen if there's an oid wraparound which hopefully
> shouldn't happen very often, but I'd like to not rely on that.

Ah, OK.  I didn't properly understand the scenario you were concerned
about.  There's only a potential problem here if we get behind by more
than 4 billion relfilenodes, which seems remote, but maybe not:

http://www.pgcon.org/2013/schedule/events/595.en.html

This still seems to me to be basically an accounting problem.  At any
given time, we should *know* where the catalog tuples are located.  We
can't be decoding changes that require a given system catalog while
that system catalog is locked, so any given decoding operation happens
either before or after, not during, the rewrite of the corresponding
catalog.  As long as that VACUUM FULL operation is responsible for
updating the logical decoding metadata, we should be fine.  Any
relcache entries referencing the old relfilenode need to be
invalidated, and any CTID->[cmin,cmax] maps we're storing for those
old relfilenodes need to be invalidated, too.

>> >> Completely aside from this issue, what
>> >> keeps a relation from being dropped before we've decoded all of the
>> >> changes made to its data before the point at which it was dropped?  (I
>> >> hope the answer isn't "nothing".)
>> >
>> > Nothing. But there's no need to prevent it, it'll still be in the
>> > catalog and we don't ever access a non-catalog relation's data during
>> > decoding.
>>
>> Oh, right.  But what about a drop of a user-catalog table?
>
> Currently nothing prevents that. I am not sure it's worth worrying about
> it, do you think we should?

Maybe.  Depends partly on how ugly things get if it happens, I suppose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.5

From
Andres Freund
Date:
Hi,

Attached to this mail and in the xlog-decoding-rebasing-remapping branch
in my git[1] repository you can find the next version of the patchset that:
* Fixes full table rewrites of catalog tables using the method Robert
  prefers (which is to log rewrite mappings to disk)
* Extract the REPLICA IDENTITY as configured with ALTER TABLE for the
  old tuple for UPDATEs and DELETEs
* Much better support for synchronous replication
* Better resource cleanup (as in we need less local WAL available)
* Lots of smaller fixes

The change around REPLICA IDENTITY is *incompatible* with older output
plugins since we now log tuples using the table's TupleDesc, not the
index's.

Robert, I'd be very grateful if you could have a look at patch 0007
implementing what we've discussed. I kept it separate to make it easier
to look at it in isolation, but I think in the end it partially should
be merged into the wal_level=logical patch.
I still think the "wide cmin/cmax" solution is more elegant and has
wider applicability, but this works as well although it's about 5 times
the code.

Comments?

[1]: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.5

From
Robert Haas
Date:
On Tue, Nov 5, 2013 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
> in my git[1] repository you can find the next version of the patchset that:

I have pushed patches #1 and #2 from this series as a single commit,
after some editing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.5

From
Robert Haas
Date:
On Fri, Nov 8, 2013 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 5, 2013 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
>> in my git[1] repository you can find the next version of the patchset that:
>
> I have pushed patches #1 and #2 from this series as a single commit,
> after some editing.

And I've also pushed patch #13, which is an almost-totally-unrelated
improvement that has nothing to do with logical replication, but is
useful all the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.5

From
Peter Eisentraut
Date:
On 11/8/13, 3:03 PM, Robert Haas wrote:
> On Fri, Nov 8, 2013 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Nov 5, 2013 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
>>> in my git[1] repository you can find the next version of the patchset that:
>>
>> I have pushed patches #1 and #2 from this series as a single commit,
>> after some editing.
> 
> And I've also pushed patch #13, which is an almost-totally-unrelated
> improvement that has nothing to do with logical replication, but is
> useful all the same.

Please fix this new compiler warning:

pg_regress_ecpg.c: In function ‘main’:
pg_regress_ecpg.c:170:2: warning: passing argument 3 of ‘regression_main’ from incompatible pointer type [enabled by default]
In file included from pg_regress_ecpg.c:19:0:
../../../../src/test/regress/pg_regress.h:55:5: note: expected ‘init_function’ but argument is of type ‘void (*)(void)’



Re: logical changeset generation v6.5

From
Andres Freund
Date:
On 2013-11-08 17:11:58 -0500, Peter Eisentraut wrote:
> On 11/8/13, 3:03 PM, Robert Haas wrote:
> > On Fri, Nov 8, 2013 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Tue, Nov 5, 2013 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >>> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
> >>> in my git[1] repository you can find the next version of the patchset that:
> >>
> >> I have pushed patches #1 and #2 from this series as a single commit,
> >> after some editing.
> >
> > And I've also pushed patch #13, which is an almost-totally-unrelated
> > improvement that has nothing to do with logical replication, but is
> > useful all the same.
>
> Please fix this new compiler warning:
>
> pg_regress_ecpg.c: In function ‘main’:
> pg_regress_ecpg.c:170:2: warning: passing argument 3 of ‘regression_main’ from incompatible pointer type [enabled by default]
> In file included from pg_regress_ecpg.c:19:0:
> ../../../../src/test/regress/pg_regress.h:55:5: note: expected ‘init_function’ but argument is of type ‘void (*)(void)’

Hrmpf...

I usually run something akin to
# make -j3 -s && (cd contrib && make -j3 -s)
and then in a separate step
# make -s check-world
this is so I see compiler warnings before drowning them in check-world's
output. But ecpg/test isn't built during make in src/interfaces/ecpg,
but just during make check there.

ISTM ecpg's regression tests should be built (not run!) during
$(recurse) not just during make check. Patch towards that end attached.

Also attached is the fix for the compilation warning itself.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.5

From
Steve Singer
Date:
On 11/05/2013 10:21 AM, Andres Freund wrote:
> Hi,
>
> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
> in my git[1] repository you can find the next version of the patchset that:
> * Fixes full table rewrites of catalog tables using the method Robert
>    prefers (which is to log rewrite mappings to disk)
> * Extract the REPLICA IDENTITY as configured with ALTER TABLE for the
>    old tuple for UPDATEs and DELETEs
> * Much better support for synchronous replication
> * Better resource cleanup (as in we need less local WAL available)
> * Lots of smaller fixes
> The change around REPLICA IDENTITY is *incompatible* to older output
> plugins since we now log tuples using the table's TupleDesc, not the
> indexes.

My updated plugin is getting rows with
change->tp.oldtuple as NULL on updates, either with the default PRIMARY
KEY identity or with a FULL identity.

When I try the test_decoding plugin on UPDATE I get rows like:

table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:1 
ii_reserved[int8]:144 ii_total_sold[int8]:911

which I think is only data from the new tuple.    The lack of "old-key" 
in the output makes me think the test decoding plugin also isn't getting 
the old tuple.

(This is with your patch-set rebased ontop of 
ac4ab97ec05ea900db0f14d428cae2e79832e02d which includes the patches 
Robert committed the other day, I can't rule out that I didn't break 
something in the rebase).





> Robert, I'd be very grateful if you could have a look at patch 0007
> implementing what we've discussed. I kept it separate to make it easier
> to look at it in isolation, but I think in the end it partially should
> be merged into the wal_level=logical patch.
> I still think the "wide cmin/cmax" solution is more elegant and has
> wider applicability, but this works as well although it's about 5 times
> the code.
>
> Comments?
>
> [1]: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> Greetings,
>
> Andres Freund
>
>
>




Re: logical changeset generation v6.5

From
Andres Freund
Date:
On 2013-11-09 17:36:49 -0500, Steve Singer wrote:
> On 11/05/2013 10:21 AM, Andres Freund wrote:
> >Hi,
> >
> >Attached to this mail and in the xlog-decoding-rebasing-remapping branch
> >in my git[1] repository you can find the next version of the patchset that:
> >* Fixes full table rewrites of catalog tables using the method Robert
> >   prefers (which is to log rewrite mappings to disk)
> >* Extract the REPLICA IDENTITY as configured with ALTER TABLE for the
> >   old tuple for UPDATEs and DELETEs
> >* Much better support for synchronous replication
> >* Better resource cleanup (as in we need less local WAL available)
> >* Lots of smaller fixes
> >The change around REPLICA IDENTITY is *incompatible* to older output
> >plugins since we now log tuples using the table's TupleDesc, not the
> >indexes.
> 
> My updated plugin is getting rows with
> change->tp.oldtuple as NULL on updates either with the default PRIMARY KEY
> identity or with a FULL identity.
> 
> When I try the test_decoding plugin on UPDATE I get rows like:
> 
> table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:1
> ii_reserved[int8]:144 ii_total_sold[int8]:911
> 
> which I think is only data from the new tuple.    The lack of "old-key" in
> the output makes me think the test decoding plugin also isn't getting the
> old tuple.
> 
> (This is with your patch-set rebased ontop of
> ac4ab97ec05ea900db0f14d428cae2e79832e02d which includes the patches Robert
> committed the other day, I can't rule out that I didn't break something in
> the rebase).

I've pushed an updated tree to git, that contains that

http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-remapping
git://git.postgresql.org/git/users/andresfreund/postgres.git

and some more fixes. I'll send out an email with details sometime soon.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.5

From
Steve Singer
Date:
On 11/09/2013 05:42 PM, Andres Freund wrote:
> On 2013-11-09 17:36:49 -0500, Steve Singer wrote:
>> On 11/05/2013 10:21 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> Attached to this mail and in the xlog-decoding-rebasing-remapping branch
>>> in my git[1] repository you can find the next version of the patchset that:
>>> * Fixes full table rewrites of catalog tables using the method Robert
>>>    prefers (which is to log rewrite mappings to disk)
>>> * Extract the REPLICA IDENTITY as configured with ALTER TABLE for the
>>>    old tuple for UPDATEs and DELETEs
>>> * Much better support for synchronous replication
>>> * Better resource cleanup (as in we need less local WAL available)
>>> * Lots of smaller fixes
>>> The change around REPLICA IDENTITY is *incompatible* to older output
>>> plugins since we now log tuples using the table's TupleDesc, not the
>>> indexes.
>> My updated plugin is getting rows with
>> change->tp.oldtuple as NULL on updates either with the default PRIMARY KEY
>> identity or with a FULL identity.
>>
>> When I try the test_decoding plugin on UPDATE I get rows like:
>>
>> table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:1
>> ii_reserved[int8]:144 ii_total_sold[int8]:911
>>
>> which I think is only data from the new tuple.    The lack of "old-key" in
>> the output makes me think the test decoding plugin also isn't getting the
>> old tuple.
>>
>> (This is with your patch-set rebased ontop of
>> ac4ab97ec05ea900db0f14d428cae2e79832e02d which includes the patches Robert
>> committed the other day, I can't rule out that I didn't break something in
>> the rebase).
> I've pushed an updated tree to git, that contains that
>
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-remapping
> git://git.postgresql.org/git/users/andresfreund/postgres.git
>
> and some more fixes. I'll send out an email with details sometime soon.

93c5c2a171455763995cef0afa907bcfaa405db4

Still gives me the following:
update  disorder.do_inventory set ii_in_stock=2 where ii_id=251;
UPDATE 1
test1=# LOG:  tuple in table with oid: 35122 without primary key

\d disorder.do_inventory
      Table "disorder.do_inventory"
    Column     |  Type  | Modifiers
---------------+--------+-----------
 ii_id         | bigint | not null
 ii_in_stock   | bigint |
 ii_reserved   | bigint |
 ii_total_sold | bigint |
Indexes:
    "do_inventory_pkey" PRIMARY KEY, btree (ii_id)
Foreign-key constraints:
    "do_inventory_item_ref" FOREIGN KEY (ii_id) REFERENCES disorder.do_item(i_id) ON DELETE CASCADE
Referenced by:
    TABLE "disorder.do_item" CONSTRAINT "do_item_inventory_ref" FOREIGN KEY (i_id) REFERENCES disorder.do_inventory(ii_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "disorder.do_restock" CONSTRAINT "do_restock_inventory_ref" FOREIGN KEY (r_i_id) REFERENCES disorder.do_inventory(ii_id) ON DELETE CASCADE
Triggers:
    _disorder_replica_truncatetrigger BEFORE TRUNCATE ON disorder.do_inventory FOR EACH STATEMENT EXECUTE PROCEDURE _disorder_replica.log_truncate('3')
Disabled triggers:
    _disorder_replica_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON disorder.do_inventory FOR EACH ROW EXECUTE PROCEDURE _disorder_replica.denyaccess('_disorder_replica')
    _disorder_replica_truncatedeny BEFORE TRUNCATE ON disorder.do_inventory FOR EACH STATEMENT EXECUTE PROCEDURE _disorder_replica.deny_truncate()
Replica Identity: FULL


The test decoder plugin gives me:

table "do_inventory": UPDATE: old-pkey:


a) The table does have a primary key
b) I don't get anything in the old key when I was expecting all the rows
c)  If I change the table to use the pkey index with
alter table disorder.do_inventory  replica identity using index 
do_inventory_pkey;

The LOG message on the update goes away but the output of the test 
decoder plugin goes back to

table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:5 
ii_reserved[int8]:144 ii_total_sold[int8]:911

Which I suspect means oldtuple is back to null



> Greetings,
>
> Andres Freund
>




Re: logical changeset generation v6.5

From
Andres Freund
Date:
On 2013-11-09 20:16:20 -0500, Steve Singer wrote:
> >>When I try the test_decoding plugin on UPDATE I get rows like:
> >>
> >>table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:1
> >>ii_reserved[int8]:144 ii_total_sold[int8]:911
> >>
> >>which I think is only data from the new tuple.    The lack of "old-key" in
> >>the output makes me think the test decoding plugin also isn't getting the
> >>old tuple.
> >>
> >>(This is with your patch-set rebased ontop of
> >>ac4ab97ec05ea900db0f14d428cae2e79832e02d which includes the patches Robert
> >>committed the other day, I can't rule out that I didn't break something in
> >>the rebase).
> >I've pushed an updated tree to git, that contains that
> >
> >http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-remapping
> >git://git.postgresql.org/git/users/andresfreund/postgres.git
> >
> >and some more fixes. I'll send out an email with details sometime soon.
>
> 93c5c2a171455763995cef0afa907bcfaa405db4
>
> Still give me the following:
> update  disorder.do_inventory set ii_in_stock=2 where ii_id=251;
> UPDATE 1
> test1=# LOG:  tuple in table with oid: 35122 without primary key

Hm. Could it be that you still have an older "test_decoding" plugin
lying around? The current one doesn't contain that string
anymore. That'd explain the problems.
In v6.4 the output plugin API was changed so that plain heap tuples are
passed for the "old" key, although with non-key columns set to
NULL. Earlier it was an "index tuple" as defined by the index's
TupleDesc.

> a) The table does have a primary key
> b) I don't get anything in the old key when I was expecting all the rows
> c)  If I change the table to use the pkey index with
> alter table disorder.do_inventory  replica identity using index
> do_inventory_pkey;
>
> The LOG message on the update goes away but the output of the test decoder
> plugin goes back to
>
> table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:5
> ii_reserved[int8]:144 ii_total_sold[int8]:911
>
> Which I suspect means oldtuple is back to null

Which is legitimate though, if you don't update the primary (or
explicitly chosen candidate) key. Those only get logged if there's
actual changes in those columns.
Makes sense?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.5

From
Steve Singer
Date:
On 11/10/2013 09:41 AM, Andres Freund wrote:
> Still give me the following:
> update  disorder.do_inventory set ii_in_stock=2 where ii_id=251;
> UPDATE 1
> test1=# LOG:  tuple in table with oid: 35122 without primary key
> Hm. Could it be that you still have an older "test_decoding" plugin
> lying around? The current one doesn't contain that string
> anymore. That'd explain the problems.
> In v6.4 the output plugin API was changed that plain heaptuples are
> passed for the "old" key, although with non-key columns set to
> NULL. Earlier it was a "index tuple" as defined by the indexes
> TupleDesc.

Grrr, yah that was the problem I had compiled but not installed the 
newer plugin. Sorry.


>> a) The table does have a primary key
>> b) I don't get anything in the old key when I was expecting all the rows
>> c)  If I change the table to use the pkey index with
>> alter table disorder.do_inventory  replica identity using index
>> do_inventory_pkey;
>>
>> The LOG message on the update goes away but the output of the test decoder
>> plugin goes back to
>>
>> table "do_inventory": UPDATE: ii_id[int8]:251 ii_in_stock[int8]:5
>> ii_reserved[int8]:144 ii_total_sold[int8]:911
>>
>> Which I suspect means oldtuple is back to null
> Which is legitimate though, if you don't update the primary (or
> explicitly chosen candidate) key. Those only get logged if there's
> actual changes in those columns.
> Makes sense?
Is the expectation that plugin writers will call
RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_IDENTITY_KEY)
to figure out what the identity key is?

How do we feel about having the decoder logic populate change.oldtuple
with the identity on UPDATE statements when it is null?  The logic I have
now is to use oldtuple if it is not null, otherwise go figure out which
columns from the identity key we should be using.  I think most plugins
that do anything useful with an update will need to duplicate that.







> Greetings,
>
> Andres Freund
>
> --
>   Andres Freund                       http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services
>
>




Re: logical changeset generation v6.6

From
Andres Freund
Date:
Hi,

Changes since last version:
* fixes around the logging of toasted columns for the REPLICA IDENTITY
  in UPDATE/DELETE. Found due to a question of Robert's.
* Initial documentation for the additional wal_level, but that will
  require additional links once further patches of the series are committed.
* Comment, elog/ereport, indentation improvements in many of the patches
* Add the isolationtester tests in contrib/test_logical_decoding to
  "make check" and introduce "installcheck-force", which forces an
  installcheck run even though it requires special configuration
  parameters.
* the heap rewrite checkpoint code now skips over files not named
  "map-*" instead of complaining if it cannot sscanf() the filename.
* pg_stat_logical_decoding system view: renamed the numeric 'database'
  column to 'dboid' and added a join to pg_database
* Remove several FIXMEs by implementing support for dropping data of
  transactions that were running before a crash.
* Add CRC32 to snapbuild state files

Questions:
* Should we rename (INIT|START|FREE)_LOGICAL_REPLICATION into
  *_LOGICAL_DECODING?
* Should we rename FREE_LOGICAL_REPLICATION into
  STOP_LOGICAL_REPLICATION? stop_logical_replication() currently is the
  SQL level function...

Todo:
* Implement timeline handling. We need to switch timelines when extracting
  changes on a standby. I think we need to call readTimeLineHistory() and
  then tliOfPointInHistory() for every segment.
* Once guc and recovery.conf are merged, we might want to support using
  recovery_command to gather older wal files.

I am starting to be rather happy with the state of the patch.

01 wal_decoding: Add wal_level = logical and log data required for logical decoding
02 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
03 wal_decoding: Add option to use user defined tables as catalog tables
04 wal_decoding: Introduce wal decoding via catalog timetravel
05 wal_decoding: Implement VACUUM FULL/CLUSTER support via rewrite maps
   * should probably be merged with 04, kept separate for review
06 wal_decoding: Only peg the xmin horizon for catalog tables during logical decoding
07 wal_decoding: Allow walsender's to connect to a specific database
08 wal_decoding: logical changeset extraction walsender interface
09 wal_decoding: test_decoding: Add a simple decoding module in contrib
10 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes
11 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding
12 wal_decoding: design document v2.4 and snapshot building design doc v0.5
13 wal_decoding: Temporarily add logical decoding regression tests to everything
   * shouldn't be committed, but it's useful for testing

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.5

From
Andres Freund
Date:
On 2013-11-10 14:45:17 -0500, Steve Singer wrote:
> On 11/10/2013 09:41 AM, Andres Freund wrote:
> >Still give me the following:
> >update  disorder.do_inventory set ii_in_stock=2 where ii_id=251;
> >UPDATE 1
> >test1=# LOG:  tuple in table with oid: 35122 without primary key
> >Hm. Could it be that you still have an older "test_decoding" plugin
> >lying around? The current one doesn't contain that string
> >anymore. That'd explain the problems.
> >In v6.4 the output plugin API was changed that plain heaptuples are
> >passed for the "old" key, although with non-key columns set to
> >NULL. Earlier it was a "index tuple" as defined by the indexes
> >TupleDesc.
> 
> Grrr, yah that was the problem I had compiled but not installed the newer
> plugin. Sorry.

Heh, happened to me several times during development ;)

> >>Which I suspect means oldtuple is back to null
> >Which is legitimate though, if you don't update the primary (or
> >explicitly chosen candidate) key. Those only get logged if there's
> >actual changes in those columns.
> >Makes sense?
> Is the expectation that plugin writers will call
> RelationGetIndexAttrBitmap(relation,INDEX_ATTR_BITMAP_IDENTITY_KEY);
> to figure out what the identity key is.

I'd expect them to check whether relreplident is FULL, NOTHING or
DEFAULT|INDEX. In the latter case they can check
Relation->rd_replidindex. The bitmap doesn't really seem to be helpful?
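
For instance, an output plugin's change callback could do something like
this (the REPLICA_IDENTITY_* names stand in for whatever constants the
patch ends up defining for relreplident):

    char        relreplident = relation->rd_rel->relreplident;

    if (relreplident == REPLICA_IDENTITY_NOTHING)
    {
        /* no old key is ever logged for this table */
    }
    else if (relreplident == REPLICA_IDENTITY_FULL)
    {
        /* oldtuple, when present, carries all columns of the old row */
    }
    else
    {
        /*
         * DEFAULT or USING INDEX: the key columns are those of
         * relation->rd_replidindex.  If oldtuple is NULL, the key columns
         * didn't change and can be read from the new tuple instead.
         */
    }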

> How do we feel about having the decoder logic populate change.oldtuple with
> the identity  on UPDATE statements when it is null?

Not really keen - that'd be a noticeable overhead. Note that in the
cases where DEFAULT|INDEX is used, you can just use the new tuple to
extract what you need for the pkey lookup since they now have the same
format and since it's guaranteed that the relevant columns haven't
changed if oldtup is null and there's a key.

What are you actually doing with those columns? Populating a WHERE
clause?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.6

From
Robert Haas
Date:
On Mon, Nov 11, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> [ updated patch-set ]

I'm pretty happy with what's now patch #1, f/k/a known as patch #3,
and probably somewhere else in the set before that.  At any rate, I
refer to 0001-wal_decoding-Add-wal_level-logical-and-log-data-requ.patch.gz.

I think the documentation still needs a bit of work.  It's not
particularly clean to just change all the places that refer to the
need to set wal_level to archive (or hot_standby) level to instead
refer to archive (or hot_standby, logical).  If we're going to do it
that way, I think we definitely need a connecting word between
hot_standby and logical, specifically "or".  But I'm wondering if
would be better to instead change those places to say archive (or any
higher setting).

You've actually changed the meaning of this section (and not in a good way):
        be set at server start. <varname>wal_level</> must be set
-        to <literal>archive</> or <literal>hot_standby</> to allow
-        connections from standby servers.
+        to <literal>archive</>, <literal>hot_standby</> or <literal>logical</>
+        to allow connections from standby servers.

I think that the previous text meant that you needed archive - or, if
you want to allow connections, hot_standby.  The new text loses that
nuance.

I'm tempted to think that we're better off calling this "logical
decoding" rather than "logical replication".  At least, we should
standardize on one or the other.  If we go with "decoding", then fix
these:

+                * For logical replication, we need the tuple even if
we're doing a
+/* Do we need to WAL-log information required only for Hot Standby
and logical replication? */
+/* Do we need to WAL-log information required only for logical replication? */
(and we should go back and correct the instance already added to the
ALTER TABLE documentation)

Is there any special reason why RelationIsLogicallyLogged(), which is
basically a three-pronged test, does one of those tests in a macro and
defers the other two to a function?  Why not just put it all in the
macro?
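
For illustration, the all-in-one-macro variant might look roughly like
this (the three names stand in for whatever the patch calls the wal_level
test, the WAL-logging test and the catalog test):

    #define RelationIsLogicallyLogged(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         !IsCatalogRelation(relation))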

I did some performance testing on the previous iteration of this
patch, just my usual set of 30-minute pgbench runs.  I tried it with
wal_level=hot_standby and wal_level=logical.  32-clients, scale factor
300, shared_buffers = 8GB, maintenance_work_mem = 4GB,
synchronous_commit = off, checkpoint_segments = 300,
checkpoint_timeout = 15min, checkpoint_completion_target = 0.9.  The
results came out like this:

hot_standby tps = 15070.229005 (including connections establishing)
hot_standby tps = 14769.905054 (including connections establishing)
hot_standby tps = 15119.350014 (including connections establishing)
logical tps = 14713.523689 (including connections establishing)
logical tps = 14799.242855 (including connections establishing)
logical tps = 14557.538038 (including connections establishing)

The runs were interleaved, but I've shown them here grouped by the
wal_level in use.  If you compare the median values, there's about a
1% regression there with wal_level=logical, but that might not even be
significant - and if it is, well, that's why this feature has an off
switch.

-        * than its parent.  Musn't recurse here, or we might get a
stack overflow
+        * than its parent.  May not recurse here, or we might get a
stack overflow

You don't need this change; it doesn't change the meaning.

+        * with fewer than PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine
+        * if didLogXid isn't set for a transaction even though it appears
+        * in a wal record, we'll just superfluously log something.

It'd be good to rewrite this comment to explain under what
circumstances that can happen, or why it can't happen but that it
would be OK if it did.

I think we'd better separate the changes to catalog.c from the rest of
this.  Those are changing semantics in a significant way that needs to
be separately called out.  In particular, a user-created table in
pg_catalog will be able to have indexes defined on it, will be able to
be truncated, will be allowed to have triggers, etc.  I think that's
OK, but it shouldn't be a by-blow of the rest of this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.6

From
Andres Freund
Date:
Hi,

On 2013-11-12 12:13:54 -0500, Robert Haas wrote:
> On Mon, Nov 11, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > [ updated patch-set ]
>
> I'm pretty happy with what's now patch #1, f/k/a known as patch #3,
> and probably somewhere else in the set before that.  At any rate, I
> refer to 0001-wal_decoding-Add-wal_level-logical-and-log-data-requ.patch.gz.

Cool.

> I think the documentation still needs a bit of work.  It's not
> particularly clean to just change all the places that refer to the
> need to set wal_level to archive (or hot_standby) level to instead
> refer to archive (or hot_standby, logical).  If we're going to do it
> that way, I think we definitely need a connecting word between
> hot_standby and logical, specifically "or".

Hm. I tried to make it "archive, hot_standby or logical", but I see I
screwed up along the way.

> But I'm wondering if it
> would be better to instead change those places to say archive (or any
> higher setting).

Works for me. We'd need to make sure there's a clear ordering
recognizable in at least one place, but that's a good idea anyway.

> You've actually changed the meaning of this section (and not in a good way):
>
>          be set at server start. <varname>wal_level</> must be set
> -        to <literal>archive</> or <literal>hot_standby</> to allow
> -        connections from standby servers.
> +        to <literal>archive</>, <literal>hot_standby</> or <literal>logical</>
> +        to allow connections from standby servers.
>
> I think that the previous text meant that you needed archive - or, if
> you want to allow connections, hot_standby.  The new text loses that
> nuance.

Yea, that's because it was lost on me in the first place...

> I'm tempted to think that we're better off calling this "logical
> decoding" rather than "logical replication".  At least, we should
> standardize on one or the other.  If we go with "decoding", then fix
> these:

I agree. It all used to be "logical replication" but this feature really
isn't about the replication, but about the extraction part.

>
> +                * For logical replication, we need the tuple even if
> we're doing a
> +/* Do we need to WAL-log information required only for Hot Standby
> and logical replication? */
> +/* Do we need to WAL-log information required only for logical replication? */
> (and we should go back and correct the instance already added to the
> ALTER TABLE documentation)
>
> Is there any special reason why RelationIsLogicallyLogged(), which is
> basically a three-pronged test, does one of those tests in a macro and
> defers the other two to a function?  Why not just put it all in the
> macro?

We could; I basically didn't want to add too much inlined code
everywhere when wal_level != logical, but the functions have shrunk in
size since then.

> I did some performance testing on the previous iteration of this
> patch, just my usual set of 30-minute pgbench runs.  I tried it with
> wal_level=hot_standby and wal_level=logical.  32-clients, scale factor
> 300, shared_buffers = 8GB, maintenance_work_mem = 4GB,
> synchronous_commit = off, checkpoint_segments = 300,
> checkpoint_timeout = 15min, checkpoint_completion_target = 0.9.  The
> results came out like this:
>
> hot_standby tps = 15070.229005 (including connections establishing)
> hot_standby tps = 14769.905054 (including connections establishing)
> hot_standby tps = 15119.350014 (including connections establishing)
> logical tps = 14713.523689 (including connections establishing)
> logical tps = 14799.242855 (including connections establishing)
> logical tps = 14557.538038 (including connections establishing)
>
> The runs were interleaved, but I've shown them here grouped by the
> wal_level in use.  If you compare the median values, there's about a
> 1% regression there with wal_level=logical, but that might not even be
> significant - and if it is, well, that's why this feature has an off
> switch.

That matches my test and is imo pretty ok. The overhead is from a slight
increase in wal volume because during FPWs we do not just log the FPW
but also the tuples.
It will be worse if primary keys were changed regularly though.

> -        * than its parent.  Musn't recurse here, or we might get a
> stack overflow
> +        * than its parent.  May not recurse here, or we might get a
> stack overflow
>
> You don't need this change; it doesn't change the meaning.

I thought that "Musn't" was a typo, because of the missing t before the
n. But it obviously doesn't have to be part of this patch.

> +        * with fewer than PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine
> +        * if didLogXid isn't set for a transaction even though it appears
> +        * in a wal record, we'll just superfluously log something.
>
> It'd be good to rewrite this comment to explain under what
> circumstances that can happen, or why it can't happen but that it
> would be OK if it did.

Ok.

> I think we'd better separate the changes to catalog.c from the rest of
> this.  Those are changing semantics in a significant way that needs to
> be separately called out.  In particular, a user-created table in
> pg_catalog will be able to have indexes defined on it, will be able to
> be truncated, will be allowed to have triggers, etc.  I think that's
> OK, but it shouldn't be a by-blow of the rest of this patch.

Completely agreed. As evidenced by the fact that the current change
doesn't update all relevant comments & code. I wonder if we shouldn't
leave the function the current way and just add a new function for the
new behaviour.
The hard thing with that would be coming up with a new
name. IsSystemRelationId() having a different behaviour than
IsSystemRelation() seems strange to me, so just keeping that and
adapting the callers seems wrong to me.
IsInternalRelation()? IsCatalogRelation()?

Thanks for the review,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.6

From
Robert Haas
Date:
On Tue, Nov 12, 2013 at 12:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Completely agreed. As evidenced by the fact that the current change
> doesn't update all relevant comments & code. I wonder if we shouldn't
> leave the function the current way and just add a new function for the
> new behaviour.
> The hard thing with that would be coming up with a new
> name. IsSystemRelationId() having a different behaviour than
> IsSystemRelation() seems strange to me, so just keeping that and
> adapting the callers seems wrong to me.
> IsInternalRelation()? IsCatalogRelation()?

Well, I went through and looked at the places that were affected by
this and I tend to think that most places will be happier with the new
definition.  Picking one at random, consider the calls in cluster.c.
The first is used to set the is_system_catalog flag that is passed to
finish_heap_swap(), which controls whether we queue invalidation
messages after doing the CLUSTER.  Well, unless I'm quite mistaken,
user-defined relations in pg_catalog will not have catalog caches and
thus don't need invalidations.  The second call in that file is used
to decide whether to warn about inserts or deletes that appear to be
in progress on a table that we have x-locked; that should only apply
to "real" system catalogs, because other things we create in
pg_catalog won't have short-duration locks.  (Maybe the
user-catalog-tables patch will modify this test; I'm not sure, but if
this needs to work differently it seems that it should be conditional
on that, not what schema the table lives in.)

If there are call sites that want the existing test, maybe we should
have IsRelationInSystemNamespace() for that, and reserve
IsSystemRelation() for the test as to whether it's a bona fide system
catalog.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.6

From
Andres Freund
Date:
On 2013-11-12 13:18:19 -0500, Robert Haas wrote:
> On Tue, Nov 12, 2013 at 12:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Completely agreed. As evidenced by the fact that the current change
> > doesn't update all relevant comments & code. I wonder if we shouldn't
> > leave the function the current way and just add a new function for the
> > new behaviour.
> > The hard thing with that would be coming up with a new
> > name. IsSystemRelationId() having a different behaviour than
> > IsSystemRelation() seems strange to me, so just keeping that and
> > adapting the callers seems wrong to me.
> > IsInternalRelation()? IsCatalogRelation()?
> 
> Well, I went through and looked at the places that were affected by
> this and I tend to think that most places will be happier with the new
> definition.

I agree that many if not most want the new definition.

> If there are call sites that want the existing test, maybe we should
> have IsRelationInSystemNamespace() for that, and reserve
> IsSystemRelation() for the test as to whether it's a bona fide system
> catalog.

The big reason that I think we do not want the new behaviour for all is:
 *        NB: TOAST relations are considered system relations by this test
 *        for compatibility with the old IsSystemRelationName function.
 *        This is appropriate in many places but not all.  Where it's not,
 *        also check IsToastRelation.
 

the current state of things would allow to modify toast relations in
some places :/

I'd suggest renaming the current IsSystemRelation() to your
IsRelationInSystemNamespace() and add IsCatalogRelation() for the new
meaning, so we are sure to break old users.

Let me come up with something like that.

> (Maybe the
> user-catalog-tables patch will modify this test; I'm not sure, but if
> this needs to work differently it seems that it should be conditional
> on that, not what schema the table lives in.)

No, they shouldn't change that. We might want to allow such locking
semantics at some points, but that'd be a separate patch.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v6.5

From
Steve Singer
Date:
On 11/11/2013 02:06 PM, Andres Freund wrote:
> On 2013-11-10 14:45:17 -0500, Steve Singer wrote:
>
> Not really keen - that'd be a noticeable overhead. Note that in the
> cases where DEFAULT|INDEX is used, you can just use the new tuple to
> extract what you need for the pkey lookup since they now have the same
> format and since it's guaranteed that the relevant columns haven't
> changed if oldtup is null and there's a key.
>
> What are you actually doing with those columns? Populating a WHERE
> clause?

Yup building a WHERE clause

> Greetings,
>
> Andres Freund
>




Re: logical changeset generation v6.6

From
Andres Freund
Date:
On 2013-11-12 19:24:39 +0100, Andres Freund wrote:
> On 2013-11-12 13:18:19 -0500, Robert Haas wrote:
> > On Tue, Nov 12, 2013 at 12:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Completely agreed. As evidenced by the fact that the current change
> > > doesn't update all relevant comments & code. I wonder if we shouldn't
> > > leave the function the current way and just add a new function for the
> > > new behaviour.
> > > The hard thing with that would be coming up with a new
> > > name. IsSystemRelationId() having a different behaviour than
> > > IsSystemRelation() seems strange to me, so just keeping that and
> > > adapting the callers seems wrong to me.
> > > IsInternalRelation()? IsCatalogRelation()?
> >
> > Well, I went through and looked at the places that were affected by
> > this and I tend to think that most places will be happier with the new
> > definition.
>
> I agree that many if not most want the new definition.
>
> > If there are call sites that want the existing test, maybe we should
> > have IsRelationInSystemNamespace() for that, and reserve
> > IsSystemRelation() for the test as to whether it's a bona fide system
> > catalog.
>
> The big reason that I think we do not want the new behaviour for all is:
>
>  *        NB: TOAST relations are considered system relations by this test
>  *        for compatibility with the old IsSystemRelationName function.
>  *        This is appropriate in many places but not all.  Where it's not,
>  *        also check IsToastRelation.
>
> the current state of things would allow to modify toast relations in
> some places :/

So, I think I found a useful definition of IsSystemRelation() that fixes
many of the issues with moving relations to pg_catalog: Continue to
treat all pg_toast.* relations as system tables, but only consider
initdb created relations in pg_class.
I've then added IsCatalogRelation() which has a narrower definition of
system relations, namely, it only counts toast tables if they are a
catalog's toast table.
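
Roughly, what I have in mind is something like this (just a sketch of the
intended split, not the final code; the exact namespace/OID checks may
differ in the patch):

    bool
    IsCatalogRelation(Relation relation)
    {
        Oid     relid = RelationGetRelid(relation);
        Oid     relnamespace = relation->rd_rel->relnamespace;

        /* only pg_catalog/pg_toast relations can be catalog relations... */
        if (!IsSystemNamespace(relnamespace) && !IsToastNamespace(relnamespace))
            return false;

        /* ...and only if they were created by initdb */
        return relid < FirstNormalObjectId;
    }

    bool
    IsSystemRelation(Relation relation)
    {
        /* all pg_toast.* relations stay "system", plus the real catalogs */
        return IsToastNamespace(relation->rd_rel->relnamespace) ||
            IsCatalogRelation(relation);
    }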

This allows far more actions on user defined relations moved to
pg_catalog. Now they aren't stuck there anymore and can be renamed,
dropped et al. With one curious exception: We still cannot move a
relation out of pg_catalog.
I've included a hunk to allow creation of indexes on relations in
pg_catalog in heap_create(); indexes on catalog relations are prevented
way above, but maybe that should rather be a separate commit.

What do you think?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.5

From
Peter Eisentraut
Date:
On 11/9/13, 5:56 AM, Andres Freund wrote:
> ISTM ecpg's regression tests should be built (not run!) during
> $(recurse) not just during make check.

Actually, I did just the opposite change some years ago.  The rationale
is, the build builds that which you want to install.



Re: logical changeset generation v6.7

From
Andres Freund
Date:
On 2013-11-12 18:50:33 +0100, Andres Freund wrote:
> > You've actually changed the meaning of this section (and not in a good way):
> >
> >          be set at server start. <varname>wal_level</> must be set
> > -        to <literal>archive</> or <literal>hot_standby</> to allow
> > -        connections from standby servers.
> > +        to <literal>archive</>, <literal>hot_standby</> or <literal>logical</>
> > +        to allow connections from standby servers.
> >
> > I think that the previous text meant that you needed archive - or, if
> > you want to allow connections, hot_standby.  The new text loses that
> > nuance.
>
> Yea, that's because it was lost on me in the first place...

I think that's because the nuance isn't actually in the text - note that
it is talking about max_wal_senders and talking about connections
*from*, not *to* standby servers.
I've reformulated the wal_level paragraph and used "or higher" in
several places now.

Ok, so here's a rebased version of this. I tried to fix all the issues
you mentioned, and it's based on the split off IsSystemRelation() patch,
I've sent yesterday (included here).

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.7

From
Fabrízio de Royes Mello
Date:



On Thu, Nov 14, 2013 at 11:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> On 2013-11-12 18:50:33 +0100, Andres Freund wrote:
> > > You've actually changed the meaning of this section (and not in a good way):
> > >
> > >          be set at server start. <varname>wal_level</> must be set
> > > -        to <literal>archive</> or <literal>hot_standby</> to allow
> > > -        connections from standby servers.
> > > +        to <literal>archive</>, <literal>hot_standby</> or <literal>logical</>
> > > +        to allow connections from standby servers.
> > >
> > > I think that the previous text meant that you needed archive - or, if
> > > you want to allow connections, hot_standby.  The new text loses that
> > > nuance.
> >
> > Yea, that's because it was lost on me in the first place...
>
> I think that's because the nuance isn't actually in the text - note that
> it is talking about max_wal_senders and talking about connections
> *from*, not *to* standby servers.
> I've reformulated the wal_level paragraph and used "or higher" in
> several places now.
>
> Ok, so here's a rebased version of this. I tried to fix all the issues
> you mentioned, and it's based on the split off IsSystemRelation() patch,
> I've sent yesterday (included here).
>

Hello,

I'm trying to apply the patches but I get some warnings/errors:

$ gunzip -c /home/fabrizio/Downloads/0002-wal_decoding-Add-wal_level-logical-and-log-data-requ.patch.gz | git apply -
warning: src/backend/access/transam/xlog.c has type 100755, expected 100644

$ gunzip -c /home/fabrizio/Downloads/0005-wal_decoding-Introduce-wal-decoding-via-catalog-time.patch.gz | git apply -
warning: src/backend/access/transam/xlog.c has type 100755, expected 100644

$ gunzip -c /home/fabrizio/Downloads/0006-wal_decoding-Implement-VACUUM-FULL-CLUSTER-support-v.patch.gz | git apply -
warning: src/backend/access/transam/xlog.c has type 100755, expected 100644

$ gunzip -c /home/fabrizio/Downloads/0007-wal_decoding-Only-peg-the-xmin-horizon-for-catalog-t.patch.gz | git apply -
warning: src/backend/access/transam/xlog.c has type 100755, expected 100644

$ gunzip -c /home/fabrizio/Downloads/0011-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz | git apply -
error: patch failed: src/bin/pg_basebackup/streamutil.c:210
error: src/bin/pg_basebackup/streamutil.c: patch does not apply

The others apply correctly. The permission warning must be fixed and 0011 must be rebased.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello

Re: logical changeset generation v6.7

From
Robert Haas
Date:
On Thu, Nov 14, 2013 at 8:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-12 18:50:33 +0100, Andres Freund wrote:
>> > You've actually changed the meaning of this section (and not in a good way):
>> >
>> >          be set at server start. <varname>wal_level</> must be set
>> > -        to <literal>archive</> or <literal>hot_standby</> to allow
>> > -        connections from standby servers.
>> > +        to <literal>archive</>, <literal>hot_standby</> or <literal>logical</>
>> > +        to allow connections from standby servers.
>> >
>> > I think that the previous text meant that you needed archive - or, if
>> > you want to allow connections, hot_standby.  The new text loses that
>> > nuance.
>>
>> Yea, that's because it was lost on me in the first place...
>
> I think that's because the nuance isn't actually in the text - note that
> it is talking about max_wal_senders and talking about connections
> *from*, not *to* standby servers.
> I've reformulated the wal_level paragraph and used "or higher" in
> several places now.
>
> Ok, so here's a rebased version of this. I tried to fix all the issues
> you mentioned, and it's based on the split off IsSystemRelation() patch,
> I've sent yesterday (included here).

OK, I've committed the patch to adjust the definition of
IsSystemRelation()/IsSystemClass() and add
IsCatalogRelation()/IsCatalogClass().  I kibitzed your decision about
which function to use in a few places - specifically, I made all of
the places that cared about allow_system_table_mods use the IsSystem
functions, and all the places that cared about invalidation messages
use the IsCatalog functions.  I don't think any of these changes are
more than cosmetic, but I think it may reduce the chance of errors or
inconsistencies in the face of future changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.7

From
Robert Haas
Date:
On Thu, Nov 14, 2013 at 8:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> [ new patches ]

Here's an updated version of patch #2.  I didn't really like the
approach you took in the documentation, so I revised it.

Apart from that, I spent a lot of time looking at
HeapSatisfiesHOTandKeyUpdate.  I'm not very happy with your changes.
The idea seems to be that we'll iterate through all of the HOT columns
regardless, but that might be very inefficient.  Suppose there are 100
HOT columns, the last one is the only key column, and only the first
one has been modified.  Once we look at #1 and determine that it's not
HOT, we should zoom forward and skip over the next 98, and only look
at the last one; your version does not behave like that.

I think there's also some confusion in your version about what ends up
in the attnum values: they're normally adjusted by
FirstLowInvalidHeapAttributeNumber, but when bms_first_member returns
-1 then they're not.  But that's not a great thing, because -1 is
actually a valid attribute number.  I've taken a crack at rewriting
this logic, and the result looks cleaner and simpler to me, but I
haven't tested it beyond the fact that it passes make check.  See what
you think.

I haven't completely reviewed every bit of this in depth yet, but it's
1:15am, so I'm going to post what I have and throw in the towel for
tonight.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: logical changeset generation v6.7

From
Kyotaro HORIGUCHI
Date:
Hello, This is rather trivial and superficial comments as not
fully gripping functions of this patchset.

- Some patches have line offset to master. Rebase needed.

Other random comments follows,

===== 0001:

 - You assined HeapTupleGetOid(tuple) into relid to read in
   several points but no modification. Nevertheless, you left
   HeapTupleGetOid not replaced there. I think 'relid' just for
   repeated reading has far small merit compared to demerit of
   lowering readability. You'd be better to make them uniform in
   either way.

===== 0002:

 - You are identifying the wal_level with the expr 'wal_level >=
   WAL_LEVEL_LOGICAL' but it seems somewhat improper.

 - In heap_insert, you added following comment and code,

   >     * Also, if this is a catalog, we need to transmit combocids to
   >     * properly decode, so log that as well.
   >     */
   >    need_tuple_data = RelationIsLogicallyLogged(relation);
   >    if (RelationIsAccessibleInLogicalDecoding(relation))
   >        log_heap_new_cid(relation, heaptup);

   Actually 'is a catalog' is checkied in
   RelationIsAcc...Decodeing() but this either of naming or
   commnet should be changed for consistent look. (I this the
   name of the macro is rather long but gives a vague
   illustration of the function..)

 - RelationIsAccessibleInLogicalDecoding and
   RelationIsLogicallyLogged are identical in code. Together with
   the name ambiguity, this is quite confising and cause of
   future misuse between these macros, I suppose. Plus the names
   seem too long.

 - In heap_insert, the information conveyed on rdata[3] seems to
   be better to be in rdata[2] because of the scarecity of the
   additional information. XLOG_HEAP_CONTAINS_NEW_TUPLE only
   seems to be needed. Also is in heap_multi_insert and
   heap_upate.

 - In heap_multi_insert, need_cids referred only once so might be
   better removed.

 - In heap_delete, at the point adding replica identity, same to
   the aboves, rdata[3] could be moved into rdata[2] making new
   type like 'xl_heap_replident'.

 - In heapam_xlog.h, the new type xl_heap_header_len is
   inadequate in both of naming which is confising and
   construction on which the header in xl_heap_header is no
   longer be a header and indecisive member name 't_len'..

 - In heapam_xlog.h, XLOG_HEAP_CONTAINS_OLD looks incomplete. And
   it seems to be used in nowhere in this patchset. It should be
   removed.

 - log_heap_new_cid() is called at several part just before other
   xlogs is being inserted. I suppose this should be built in the
   target xlog structures.

 - in RecovoerPreparedTransactions(), any commend needed for the
   reason calling XLogLogicalInfoActive()..

 - In xact.c, the comment for the member 'didLogXid' in
   TransactionStateData seems differ from it's meaning. It
   becomes true when any WAL record for the current transaction
   id just has been written to WAL buffer. So the comment,

   > /* has xid been included in WAL record? */

   would be better be something like (Should need corrected as
   I'm not native speaker.)

    /* Any WAL record for this transaction has been emitted ? */

   and also the member name should be something like
   XidIsLogged. (Not so chaned?)

 - The name of the function MarkCurrentTransactionIdLoggedIfAny,
   although irregular abbreviations are discouraged, seems too
   long. Isn't MarkCur(r/rent)XidLoggedIfAny sufficient?  Anyway,
   the work involving this function seems would be better to be
   done in some other way..

 - The comment for RelationGetIndexAttrBitmap() should be edited
   for attrKind.

 - The macro name INDEX_ATTR_BITMAP_KEY should be
   INDEX_ATTR_BITMAP_FKEY. And INDEX_ATTR_BITMAP_IDENTITY_KEY
   should be INDEX_ATTR_BITMAP_REPLID_KEY or something.

 - In rel.h the member name 'rd_idattr' would be better being
   'rd_replidattr' or something like that.

===== 0004:

 - Could the macro name 'RelationIsUsedAsCatalogTable' be as
   simple as IsUserCatalogRelation or something? It's from the
   viewpoint of not only simplicity but also similarity to other
   macro and/or functions having closer functionality. You
   already call the table 'user_catalog_table' in rel.h.
 

To be continued to next mail.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical changeset generation v6.7

From
Kyotaro HORIGUCHI
Date:
Hello, this is continued comments.

> ===== 0004:
....
> To be continued to next mail.


===== 0005:

 - In heapam.c, it seems to be better replacing t_self only
   during logical decoding.

 - In GetOldestXmin(), the parameter name 'alreadyLocked' would
   be better being simplly 'nolock' since alreadyLocked seems to
   me suggesting that it will unlock the lock acquired
   beforehand.

 - Before that, In LogicalDecodingAcquireFreeSlot, lock window
   for procarray is extended after GetOldestXmin, but procarray
   does not seem to be accessed during the additional period. If
   you are holding this lock for the purpose other than accessing
   procArray, it'd be better to provide its own lock object.

   > LWLockAcquire(ProcArrayLock, LW_SHARED);
   > slot->effective_xmin = GetOldestXmin(true, true, true);
   > slot->xmin = slot->effective_xmin;
   >
   > if (!TransactionIdIsValid(LogicalDecodingCtl->xmin) ||
   >     NormalTransactionIdPrecedes(slot->effective_xmin, LogicalDecodingCtl->xmin))
   >     LogicalDecodingCtl->xmin = slot->effective_xmin;
   > LWLockRelease(ProcArrayLock);

 - In dropdb in dbcommands.c, ereport is invoked referring the
   result of LogicalDecodingCountDBSlots. But it seems better to
   me issueing this exception within LogicalDecodingCountDBSlots
   even if goto is required.

 - In LogStandbySnapshot in standby.c, two complementary
   conditions are imposed on two same unlocks. It might be
   somewhat paranoic but it is safer being like follows,

   | XLogRecPtr  recptr = InvalidXLogRecPtr;
   | ....
   |
   | /* LogCurrentRunningXacts shoud be done before unlock when logical decoding */
   | if (wal_level >= WAL_LEVEL_LOGICAL)
   |    recptr = LogCurrentRunningXacts(running);
   |
   | LWLockRelease(ProcArrayLock);
   |
   | if (recptr == InvalidXLogRecPtr)
   |    recptr = LogCurrentRunningXacts(running);

 - In tqual.c, in Setup/RevertFrom DecodingSnapshots, the name
   CatalogSnapshotData looks lacking unity with other
   Snapshot*Data's.

===== 0007:

 - In heapam.c, the new global variable 'RecentGlobalDataXmin' is
   quite similar to 'RecentGlobalXmin' and does not represents
   what it is. The name should be
   changed. RecentGlobalNonCatalogXmin would be more preferable..

 - Althgough simplly from my teste, the following part in
   heapam.c

    > if (IsSystemRelation(scan->rs_rd)
    >     || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
    >     heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
    > else
    >     heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);

   would be readable to be like,

    > if (IsSystemRelation(scan->rs_rd)
    >     || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
    >   xmin = RecentGlobalXmin
    > else
    >   xmin = RecentGlobalDataXmin
    >     heap_page_prune_opt(scan->rs_rd, buffer, xmin);

    index_fetch_heap in indexam.c has similar part to this, and
    you coded in latter style in IndexBuildHeapScan in index.c.

 - In procarray.c, you should add documentation for new parameter
   'systable' for GetOldestXmin. And the name 'systable' seems
   somewhat confusing, since its full-splled meaning is
   'including systables'. This name should be changed to
   'include_systable' or 'only_usertable' with inverting or
   something..
 

0008 and after to come later..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical changeset generation v6.7

From
Andres Freund
Date:
On 2013-11-29 01:16:39 -0500, Robert Haas wrote:
> On Thu, Nov 14, 2013 at 8:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > [ new patches ]
> 
> Here's an updated version of patch #2.  I didn't really like the
> approach you took in the documentation, so I revised it.

Fair enough.

> Apart from that, I spent a lot of time looking at
> HeapSatisfiesHOTandKeyUpdate.  I'm not very happy with your changes.
> The idea seems to be that we'll iterate through all of the HOT columns
> regardless, but that might be very inefficient.  Suppose there are 100
> HOT columns, the last one is the only key column, and only the first
> one has been modified.  Once we look at #1 and determine that it's not
> HOT, we should zoom forward and skip over the next 98, and only look
> at the last one; your version does not behave like that.

Well, the hot bitmap will only contain indexed columns, so that's
only going to happen if there are actually indexes over all those
columns. And in that case it seems unlikely that the performance
of that routine matters.
That said, keeping the old performance characteristics seems like a good
idea to me. Not sure anymore why I changed it that way.

>  I've taken a crack at rewriting
> this logic, and the result looks cleaner and simpler to me, but I
> haven't tested it beyond the fact that it passes make check.  See what
> you think.

Hm. I think it actually will not abort early in all cases either, but
that looks fixable. Imagine what happens if id_attrs or key_attrs is
empty, ISTM that we'll check all hot columns in that case.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.7

From
Andres Freund
Date:
Hi,

On 2013-11-28 21:15:18 -0500, Robert Haas wrote:
> OK, I've committed the patch to adjust the definition of
> IsSystemRelation()/IsSystemClass() and add
> IsCatalogRelation()/IsCatalogClass().

Thanks for taking care of this!

>  I kibitzed your decision about
> which function to use in a few places - specifically, I made all of
> the places that cared about allow_system_table_mods use the IsSystem
> functions, and all the places that cared about invalidation messages
> use the IsCatalog functions.  I don't think any of these changes are
> more than cosmetic, but I think it may reduce the chance of errors or
> inconsistencies in the face of future changes.

Agreed.

Do you think we need to do anything about the
ERROR:  cannot remove dependency on schema pg_catalog because it is a system object
thingy? Imo the current state is much more consistent than the earlier
one, but that's still a quite surprising leftover...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.7

From
Robert Haas
Date:
On Tue, Dec 3, 2013 at 8:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-28 21:15:18 -0500, Robert Haas wrote:
>> OK, I've committed the patch to adjust the definition of
>> IsSystemRelation()/IsSystemClass() and add
>> IsCatalogRelation()/IsCatalogClass().
>
> Thanks for taking care of this!
>
>>  I kibitzed your decision about
>> which function to use in a few places - specifically, I made all of
>> the places that cared about allow_system_table_mods use the IsSystem
>> functions, and all the places that cared about invalidation messages
>> use the IsCatalog functions.  I don't think any of these changes are
>> more than cosmetic, but I think it may reduce the chance of errors or
>> inconsistencies in the face of future changes.
>
> Agreed.
>
> Do you think we need to do anything about the
> ERROR:  cannot remove dependency on schema pg_catalog because it is a system object
> thingy? Imo the current state is much more consistent than the earlier
> one, but that's still a quite surprising leftover...

I don't feel obliged to change it, but I also don't see a reason not
to clean it up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.7

From
Robert Haas
Date:
On Tue, Dec 3, 2013 at 8:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>  I've taken a crack at rewriting
>> this logic, and the result looks cleaner and simpler to me, but I
>> haven't tested it beyond the fact that it passes make check.  See what
>> you think.
>
> Hm. I think it actually will not abort early in all cases either, but
> that looks fixable. Imagine what happens if id_attrs or key_attrs is
> empty, ISTM that we'll check all hot columns in that case.

Yeah, you're right.  I think the current logic will terminate when all
flags are set to false or all attribute numbers have been checked, but
it doesn't know that if HOT's been disproven then we needn't consider
further HOT columns.  I think the way to fix that is to tweak this
part:

+               if (next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
                        check_now = next_hot_attnum;
+               else if (next_key_attnum > FirstLowInvalidHeapAttributeNumber)
+                       check_now = next_key_attnum;
+               else if (next_id_attnum > FirstLowInvalidHeapAttributeNumber)
+                       check_now = next_id_attnum;
                else
+                       break;

What I think we ought to do there is change each of those criteria to
say if (hot_result && next_hot_attnum >
FirstLowInvalidHeapAttributeNumber) and similarly for the other two.
That way we consider each set a valid source of attribute numbers only
until the result flag for that set flips false.
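
Spelled out, the picking step would then look about like this (sketch
only; I'm assuming the other two flags are called key_result and
id_result):

    for (;;)
    {
        int     check_now;

        /*
         * The HOT attributes are a superset of the key attributes, which
         * in turn are a superset of the identity attributes, so taking
         * attnums from a set only while its result flag is still true
         * always picks the next column that needs checking.
         */
        if (hot_result && next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
            check_now = next_hot_attnum;
        else if (key_result && next_key_attnum > FirstLowInvalidHeapAttributeNumber)
            check_now = next_key_attnum;
        else if (id_result && next_id_attnum > FirstLowInvalidHeapAttributeNumber)
            check_now = next_id_attnum;
        else
            break;

        /* ... compare old and new values of attribute check_now, clear the
         * affected flags if it changed, and advance the sets that
         * contained check_now ... */
    }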

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.7

From
Andres Freund
Date:
Hi,

On 2013-12-03 17:13:05 +0900, Kyotaro HORIGUCHI wrote:
> - Some patches have line offset to master. Rebase needed.

Will send the rebased version as soon as I've addressed your comments.

> ===== 0001:
> 
>  - You assined HeapTupleGetOid(tuple) into relid to read in
>    several points but no modification. Nevertheless, you left
>    HeapTupleGetOid not replaced there. I think 'relid' just for
>    repeated reading has far small merit compared to demerit of
>    lowering readability. You'd be better to make them uniform in
>    either way.

It's primarily to get the line lengths halfway under control.

> ===== 0002:
> 
>  - You are identifying the wal_level with the expr 'wal_level >=
>    WAL_LEVEL_LOGICAL' but it seems somewhat improper.

Hm. Why?

>  - RelationIsAccessibleInLogicalDecoding and
>    RelationIsLogicallyLogged are identical in code. Together with
>    the name ambiguity, this is quite confising and cause of
>    future misuse between these macros, I suppose. Plus the names
>    seem too long.

Hm, don't think they are equivalent, rather the contrary. Note one
returns false if IsCatalogRelation(), the other true.
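
Spelled out, the two macros are roughly (paraphrasing the patch,
conditions abbreviated):

    /* does this relation need extra WAL logging so it can be read from a
     * decoding snapshot? -> only (user) catalog tables */
    #define RelationIsAccessibleInLogicalDecoding(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))

    /* do changes to this relation show up in the decoded changestream?
     * -> everything *except* catalog tables */
    #define RelationIsLogicallyLogged(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         !IsCatalogRelation(relation))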

>  - In heap_insert, the information conveyed on rdata[3] seems to
>    be better to be in rdata[2] because of the scarecity of the
>    additional information. XLOG_HEAP_CONTAINS_NEW_TUPLE only
>    seems to be needed. Also is in heap_multi_insert and
>    heap_upate.

Could you explain a bit more what you mean by that? The reason it's a
separate rdata entry is that otherwise a full page write will remove the
information.
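
The mechanics, roughly (illustrative only, not the patch's exact rdata
layout): with the 9.3-era XLogInsert() interface an XLogRecData element
that points at a buffer may be dropped from the record once a full-page
image of that buffer is logged, while an element with buffer =
InvalidBuffer is always written out. So the data that decoding depends on
goes into its own element along these lines:

    /*
     * illustrative sketch: attach the tuple data in a chain element that
     * has no associated buffer, so a full-page image of the heap page can
     * never cause it to be dropped from the record
     */
    rdata[3].data = (char *) heaptup->t_data;
    rdata[3].len = heaptup->t_len;
    rdata[3].buffer = InvalidBuffer;
    rdata[3].buffer_std = false;
    rdata[3].next = NULL;
    rdata[2].next = &(rdata[3]);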

>  - In heap_multi_insert, need_cids referred only once so might be
>    better removed.

It's accessed in a loop over potentially quite some items, that's why I
moved it into an extra variable.

>  - In heap_delete, at the point adding replica identity, same to
>    the aboves, rdata[3] could be moved into rdata[2] making new
>    type like 'xl_heap_replident'.

Hm. I don't think that'd be a good idea, because we'd then need special
case decoding code for deletes because the wal format would be different
for inserts/updates and deletes.

>  - In heapam_xlog.h, the new type xl_heap_header_len is
>    inadequate in both of naming which is confising and
>    construction on which the header in xl_heap_header is no
>    longer be a header and indecisive member name 't_len'..

The "header" bit in the name refers to the fact that it's containing
information about a HeapTuple's header, not that it's a header
itself. Do you have a better suggestion than xl_heap_header_len?

>  - In heapam_xlog.h, XLOG_HEAP_CONTAINS_OLD looks incomplete. And
>    it seems to be used in nowhere in this patchset. It should be
>    removed.

Not sure what you mean with incomplete? It contains both possible
variants for an old contained tuple. The macro is used in the decoding,
but I don't think things get clearer if we revise the macros in that
later patch.

>  - log_heap_new_cid() is called at several part just before other
>    xlogs is being inserted. I suppose this should be built in the
>    target xlog structures.

Proportionally it will only be logged in an absolute minority of the
cases (since normally the catalog will only seldomly be updated in
comparison to a user's tables), so it doesn't seem like a good idea to
complicate the already *horribly* complicated wal format for heap_*.

>  - in RecovoerPreparedTransactions(), any commend needed for the
>    reason calling XLogLogicalInfoActive()..

It's pretty much the "Test here must match one used in
AssignTransactionId()" comment. We only want to allow overwriting if
AssignTransactionId() might already have done the SubTransSetParent()
calls.

>  - In xact.c, the comment for the member 'didLogXid' in
>    TransactionStateData seems differ from it's meaning. It
>    becomes true when any WAL record for the current transaction
>    id just has been written to WAL buffer. So the comment,
> 
>    > /* has xid been included in WAL record? */
> 
>    would be better be something like (Should need corrected as
>    I'm not native speaker.)

>     /* Any WAL record for this transaction has been emitted ? */

I don't think that'd be an improvement, transaction is a bit ambiguous
there because it might be the toplevel or subtransaction.

>    and also the member name should be something like
>    XidIsLogged. (Not so chaned?)

Hm.

>  - The name of the function MarkCurrentTransactionIdLoggedIfAny,
>    although irregular abbreviations are discouraged, seems too
>    long. Isn't MarkCur(r/rent)XidLoggedIfAny sufficient?

If you look at the other names in xact.h that doesn't seem to fit too
well in the naming pattern.

> Anyway,
>    the work involving this function seems would be better to be
>    done in some other way..

Why? How?

>  - The comment for RelationGetIndexAttrBitmap() should be edited
>    for attrKind.

Good point.

>  - The macro name INDEX_ATTR_BITMAP_KEY should be
>    INDEX_ATTR_BITMAP_FKEY. And INDEX_ATTR_BITMAP_IDENTITY_KEY
>    should be INDEX_ATTR_BITMAP_REPLID_KEY or something.

But INDEX_ATTR_BITMAP_KEY isn't just about foreign keys... But I agree
that INDEX_ATTR_BITMAP_IDENTITY_KEY should be renamed.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.7

From
Andres Freund
Date:
On 2013-12-03 19:15:53 +0900, Kyotaro HORIGUCHI wrote:
>  - In heapam.c, it seems to be better replacing t_self only
>    during logical decoding.

I don't see what'd be gained by that except make the test matrix bigger
at no gain.

>  - Before that, In LogicalDecodingAcquireFreeSlot, lock window
>    for procarray is extended after GetOldestXmin, but procarray
>    does not seem to be accessed during the additional period. If
>    you are holding this lock for the purpose other than accessing
>    procArray, it'd be better to provide its own lock object.

The comment above the part explains the reason:
/*
 * Acquire the current global xmin value and directly set the logical xmin
 * before releasing the lock if necessary. We do this so wal decoding is
 * guaranteed to have all catalog rows produced by xacts with an xid >
 * walsnd->xmin available.
 *
 * We can't use ComputeLogicalXmin here as that acquires ProcArrayLock
 * separately which would open a short window for the global xmin to
 * advance above walsnd->xmin.
 */
 

>  - In dropdb in dbcommands.c, ereport is invoked referring the
>    result of LogicalDecodingCountDBSlots. But it seems better to
>    me issueing this exception within LogicalDecodingCountDBSlots
>    even if goto is required.

What if LogicalDecodingCountDBSlots() is needed in other places? That
seems like a layering violation to me.

>  - In LogStandbySnapshot in standby.c, two complementary
>    conditions are imposed on two same unlocks. It might be
>    somewhat paranoic but it is safer being like follows,

I don't see an advantage in that.

>  - In tqual.c, in Setup/RevertFrom DecodingSnapshots, the name
>    CatalogSnapshotData looks lacking unity with other
>    Snapshot*Data's.

That part needs a bit of work, agreed.

> ===== 0007:
> 
>  - In heapam.c, the new global variable 'RecentGlobalDataXmin' is
>    quite similar to 'RecentGlobalXmin' and does not represents
>    what it is. The name should be
>    changed. RecentGlobalNonCatalogXmin would be more preferable..

Hm. It's a mighty long name... but it indeed is clearer.

>  - Althgough simplly from my teste, the following part in
>    heapam.c
> 
>     > if (IsSystemRelation(scan->rs_rd)
>     >     || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
>     >     heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
>     > else
>     >     heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);
> 
>    would be readable to be like,
> 
>     > if (IsSystemRelation(scan->rs_rd)
>     >     || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
>     >   xmin = RecentGlobalXmin
>     > else
>     >   xmin = RecentGlobalDataXmin
>     >     heap_page_prune_opt(scan->rs_rd, buffer, xmin);

Well, it requires introducing a new variable (which better not be named
xmin, but OldestXmin or similar). But I don't really care.
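
I.e. something like (sketch):

    TransactionId   OldestXmin;

    if (IsSystemRelation(scan->rs_rd) ||
        RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
        OldestXmin = RecentGlobalXmin;
    else
        OldestXmin = RecentGlobalDataXmin;

    heap_page_prune_opt(scan->rs_rd, buffer, OldestXmin);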

>     index_fetch_heap in indexam.c has similar part to this, and
>     you coded in latter style in IndexBuildHeapScan in index.c.

It's different there, because we do an explicit GetOldestXmin() call
there which we surely don't want to do twice.

> 0008 and after to come later..

Thanks for your review!

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.7

From
Kyotaro HORIGUCHI
Date:
Hello, this is cont'd comments.

> 0008 and after to come later..

I had nothing to comment for patch 0008.

===== 0009:

 - In repl_scanner.l, you omitted double-doublequote handling for
   replication but it should be implemented. Zero-length
   identifier check might be needed depending on the upper-layer.

 - In walsender.c, the log messages "Initiating logical rep.."
   and "Starting logical replication.." should be INFO or LOG in
   loglevel, not WARNING. And 'rep' in the former message would
   be better not abbreviated since not done so in the latter.

 - In walsender.c, StartLogicalReplication seems trying to abort
   itself for timeline change. But timeline changes in 9.3+ don't
   need such an aid. You'd better consult StartReplication in
   current master for detail. There might be other defferences.

 - In walsender.c, the typedef name WalSndSendData doesn't seem
   to be a function pointer. I suppose passing bare function
   pointer to WanSndLoop and WalSndDone is not a good deed. It'd
   be better to wrap it in any struct for callback, say,
   LogicalDecodingContext. It'd be even better if it could be a
   common struct with 'physycal' replication.

 - In walsender.c, I wonder if the differences are necessary
   between logical and physical replication in fetching latest
   WALs, construction of WAL sending loop and so on .. Logical
   walsender seems to be implimentated in somewhat ad-hoc way on
   the whole. I belive it could be more commonize in the base
   structure.

 - In procarray.c, the added two includes which is not
   accompanied by any other modification are needless. make emits
   no error or warning without them.
 

...Time's up. It'll be continued for later from 0010..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical changeset generation v6.7

From
Andres Freund
Date:
On 2013-12-04 17:31:50 +0900, Kyotaro HORIGUCHI wrote:
> ===== 0009:
>
>  - In repl_scanner.l, you omitted double-doublequote handling for
>    replication but it should be implemented. Zero-length
>    identifier check might be needed depending on the upper-layer.

I am not sure what you mean here. IDENT can be double quoted, and so can
the option names?

>  - In walsender.c, the log messages "Initiating logical rep.."
>    and "Starting logical replication.." should be INFO or LOG in
>    loglevel, not WARNING. And 'rep' in the former message would
>    be better not abbreviated since not done so in the latter.

Agreed.

>  - In walsender.c, StartLogicalReplication seems trying to abort
>    itself for timeline change. But timeline changes in 9.3+ don't
>    need such an aid. You'd better consult StartReplication in
>    current master for detail. There might be other defferences.

Timeline increases currently need work, yes, that error message is the
smallest part...

>  - In walsender.c, the typedef name WalSndSendData doesn't seem
>    to be a function pointer. I suppose passing bare function
>    pointer to WanSndLoop and WalSndDone is not a good deed. It'd
>    be better to wrap it in any struct for callback, say,
>    LogicalDecodingContext. It'd be even better if it could be a
>    common struct with 'physycal' replication.

I don't see that as being realistic/advantageous. Wrapping a function
pointer in a struct doesn't improve anything in itself.

I was thinking we might want to just decouple the entire event loop and
not reuse that code, but that's ugly as well.

>  - In walsender.c, I wonder if the differences are necessary
>    between logical and physical replication in fetching latest
>    WALs, construction of WAL sending loop and so on .. Logical
>    walsender seems to be implimentated in somewhat ad-hoc way on
>    the whole. I belive it could be more commonize in the base
>    structure.

That's because the xlogreader.h interface - over my loud protests -
doesn't support chunk-wise reading of the WAL stream and necessitates
blocking inside the reader callback. So the event loop needs to be in
several individual functions (WalSndLoop, WalSndWaitForWal,
WalSndWriteData) instead of once in WalSndLoop…

>  - In procarray.c, the added two includes which is not
>    accompanied by any other modification are needless. make emits
>    no error or warning without them.

Right. Will remove.

Thanks,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
Andres Freund
Date:
On 2013-12-03 15:19:26 -0500, Robert Haas wrote:
> Yeah, you're right.  I think the current logic will terminate when all
> flags are set to false or all attribute numbers have been checked, but
> it doesn't know that if HOT's been disproven then we needn't consider
> further HOT columns.  I think the way to fix that is to tweak this
> part:
>
> +               if (next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
>                         check_now = next_hot_attnum;
> +               else if (next_key_attnum > FirstLowInvalidHeapAttributeNumber)
> +                       check_now = next_key_attnum;
> +               else if (next_id_attnum > FirstLowInvalidHeapAttributeNumber)
> +                       check_now = next_id_attnum;
>                 else
> +                       break;
>
> What I think we ought to do there is change each of those criteria to
> say if (hot_result && next_hot_attnum >
> FirstLowInvalidHeapAttributeNumber) and similarly for the other two.
> That way we consider each set a valid source of attribute numbers only
> until the result flag for that set flips false.

That seems to work well, yes.

Updated & rebased series attached.

* Rebased since the former patch 01 has been applied
* Lots of smaller changes in the wal_level=logical patch
  * Use Robert's version of wal_level=logical, with the above fixes
  * Use only macros for RelationIsAccessibleInLogicalDecoding/LogicallyLogged
  * Moved a bit more logic into ExtractReplicaIdentity
  * some comment copy-editing
  * Bug noted by Euler fixed, testcase added
* Some copy editing in later patches, nothing significant.

I've primarily sent this, because I don't know of further required
changes in 0001-0003. I am currently reviewing the other patches in detail
atm.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.7

From
Kyotaro HORIGUCHI
Date:
Hello,

> Will send the rebased version as soon as I've addressed your comments.

Thank you.

> > ===== 0001:
> > 
> >  - You assined HeapTupleGetOid(tuple) into relid to read in
> >    several points but no modification. Nevertheless, you left
> >    HeapTupleGetOid not replaced there. I think 'relid' just for
> >    repeated reading has far small merit compared to demerit of
> >    lowering readability. You'd be better to make them uniform in
> >    either way.
> 
> It's primarily to get the line lengths halfway under control.

Mm. I'm afraid I couldn't catch your words, do you mean that
IsSystemClass() or isTempNamespace() could change the NULL bitmap
in the tuple?

> > ===== 0002:
> > 
> >  - You are identifying the wal_level with the expr 'wal_level >=
> >    WAL_LEVEL_LOGICAL' but it seems somewhat improper.
> 
> Hm. Why?

It actually does no harm and somewhat trifling so I don't assert
you should fix it.

The reason for the comment is the greater values for wal_level
are undefined at the moment, so strictly saying, such values
should be handled as invalid ones. Although there is a practice
to avoid loop overruns by comparing counters with the expression
like (i > CEILING).
For instance, I found a macro for which comment reads as follows
and I feel a bit uneasy with it :-) It's nothing more than that.

| /* Do we need to WAL-log information required only for Hot Standby? */
~~~~
| #define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_HOT_STANDBY)
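
For comparison, the analogous macro added by the patch presumably reads
something like:

    /* Do we need to WAL-log information required only for logical decoding? */
    #define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)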

> >  - RelationIsAccessibleInLogicalDecoding and
> >    RelationIsLogicallyLogged are identical in code. Together with
> >    the name ambiguity, this is quite confising and cause of
> >    future misuse between these macros, I suppose. Plus the names
> >    seem too long.
> 
> Hm, don't think they are equivalent, rather the contrary. Note one
> returns false if IsCatalogRelation(), the other true.

Oops, I'm sorry. I understand they are not same. Then I have
other questions. The name for the first one
'RelationIsAccessibleInLogicalDecoding' doesn't seem representing
what its comment reads.

|  /* True if we need to log enough information to have access via
|      decoding snapshot. */

Making the macro name for this comment directly, I suppose it
would be something like 'NeedsAdditionalInfoInLogicalDecoding' or
more directly 'LogicalDeodingNeedsCids' or so..

> >  - In heap_insert, the information conveyed on rdata[3] seems to
> >    be better to be in rdata[2] because of the scarecity of the
> >    additional information. XLOG_HEAP_CONTAINS_NEW_TUPLE only
> >    seems to be needed. Also is in heap_multi_insert and
> >    heap_upate.
> 
> Could you explain a bit more what you mean by that? The reason it's a
> separate rdata entry is that otherwise a full page write will remove the
> information.

Sorry, I missed the comment 'so that an eventual FPW doesn't
remove the tuple's data'. Although given the necessity of removal
prevention, rewriting rdata[].buffer which is required by design
(correct?)  with InvalidBuffer seems a bit outrageous for me and
obfuscating the objective of it.  Other mechanism should be
preferable, I suppose. The most straight way to do that should be
new flag bit for XLogRecData, say, survive_fpw or something.
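
Purely to illustrate what I mean (survive_fpw is hypothetical and not
part of the patch; the other members are as in the existing struct):

    typedef struct XLogRecData
    {
        char       *data;           /* start of rmgr data to include */
        uint32      len;            /* length of rmgr data to include */
        Buffer      buffer;         /* buffer associated with data, if any */
        bool        buffer_std;     /* buffer has standard pd_lower/pd_upper */
        bool        survive_fpw;    /* hypothetical: keep this data even when
                                     * a full-page image of 'buffer' is logged */
        struct XLogRecData *next;   /* next struct in chain, or NULL */
    } XLogRecData;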

> >  - In heap_multi_insert, need_cids referred only once so might be
> >    better removed.
> 
> It's accessed in a loop over potentially quite some items, that's why I
> moved it into an extra variable.

Sorry bothering you with comments biside the point.. But the
scope of needs_cids is narrower than it is. I think the
definition should be moved into the block for 'if (needwal)'.

> >  - In heap_delete, at the point adding replica identity, same to
> >    the aboves, rdata[3] could be moved into rdata[2] making new
> >    type like 'xl_heap_replident'.
> 
> Hm. I don't think that'd be a good idea, because we'd then need special
> case decoding code for deletes because the wal format would be different
> for inserts/updates and deletes.

Hmm. Although one common xl_heap_replident is impractical,
splitting a logcally single entity into two or more XLogRecDatas
also seems not to be a good idea.

> >  - In heapam_xlog.h, the new type xl_heap_header_len is
> >    inadequate in both of naming which is confising and
> >    construction on which the header in xl_heap_header is no
> >    longer be a header and indecisive member name 't_len'..
> 
> The "header" bit in the name refers to the fact that it's containing
> information about the a HeapTuple's header, not that it's a header
> itself. Do you have a better suggestion than xl_heap_header_len?

Sorry, I'm confused during writing the comment, The order of
members in xl_heap_header_len doesn't matter.  I got the reason
for the xl_header_len and whole xlog record image after
re-reading the relevant code. The update record became to contain
two variable length data by this patch. So the length of the
tuple body cannot be calculated only with whole record length and
header lengths.

Considering above, looking heap_xlog_insert(), you marked on
xlrec.flags with XLOG_HEAP_CONTAINS_NEW_TUPLE to signal decoder
that the record should have tuple data not being removed by fpw.
This is the same for the update record. So the redoer(?) also can
distinguish whether the update record contains extra tuple data
or not.

On the other hand, the update record after patched is longer by
sizeof(uint16) regardless of whether 'tuple data' is attached or
not. I don't know the value of the size in WAL stream, but the
smaller would be better maybe.

As a conclusion, I thing it would be better to decide whether to
insert length SEGMENT before the tuple data segment in
log_heap_update. Like follows,

|   rdata[1].next = &(rdata[2]);
| 
|   xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
|   xlhdr.t_infomask = newtup->t_data->t_infomask;
|   xlhdr.t_hoff = newtup->t_data->t_hoff;
| 
|   /*...*/
|   rdata[2].data = (char *) &xlhdr;
|   ...
|   rdata[2].next = &(rdata[3]);
| 
|   if (need_tuple_data)
|   {
|     uint16 newtupbodylen =
|        newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
|     rdata[3].data = &newtupbodylen;
|     ....
|   }
|   else
|   {
|     rdata[3].data = NULL;
|     rdata[3].len = 0;
|     ...
|   }
|   rdata[3].next = &(rdata[4]);
| 
|   /* PG73FGORMAT: write bitmap [+ padding] [+ oid] + data */
|   rdata[4].data = (char *) newtup->t_data;


> >  - In heapam_xlog.h, XLOG_HEAP_CONTAINS_OLD looks incomplete. And
> >    it seems to be used in nowhere in this patchset. It should be
> >    removed.
> 
> Not sure what you mean with incomplete? It contains the both possible
> variants for an old contained tuple. The macro is used in the decoding,
> but I don't think things get clearer if we revise the macros in that
> later patch.

Umm. I don't understand why I missed where it is used, it surely
used in decode.c as you mentioned. Ok, this is required.  Then
about 'imcomplete', it means 'CONTAINS OLD what?'...mmm,
logically speaking, lack of an object for the word 'OLD'. The
answer for the question should be 'both KEY and TUPLE'. Comparing
the revised images,

# Of course, it doesn't matter if the object for OLD can
# naturally be missing as English. I'm not a native English
# speaker as you know:-)

defining XLOG_HEAP_CONTAINS_OLD_KEY_AND_TUPLE,
|  if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY_AND_TUPLE)
|  {

undefining XLOG_HEAP_CONTAINS_OLD and use separte macros type 1
|  if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY ||
|      xlrec->flags & XLOG_HEAP_CONTAINS_OLD_TUPLE)
|  {
(I belive this should be optimized by the compiler:-)

and type 2
|  if (xlrec->flags &
|      (XLOG_HEAP_CONTAINS_OLD_KEY | XLOG_HEAP_CONTAINS_OLD_TUPLE))
|  {

I'm ok with any of them or others. In this connection, I found
following phrase in heapam.c which like type2 above.

|  if (!(old_infomask & (HEAP_XMAX_INVALID |
|                     HEAP_XMAX_COMMITTED |
|                        HEAP_XMAX_IS_MULTI)) &&


> >  - log_heap_new_cid() is called at several part just before other
> >    xlogs is being inserted. I suppose this should be built in the
> >    target xlog structures.
> 
> Proportionally it will only be logged in a absolute minority of the
> cases (since normally the catalog will only seldomly be updated in
> comparison to a user's tables), so it doesn't seem like a good idea to
> complicate the already *horribly* complicated wal format for heap_*.

Horribly:-) I agree to you. Hmm I reconsider with new
knowledge(!) about your patch. But what do you think doing this
as follows,
 if (RelationIsAccessibleInLogicalDecoding(relation))
+ {
|    rdata_heap_new_cid(&rdata[0], relation, heaptup);
+       xlrec.flags |= XLOG_HEAP_CONTAINS_NEW_CID;
+ }
+ else
+       rdata_void(&rdata[0])
+ rdata[0].next = &(rdata[1]);

  xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
  xlrec.target.node = relation->rd_node;
  xlrec.target.tid = heaptup->t_self;
| rdata[1].data = (char *) &xlrec;

If you don't agree with this, I don't say no more about this.


> >  - in RecovoerPreparedTransactions(), any commend needed for the
> >    reason calling XLogLogicalInfoActive()..
> 
> It's pretty much the "Test here must match one used in
> AssignTransactionId()" comment. We only want to allow overwriting if
> AssignTransactionId() might already have done the SubTransSetParent()
> calls.

Thank you, I found it and it seems to be sufficient.

> >  - In xact.c, the comment for the member 'didLogXid' in
> >    TransactionStateData seems differ from it's meaning. It
> >    becomes true when any WAL record for the current transaction
> >    id just has been written to WAL buffer. So the comment,
> > 
> >    > /* has xid been included in WAL record? */
> > 
> >    would be better be something like (Should need corrected as
> >    I'm not native speaker.)
> 
> >     /* Any WAL record for this transaction has been emitted ? */
> 
> I don't think that'd be an improvement, transaction is a bit ambiguous
> there because it might be the toplevel or subtransaction.

Hmm. Ok, I agree with it.

> >    and also the member name should be something like
> >    XidIsLogged. (Not so chaned?)
> 
> Hm.
> 
> >  - The name of the function MarkCurrentTransactionIdLoggedIfAny,
> >    although irregular abbreviations are discouraged, seems too
> >    long. Isn't MarkCur(r/rent)XidLoggedIfAny sufficient?
> 
> If you look at the other names in xact.h that doesn't seem to fit too
> well in the naming pattern.

(Now looking...) Wow. Ok, I agree to you.

> > Anyway,
> >    the work involving this function seems would be better to be
> >    done in some other way..
> 
> Why? How?

How... It is easy to comment but hard to realize. Ok, let's
forget this comment:-) The 'Why' came from some obscure
impression(?), without firm thought.

> > - The comment for RelationGetIndexAttrBitmap() should be edited
> >    for attrKind.
> 
> Good point.

Thanks.

> >  - The macro name INDEX_ATTR_BITMAP_KEY should be
> >    INDEX_ATTR_BITMAP_FKEY. And INDEX_ATTR_BITMAP_IDENTITY_KEY
> >    should be INDEX_ATTR_BITMAP_REPLID_KEY or something.
> 
> But INDEX_ATTR_BITMAP_KEY isn't just about foreign keys... But I agree
> that INDEX_ATTR_BITMAP_IDENTITY_KEY should be renamed.

Thank you, please remember to do that.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical changeset generation v6.7

From
Andres Freund
Date:
Hi,

On 2013-12-05 22:03:51 +0900, Kyotaro HORIGUCHI wrote:
> > >  - You assined HeapTupleGetOid(tuple) into relid to read in
> > >    several points but no modification. Nevertheless, you left
> > >    HeapTupleGetOid not replaced there. I think 'relid' just for
> > >    repeated reading has far small merit compared to demerit of
> > >    lowering readability. You'd be better to make them uniform in
> > >    either way.
> >
> > It's primarily to get the line lengths halfway under control.
>
> Mm. I'm afraid I couldn't catch your words, do you mean that
> IsSystemClass() or isTempNamespace() could change the NULL bitmap
> in the tuple?

Huh? No. I just meant that the source code lines are longer if you use
"HeapTupleGetOid(tuple)" instead of just "relid". Anway, that patch has
since been committed...

> > > ===== 0002:
> > >
> > >  - You are identifying the wal_level with the expr 'wal_level >=
> > >    WAL_LEVEL_LOGICAL' but it seems somewhat improper.
> >
> > Hm. Why?
>
> It actually does no harm and somewhat trifling so I don't assert
> you should fix it.
>
> The reason for the comment is the greater values for wal_level
> are undefined at the moment, so strictly saying, such values
> should be handled as invalid ones.

Note that other checks for wal_level are written the same way. Consider
how much bigger this patch would be if every usage of wal_level would
need to get changed because a new level had been added.

> > >  - RelationIsAccessibleInLogicalDecoding and
> > >    RelationIsLogicallyLogged are identical in code. Together with
> > >    the name ambiguity, this is quite confising and cause of
> > >    future misuse between these macros, I suppose. Plus the names
> > >    seem too long.
> >
> > Hm, don't think they are equivalent, rather the contrary. Note one
> > returns false if IsCatalogRelation(), the other true.
>
> Oops, I'm sorry. I understand they are not same. Then I have
> other questions. The name for the first one
> 'RelationIsAccessibleInLogicalDecoding' doesn't seem representing
> what its comment reads.
>
> |  /* True if we need to log enough information to have access via
> |      decoding snapshot. */
>
> Making the macro name for this comment directly, I suppose it
> would be something like 'NeedsAdditionalInfoInLogicalDecoding' or
> more directly 'LogicalDeodingNeedsCids' or so..

The comment talks about logging enough information that it is accessible
- just as the name.

> > >  - In heap_insert, the information conveyed on rdata[3] seems to
> > >    be better to be in rdata[2] because of the scarecity of the
> > >    additional information. XLOG_HEAP_CONTAINS_NEW_TUPLE only
> > >    seems to be needed. Also is in heap_multi_insert and
> > >    heap_upate.
> >
> > Could you explain a bit more what you mean by that? The reason it's a
> > separate rdata entry is that otherwise a full page write will remove the
> > information.
>
> Sorry, I missed the comment 'so that an eventual FPW doesn't
> remove the tuple's data'. Although given the necessity of removal
> prevention, rewriting rdata[].buffer which is required by design
> (correct?)  with InvalidBuffer seems a bit outrageous for me and
> obfuscating the objective of it.

Well, it's added in a separate rdata element. Just as in dozens of other
places.

> > >  - In heap_delete, at the point adding replica identity, same to
> > >    the aboves, rdata[3] could be moved into rdata[2] making new
> > >    type like 'xl_heap_replident'.
> >
> > Hm. I don't think that'd be a good idea, because we'd then need special
> > case decoding code for deletes because the wal format would be different
> > for inserts/updates and deletes.
>
> Hmm. Although one common xl_heap_replident is impractical,
> splitting a logcally single entity into two or more XLogRecDatas
> also seems not to be a good idea.

That's done everywhere. There's basically two reasons to use separate
rdata elements:
* the buffers are different
* the data pointer is different

The rdata chain elements don't survive in the WAL.
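
As a sketch of what that means in practice (simplified, pre-9.5 style
XLogRecData usage; not the exact patch code):

    XLogRecData rdata[2];

    rdata[0].data = (char *) &xlrec;            /* fixed-size record header */
    rdata[0].len = SizeOfHeapInsert;
    rdata[0].buffer = InvalidBuffer;            /* never replaced by a full-page image */
    rdata[0].next = &rdata[1];

    rdata[1].data = (char *) heaptup->t_data;   /* tuple payload */
    rdata[1].len = heaptup->t_len;
    rdata[1].buffer = buffer;                   /* may be replaced by an FPW of this buffer */
    rdata[1].next = NULL;

    recptr = XLogInsert(RM_HEAP_ID, info, rdata);

XLogInsert() flattens the chain into one record; the element boundaries are
not recoverable from the WAL afterwards.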

> Considering above, looking heap_xlog_insert(), you marked on
> xlrec.flags with XLOG_HEAP_CONTAINS_NEW_TUPLE to signal decoder
> that the record should have tuple data not being removed by fpw.
> This is the same for the update record. So the redoer(?) also can
> distinguish whether the update record contains extra tuple data
> or not.

But it doesn't know the length of the individual records, so knowing
there are two doesn't help.

> On the other hand, the update record after patched is longer by
> sizeof(uint16) regardless of whether 'tuple data' is attached or
> not. I don't know the value of the size in WAL stream, but the
> smaller would be better maybe.

I think that'd make things too complicated without too much gain in
comparison to the savings.

> # Of course, it doesn't matter if the object for OLD can
> # naturally be missing as English.

Well, I think the context makes it clear enough.

> I'm not a native English speaker as you know:-)

Neither am I ;). 

> undefining XLOG_HEAP_CONTAINS_OLD and use separte macros type 1
> |  if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY ||
> |      xlrec->flags & XLOG_HEAP_CONTAINS_OLD_TUPLE)
> |  {
> (I belive this should be optimized by the compiler:-)

It's not about efficiency, imo the other variant looks clearer. And will
continue to work if we add the option to selectively log columns or
such.
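
For reference, the variant being defended is roughly (sketch):

    /* convenience macro covering both "old tuple" flavours */
    #define XLOG_HEAP_CONTAINS_OLD \
        (XLOG_HEAP_CONTAINS_OLD_TUPLE | XLOG_HEAP_CONTAINS_OLD_KEY)

    if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD)
    {
        /* the record carries some form of the old tuple, however much was logged */
    }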

>   if (RelationIsAccessibleInLogicalDecoding(relation))
> + {
> |    rdata_heap_new_cid(&rdata[0], relation, heaptup);
> +       xlrec.flags |= XLOG_HEAP_CONTAINS_NEW_CID;
> + }
> + else
> +       rdata_void(&rdata[0])
> + rdata[0].next = &(rdata[1]);
>
>   xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
>   xlrec.target.node = relation->rd_node;
>   xlrec.target.tid = heaptup->t_self;
> | rdata[1].data = (char *) &xlrec;
>
> If you don't agree with this, I don't say no more about this.

It's a lot more complex than that. This throws off all kinds of size
calculations, and it makes the redo functions more complex - and they
are much more often executed for !CONTAINS_NEW_CID.

> > >  - The macro name INDEX_ATTR_BITMAP_KEY should be
> > >    INDEX_ATTR_BITMAP_FKEY. And INDEX_ATTR_BITMAP_IDENTITY_KEY
> > >    should be INDEX_ATTR_BITMAP_REPLID_KEY or something.
> >
> > But INDEX_ATTR_BITMAP_KEY isn't just about foreign keys... But I agree
> > that INDEX_ATTR_BITMAP_IDENTITY_KEY should be renamed.
>
> Thank you, please remember to do that.

I am not sure it's a good idea anymore, it doesn't really seem to
increase clarity...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.7

From
Kyotaro HORIGUCHI
Date:
Hello, sorry for annoying you with meaningless questions.
Your explanation made it far clearer to me.

This will be my last message on this patch.

> On 2013-12-05 22:03:51 +0900, Kyotaro HORIGUCHI wrote:
> > > >  - You assined HeapTupleGetOid(tuple) into relid to read in
> > > >    several points but no modification. Nevertheless, you left
> > > >    HeapTupleGetOid not replaced there. I think 'relid' just for
> > > >    repeated reading has far small merit compared to demerit of
> > > >    lowering readability. You'd be better to make them uniform in
> > > >    either way.
> > >
> > > It's primarily to get the line lengths halfway under control.
> >
> > Mm. I'm afraid I couldn't catch your words, do you mean that
> > IsSystemClass() or isTempNamespace() could change the NULL bitmap
> > in the tuple?
>
> Huh? No. I just meant that the source code lines are longer if you  use
> "HeapTupleGetOid(tuple)" instead of just "relid". Anway, that patch  has
> since been committed...

Really? Sorry for annoying you. Thank you, I understand that.

> > > > ===== 0002:
> > > >
> > > >  - You are identifying the wal_level with the expr 'wal_level >=
> > > >    WAL_LEVEL_LOGICAL' but it seems somewhat improper.
> > >
> > > Hm. Why?
> >
> > It actually does no harm and somewhat trifling so I don't assert
> > you should fix it.
> >
> > The reason for the comment is the greater values for wal_level
> > are undefined at the moment, so strictly saying, such values
> > should be handled as invalid ones.
>
> Note that other checks for wal_level are written the same
> way. Consider how much bigger this patch would be if every
> usage of wal_level would need to get changed because a new
> level had been added.

I know the objective. But it is not obvious that a future value
will need that handling. Anyway, we never know until it actually
comes, so I called it trifling.


> > > >  - RelationIsAccessibleInLogicalDecoding and
> > > >    RelationIsLogicallyLogged are identical in code. Together with
> > > >    the name ambiguity, this is quite confising and cause of
> > > >    future misuse between these macros, I suppose. Plus the names
> > > >    seem too long.
> > >
> > > Hm, don't think they are equivalent, rather the contrary. Note one
> > > returns false if IsCatalogRelation(), the other true.
> >
> > Oops, I'm sorry. I understand they are not same. Then I have
> > other questions. The name for the first one
> > 'RelationIsAccessibleInLogicalDecoding' doesn't seem representing
> > what its comment reads.
> >
> > |  /* True if we need to log enough information to have access via
> > |      decoding snapshot. */
> >
> > Making the macro name for this comment directly, I suppose it
> > would be something like 'NeedsAdditionalInfoInLogicalDecoding' or
> > more directly 'LogicalDeodingNeedsCids' or so..
> 
> The comment talks about logging enough information that it is accessible
> - just as the name.

Though I'm not fully convinced of that. But since it also seems not so
wrong and you say so, I'll pretend to be convinced :-p

> > > >  - In heap_insert, the information conveyed on rdata[3] seems to
> > > >    be better to be in rdata[2] because of the scarecity of the
> > > >    additional information. XLOG_HEAP_CONTAINS_NEW_TUPLE only
> > > >    seems to be needed. Also is in heap_multi_insert and
> > > >    heap_upate.
> > >
> > > Could you explain a bit more what you mean by that? The reason it's a
> > > separate rdata entry is that otherwise a full page write will remove the
> > > information.
> >
> > Sorry, I missed the comment 'so that an eventual FPW doesn't
> > remove the tuple's data'. Although given the necessity of removal
> > prevention, rewriting rdata[].buffer which is required by design
> > (correct?)  with InvalidBuffer seems a bit outrageous for me and
> > obfuscating the objective of it.
> 
> Well, it's added in a separate rdata element. Just as in dozens of other
> places.

Mmmm. Was there any rdata entry which has substantial content
but whose .buffer is set to InvalidBuffer just to avoid removal by
an FPW? Still, despite my objection, I have become accustomed to
seeing it there and have come to think it is not so bad. I'll put
an end to this point with that comment.

> > > >  - In heap_delete, at the point adding replica identity, same to
> > > >    the aboves, rdata[3] could be moved into rdata[2] making new
> > > >    type like 'xl_heap_replident'.
> > >
> > > Hm. I don't think that'd be a good idea, because we'd then need special
> > > case decoding code for deletes because the wal format would be different
> > > for inserts/updates and deletes.
> >
> > Hmm. Although one common xl_heap_replident is impractical,
> > splitting a logcally single entity into two or more XLogRecDatas
> > also seems not to be a good idea.
> 
> That's done everywhere. There's basically two reasons to use separate
> rdata elements:
> * the buffers are different
> * the data pointer is different
> 
> The rdata chain elements don't survive in the WAL.

Well, I came to see rdata's as simple containers holding
fragments to be written into the WAL stream. Thanks for patiently
answering such silly questions.

> > Considering above, looking heap_xlog_insert(), you marked on
> > xlrec.flags with XLOG_HEAP_CONTAINS_NEW_TUPLE to signal decoder
> > that the record should have tuple data not being removed by fpw.
> > This is the same for the update record. So the redoer(?) also can
> > distinguish whether the update record contains extra tuple data
> > or not.
> 
> But it doesn't know the length of the individual records, so knowing
> there are two doesn't help.

The length could be attached only when XLOG_HEAP_CONTAINS_NEW_TUPLE
is set, but you are right that that would be too complicated.

> > On the other hand, the update record after patched is longer by
> > sizeof(uint16) regardless of whether 'tuple data' is attached or
> > not. I don't know the value of the size in WAL stream, but the
> > smaller would be better maybe.
> 
> I think that'd make things too complicated without too much gain in
> comparison to the savings.

As written above, I'm convinced with that.

> > # Of course, it doesn't matter if the object for OLD can
> > # naturally be missing as English.
> 
> Well, I think the context makes it clear enough.

Thank you. I learned that, perhaps :-)

> > I'm not a native English speaker as you know:-)
> 
> Neither am I ;). 

Yeah, you should've been more experienced than me :-p

> > undefining XLOG_HEAP_CONTAINS_OLD and use separte macros type 1
> > |  if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY ||
> > |      xlrec->flags & XLOG_HEAP_CONTAINS_OLD_TUPLE)
> > |  {
> > (I belive this should be optimized by the compiler:-)
> 
> It's not about efficiency, imo the other variant looks clearer. And will
> continue to work if we add the option to selectively log columns or
> such.

Well, It's ok for me after knowing that the 'OLD' alone makes
sense.

> >   if (RelationIsAccessibleInLogicalDecoding(relation))
> > + {
> > |    rdata_heap_new_cid(&rdata[0], relation, heaptup);
> > +       xlrec.flags |= XLOG_HEAP_CONTAINS_NEW_CID;
> > + }
> > + else
> > +       rdata_void(&rdata[0])
> > + rdata[0].next = &(rdata[1]);
> >
> >   xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
> >   xlrec.target.node = relation->rd_node;
> >   xlrec.target.tid = heaptup->t_self;
> > | rdata[1].data = (char *) &xlrec;
> >
> > If you don't agree with this, I don't say no more about this.
> 
> It's a lot more complex than that. This throws of all kind of size
> calculations. and it makes the redo functions more complex - and they
> are much more often executed for !CONTAINS_NEW_CID.

Mmm. I don't know the exact reason for omitting the length field in
the previous xlog format. I had supposed the saved bytes were more
significant than the calculation cycles. But that must be wrong, or
the savings were outweighed by the expected complexity, judging from
the fact that it is already committed.

> > > >  - The macro name INDEX_ATTR_BITMAP_KEY should be
> > > >    INDEX_ATTR_BITMAP_FKEY. And INDEX_ATTR_BITMAP_IDENTITY_KEY
> > > >    should be INDEX_ATTR_BITMAP_REPLID_KEY or something.
> > >
> > > But INDEX_ATTR_BITMAP_KEY isn't just about foreign keys... But I agree
> > > that INDEX_ATTR_BITMAP_IDENTITY_KEY should be renamed.
> >
> > Thank you, please remember to do that.
> 
> I am not sure it's a good idea anymore, it doesn't really seem to
> increase clarity...

And also might be excessive. But it seemed somewhat unclear to
unaccustomed eyes:-)

Thank you for all your answers.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-03 15:19:26 -0500, Robert Haas wrote:
>> Yeah, you're right.  I think the current logic will terminate when all
>> flags are set to false or all attribute numbers have been checked, but
>> it doesn't know that if HOT's been disproven then we needn't consider
>> further HOT columns.  I think the way to fix that is to tweak this
>> part:
>>
>> +               if (next_hot_attnum > FirstLowInvalidHeapAttributeNumber)
>>                         check_now = next_hot_attnum;
>> +               else if (next_key_attnum > FirstLowInvalidHeapAttributeNumber)
>> +                       check_now = next_key_attnum;
>> +               else if (next_id_attnum > FirstLowInvalidHeapAttributeNumber)
>> +                       check_now = next_id_attnum;
>>                 else
>> +                       break;
>>
>> What I think we ought to do there is change each of those criteria to
>> say if (hot_result && next_hot_attnum >
>> FirstLowInvalidHeapAttributeNumber) and similarly for the other two.
>> That way we consider each set a valid source of attribute numbers only
>> until the result flag for that set flips false.
>
> That seems to work well, yes.
>
> Updated & rebased series attached.
>
> * Rebased since the former patch 01 has been applied
> * Lots of smaller changes in the wal_level=logical patch
>   * Use Robert's version of wal_level=logical, with the above fixes
>   * Use only macros for RelationIsAccessibleInLogicalDecoding/LogicallyLogged
>   * Moved a mit more logic into ExtractReplicaIdentity
>   * some comment copy-editing
>   * Bug noted by Euler fixed, testcase added
> * Some copy editing in later patches, nothing significant.
>
> I've primarily sent this, because I don't know of further required
> changes in 0001-0003. I am trying reviewing the other patches in detail
> atm.

Committed #1 (again).  Regarding this:

+       /* XXX: we could also do this unconditionally, the space is used anyway
+       if (copy_oid)
+               HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));

I would like to put in a big +1 for doing that unconditionally.  I
didn't make that change before committing, but I think it'd be a very
good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I've primarily sent this, because I don't know of further required
> changes in 0001-0003. I am trying reviewing the other patches in detail
> atm.

Committed #3 also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> [ updated logical decoding patches ]

Regarding patch #4, introduce wal decoding via catalog timetravel,
which seems to be the bulk of what's not committed at this point...

- I think this needs SGML documentation, same kind of thing we have
for background workers, except probably significantly more.  A design
document with ASCII art in a different patch does not constitute
usable documentation.  I think it would fit best under section VII,
internals, perhaps entitled "Writing a Logical Decoding Plugin".
Looking at it, I rather wonder if the "Background Worker Processes"
ought to be there as well, rather than under section V, server
programming.

+                       /* setup the redirected t_self for the benefit of logical decoding */
...
+                       /* reset to original, non-redirected, tid */

Clear as mud.

+ * rrow to disk but only do so in batches when we've collected several of them

Typo.

+ * position before those records back. Independently from WAL logging,

"before those records back"?

+ * position before those records back. Independently from WAL logging,
+ * everytime a rewrite is finished all generated mapping files are directly

I would delete "Independently from WAL logging" from this sentence.
And "everytime" is two words.

+ * file. That leaves the tail end that has not yet been fsync()ed to disk open
...
+ * fsynced.

Pick a spelling and stick with it.  My suggestion is "flushed to
disk", not actually using fsync per se at all.

+ * TransactionDidCommit() check. But we want to support physical replication
+ * for availability and to support logical decoding on the standbys.

What is physical replication for availability?

+ * necessary. If we detect that we don't need to log anything we'll prevent
+ * any further action by logical_*rewrite*

Sentences should end with a period, and the reason for the asterisks
is not clear.

+       logical_xmin =
+               ((volatile LogicalDecodingCtlData*) LogicalDecodingCtl)->xmin;

Ugly.

+                                        errmsg("failed to write to logical remapping file: %m")));

Message style.

+                       ereport(ERROR,
+                                       (errcode_for_file_access(),
+                                        errmsg("incomplete write to logical remapping file, wrote %d of %u",
+                                                       written, len)));

Message style.  I suggest treating a short write as ENOSPC; there is
precedent elsewhere.
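
(The precedent referred to looks roughly like this; a sketch, with the file
descriptor and path variables assumed from context:)

    errno = 0;
    if (write(fd, buf, len) != len)
    {
        /* if write() didn't set errno, assume the problem is no disk space */
        if (errno == 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", path)));
    }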

I don't think there's much point in including "remapping" in all of
the error messages.  It adds burden for translators and users won't
know what a remapping file is anyway.

+               /*
+                * We intentionally violate the usual WAL coding practices here and
+                * write to the file *first*. This way an eventual checkpoint will
+                * sync any part of the file that's not guaranteed to be recovered by
+                * the XLogInsert(). We deal with the potential corruption in the tail
+                * of the file by truncating it to the last safe point during WAL
+                * replay and by checking whether the xid performing the mapping has
+                * committed.
+                */

Don't have two different comments explaining this.  Either replace
this one with a reference to the other one, or get rid of the other
one and just keep this one.  I vote for the latter.

I don't see a clear explanation anywhere of what the
rs_logical_mappings hash is actually doing.  This is badly needed.
This code basically presupposes that you know what it's try to
accomplish, and even though I sort of do, it leaves a lot to be
desired in terms of clarity.

+       /* nothing to do if we're not working on a catalog table */
+       if (!state->rs_logical_rewrite)
+               return;

Comment doesn't accurately describe code.

+       /* use *GetUpdateXid to correctly deal with multixacts */
+       xmax = HeapTupleHeaderGetUpdateXid(new_tuple->t_data);

I don't feel enlightened by that comment.

+       if (!TransactionIdIsNormal(xmax))
+       {
+               /*
+                * no xmax is set, can't have any permanent ones, so
this check is
+                * sufficient
+                */
+       }
+       else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple->t_data->t_infomask))
+       {
+               /* only locked, we don't care */
+       }
+       else if (!TransactionIdPrecedes(xmax, cutoff))
+       {
+               /* tuple has been deleted recently, log */
+               do_log_xmax = true;
+       }

Obfuscated.  Rewrite without empty blocks.

+       /*
+        * Write out buffer everytime we've too many in-memory entries.
+        */
+       if (state->rs_num_rewrite_mappings >= 1000 /* arbitrary number */)
+               logical_heap_rewrite_flush_mappings(state);

What happens if the number of items per hash table entry is small but
the number of entries is large?

+               /* XXX: should we warn about such files? */

Nah.

+                                                errmsg("Could not fsync logical remapping file \"%s\": %m",

Capitalization.

+ *             Decodes WAL records fed from xlogreader.h read into an reorderbuffer
+ *             while simultaneously letting snapbuild.c build an appropriate
+ *             snapshots to decode those.

This comment doesn't seem to have very good grammar, and it's just a
wee bit less explanation than is warranted.

+ * Take every XLogReadRecord()ed record and perform the actions required to

I'm generally not that fond of using function names as verbs.

+                * Rmgrs irrelevant for changeset extraction, they describe stuff not
+                * represented in logical decoding. Add new rmgrs in rmgrlist.h's
+                * order.

The following resource managers are irrelevant for changeset
extraction, because they describe...

+               case RM_NEXT_ID:
+               default:
+                       elog(ERROR, "unexpected RM_NEXT_ID rmgr_id");

Message doesn't match code.

+                       /* XXX: we could replay the transaction and prepare it as well. */

Should we do that?

+                                * Abort all transactions that we keep track of that are older

Come on.  You're not aborting anything; you're throwing away state
because you know it did abort.  The function naming here could maybe
use some work, too.  ReorderBufferDiscardXID()?

+                                * for doing so since, in contrast to shutdown or end of
+                                * recover checkpoints, we have sufficient knowledge to deal

recovery, not recover

+                        * XXX: There doesn't seem to be a usecase for decoding

Why XXX?

+               case XLOG_HEAP_INPLACE:
+                       /*
+                        * Cannot be important for our purposes, not part of transactions.
+                        */
+                       if (!TransactionIdIsValid(xid))
+                               break;
+
+                       SnapBuildProcessChange(builder, xid, buf->origptr);
+                       /* heap_inplace is only done in catalog modifying txns */
+                       ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+                       break;

It is not clear to me why we care about some instances of this and not others.

+ * logical.c
+ *
+ *        Logical decoding shared memory management
...
+ * logical decoding on-disk data structures.

So, apparently it's more than just shared memory management.

+       /*
+        * don't overwrite if we already have a newer xmin. This can
+        * happen if we restart decoding in a slot.
+        */
+       if (TransactionIdPrecedesOrEquals(xmin, MyLogicalDecodingSlot->xmin))
+       {
+       }
+       /*
+        * If the client has already confirmed up to this lsn, we directly
+        * can mark this as accepted. This can happen if we restart
+        * decoding in a slot.
+        */
+       else if (lsn <= MyLogicalDecodingSlot->confirmed_flush)

Try to avoid empty blocks.  And we don't normally put comments between
the closing brace of the if and the else clause.

+               elog(DEBUG1, "got new xmin %u at %X/%X", xmin,
+                        (uint32) (lsn >> 32), (uint32) lsn);
+       }
+       SpinLockRelease(&MyLogicalDecodingSlot->mutex);

Don't elog() while holding a spinlock.
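
(The usual fix is to do only trivial assignments under the spinlock and emit
the message afterwards; a sketch, not the committed code:)

    SpinLockAcquire(&MyLogicalDecodingSlot->mutex);
    MyLogicalDecodingSlot->xmin = xmin;
    SpinLockRelease(&MyLogicalDecodingSlot->mutex);

    elog(DEBUG1, "got new xmin %u at %X/%X",
         xmin, (uint32) (lsn >> 32), (uint32) lsn);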

+XLogRecPtr ComputeLogicalRestartLSN(void)

Formatting.

+       if (wal_level < WAL_LEVEL_LOGICAL)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("logical decoding requires wal_level=logical")));
+
+       if (MyDatabaseId == InvalidOid)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("logical decoding requires to be connected to a database")));
+
+       if (max_logical_slots == 0)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                (errmsg("logical decoding requires needs max_logical_slots > 0"))));
+

Message style, times three.

+                                errmsg("There already is a logical slot named \"%s\"", name)));

And again.

All right, I'm giving up for now.  These patches need a LOT of work on
comments and documentation to be committable, even if the underlying
architecture is basically sound.  There's a lot of stuff that has no
comment at all, and a lot of the comments that do exist are basically
recapitulating what the code says (or in some cases, not what the code
says) rather than explaining what the purpose of all of this stuff is
at a conceptual level.  The comment at the header of reorderbuffer.c,
for example, is well-written and cogent, but there's a lot of places
where similar detail is needed but lacking.  I realize that it isn't
project policy for every function to have a header comment but at the
very least I think it'd be worth asking, for each one, why it doesn't
need one, and/or what information could be provided in such a comment
to most effectively inform the reader.

+        * We free separately allocated data by entirely scrapping oure personal

Spelling.


+        * clog. If we're doing logical replication we can't do that though, so
+        * hold the lock for a moment longer.

...because why?

I'm still unhappy that we're introducing logical decoding slots but no
analogue for physical replication.  If we had the latter, would it
share meaningful amounts of structure with this?

+                * noncompatible way, but those are prevented both on catalog
+                * tables and on user tables declared as additional catalog
+                * tables.

Really?

My eyes are starting to glaze over, so really stopping here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Andres Freund
Date:
On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
> Committed #1 (again).

Thanks!

>  Regarding this:
> 
> +       /* XXX: we could also do this unconditionally, the space is used anyway
> +       if (copy_oid)
> +               HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
> 
> I would like to put in a big +1 for doing that unconditionally.  I
> didn't make that change before committing, but I think it'd be a very
> good idea.

Ok. I wasn't sure if it wouldn't be weird to include the oid in the
tuple logged for a replica identity that doesn't cover the oid. But the
downside is pretty small...

Will send a patch.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
Andres Freund
Date:
On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
> Committed #1 (again).  Regarding this:
>
> +       /* XXX: we could also do this unconditionally, the space is used anyway
> +       if (copy_oid)
> +               HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
>
> I would like to put in a big +1 for doing that unconditionally.  I
> didn't make that change before committing, but I think it'd be a very
> good idea.

Patch attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Wed, Dec 11, 2013 at 11:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
>> Committed #1 (again).  Regarding this:
>>
>> +       /* XXX: we could also do this unconditionally, the space is used anyway
>> +       if (copy_oid)
>> +               HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
>>
>> I would like to put in a big +1 for doing that unconditionally.  I
>> didn't make that change before committing, but I think it'd be a very
>> good idea.
>
> Patch attached.

Committed with kibitzing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Andres Freund
Date:
Hi,

Lots of sensible comments removed; I plan to make changes to
address them.

On 2013-12-10 22:17:44 -0500, Robert Haas wrote:
> - I think this needs SGML documentation, same kind of thing we have
> for background workers, except probably significantly more.  A design
> document with ASCII art in a different patch does not constitute
> usable documentation.  I think it would fit best under section VII,
> internals, perhaps entitled "Writing a Logical Decoding Plugin".
> Looking at it, I rather wonder if the "Background Worker Processes"
> ought to be there as well, rather than under section V, server
> programming.

Completely agreed it needs that. I'd like the UI decisions in
http://www.postgresql.org/message-id/20131205001520.GA8935@awork2.anarazel.de
resolved before though.

> +       logical_xmin =
> +               ((volatile LogicalDecodingCtlData*) LogicalDecodingCtl)->xmin;
> 
> Ugly.

We really should have a macro for this...

> Message style.  I suggest treating a short write as ENOSPC; there is
> precedent elsewhere.
> 
> I don't think there's much point in including "remapping" in all of
> the error messages.  It adds burden for translators and users won't
> know what a remapping file is anyway.

It helps in locating which part of the code caused a problem. I utterly
hate to get reports with error messages that I can't correlate to a
source file. Yes, I know that verbose error output exists, but usually
you don't get it that way... That said, I'll try to make the messages
simpler.

> +       /* use *GetUpdateXid to correctly deal with multixacts */
> +       xmax = HeapTupleHeaderGetUpdateXid(new_tuple->t_data);
>
> I don't feel enlightened by that comment.

Well, it will return the update xid of a tuple locked with NO KEY SHARE
and updated (with NO KEY UPDATE). I.e. 9.3+ foreign key locking stuff.
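
Roughly, the distinction is (a sketch, not patch code):

    /*
     * With 9.3+ tuple locking xmax may be a MultiXactId that bundles
     * lockers and the updater, so the raw value isn't necessarily the
     * xid that updated the tuple.
     */
    xmax = HeapTupleHeaderGetRawXmax(new_tuple->t_data);    /* possibly a MultiXactId */
    xmax = HeapTupleHeaderGetUpdateXid(new_tuple->t_data);  /* the xid that actually updated the tuple, if any */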

> +       if (!TransactionIdIsNormal(xmax))
> +       {
> +               /*
> +                * no xmax is set, can't have any permanent ones, so
> this check is
> +                * sufficient
> +                */
> +       }
> +       else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple->t_data->t_infomask))
> +       {
> +               /* only locked, we don't care */
> +       }
> +       else if (!TransactionIdPrecedes(xmax, cutoff))
> +       {
> +               /* tuple has been deleted recently, log */
> +               do_log_xmax = true;
> +       }
> 
> Obfuscated.  Rewrite without empty blocks.

I don't understand why having an empty block is less clear than
including a condition in several branches? Especially if the individual
conditions might need to be documented?

> +       /*
> +        * Write out buffer everytime we've too many in-memory entries.
> +        */
> +       if (state->rs_num_rewrite_mappings >= 1000 /* arbitrary number */)
> +               logical_heap_rewrite_flush_mappings(state);
> 
> What happens if the number of items per hash table entry is small but
> the number of entries is large?

rs_num_rewrite_mappings is the overall number of in-memory mappings, not
the number of per-entry mappings. That means we flush, even if all
entries have only a couple of mappings, as soon as 1000 in memory
entries have been collected. A bit simplistic, but seems okay enough?
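
Schematically (a sketch of the bookkeeping being described; field and function
names follow the quoted code, the rest is illustrative):

    /* queue one new mapping under its (xid, relfilenode) hash entry */
    entry->num_mappings++;                  /* per-entry count */
    state->rs_num_rewrite_mappings++;       /* total across all entries */

    /*
     * Flush once the *total* gets large, regardless of how the mappings are
     * distributed over the entries; the flush writes out every entry's buffer.
     */
    if (state->rs_num_rewrite_mappings >= 1000)
        logical_heap_rewrite_flush_mappings(state);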

> +               /* XXX: should we warn about such files? */
> 
> Nah.

Ok, will remove that comment from a couple locations then...

> +                       /* XXX: we could replay the transaction and
> prepare it as well. */
> 
> Should we do that?

It would allow a neat feature, namely using 2PC to make sure that a
transaction commits on all the nodes connected using changeset
extraction. But I think that's a feature for later. Its use would have
to be optional anyway.


> All right, I'm giving up for now.  These patches need a LOT of work on
> comments and documentation to be committable, even if the underlying
> architecture is basically sound.

> +        * clog. If we're doing logical replication we can't do that though, so
> +        * hold the lock for a moment longer.
> 
> ...because why?

That's something pretty icky imo. But more in the existing HS code than
in this. Without the changes in the locking we can have the situation
that transactions are marked as running in the xl_running_xact record,
but are actually already committed. There's some code for HS that tries
to cope with that situation but I don't trust it very much, and it'd be
more complicated to make it work for logical decoding. I could reproduce
problems for the latter without those changes.

I'll add a comment explaining this.

> I'm still unhappy that we're introducing logical decoding slots but no
> analogue for physical replication.  If we had the latter, would it
> share meaningful amounts of structure with this?

Yes, I think we could mostly reuse it, we'd probably want to add a field
or two more (application_name, sync_prio?). I have been wondering
whether some of the code in replication/logical/logical.c shouldn't be
in replication/slot.c or similar. So far I've opted for leaving it in
its current place since it would have to change a bit for a more general
role.

I still think that the technical parts of properly supporting persistent
slots for physical rep aren't that hard, but that the behavioural
decisions are. I think there are primarily two things for SR that we
want to prevent using slots:
a) removal of still used WAL (i.e. maintain knowledge about a standby's
   last required REDO location)
b) make hot_standby_feedback work across disconnections of the walsender
   connection (i.e. peg xmin, not just for catalogs though)
c) Make sure we can transport those across cascading
   replication.
once those are there it's also useful to keep a bit more information
about the state of replicas:
* write, flush, apply lsn

The hard questions that I see are like:
* How do we manage standby registration? Does the admin have to do that
  manually? Or does a standby register itself automatically if some config
  parameter is set?
* If automatically, how do we deal with the situation that the registrant
  dies before noting his own identifier somewhere persistent? My best idea
  is a two-phase registration process where registrations in phase 1 are
  thrown away after a restart, but yuck.
* How do we deal with the usability wart that people *will* forget to
  delete a slot when removing a node?
* What requirements do we have for transporting knowledge about this
  across a failover?

I have little hope that we can resolve all that for 9.4.

> + * noncompatible way, but those are prevented both on catalog
> + * tables and on user tables declared as additional catalog
> + * tables.
> 
> Really?

Yes, I think so, c.f. ATRewriteTables():

        /*
         * We don't support rewriting of system catalogs; there are too
         * many corner cases and too little benefit.  In particular this
         * is certainly not going to work for mapped catalogs.
         */
        if (IsSystemRelation(OldHeap))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot rewrite system relation \"%s\"",
                            RelationGetRelationName(OldHeap))));

        if (RelationIsUsedAsCatalogTable(OldHeap))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot rewrite table \"%s\" used as a catalog table",
                            RelationGetRelationName(OldHeap))));

Do you see situations where that's not sufficient? It's not even
dependent on allow_system_table_mods ...

> My eyes are starting to glaze over, so really stopping here.

There's quite a bit to do from that point, so I think you've more than
done your duty... Thanks!

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
David Rowley
Date:
On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Committed #1 (again).  Regarding this:


This introduced a new compiler warning on the Visual Studio build:
  d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715: 'RelationGetIndexAttrBitmap' : not all control paths return a value [D:\Postgres\b\postgres.vcxproj]

The attached patch fixes it.

Regards

David Rowley
 
+       /* XXX: we could also do this unconditionally, the space is used anyway
+       if (copy_oid)
+               HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));

I would like to put in a big +1 for doing that unconditionally.  I
didn't make that change before committing, but I think it'd be a very
good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: logical changeset generation v6.8

From
Andres Freund
Date:
On 2013-12-13 20:58:24 +1300, David Rowley wrote:
> On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This introduced a new compiler warning on the visual studios build:
>   d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715:
> 'RelationGetIndexAttrBitmap' : not all control paths return a value
> [D:\Postgres\b\postgres.vcxproj]
> 
> The attached patch fixes it.

I thought we'd managed to get elog(ERROR) properly annotated as noreturn
on msvc as well?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
David Rowley
Date:
On Sat, Dec 14, 2013 at 12:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-12-13 20:58:24 +1300, David Rowley wrote:
> On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This introduced a new compiler warning on the visual studios build:
>   d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715:
> 'RelationGetIndexAttrBitmap' : not all control paths return a value
> [D:\Postgres\b\postgres.vcxproj]
>
> The attached patch fixes it.

I thought we'd managed to get elog(ERROR) properly annotated as noreturn
on msvc as well?


It looks like this is down to the elog macro, where elevel is assigned to a variable elevel_ and pg_unreachable() is only done if elevel_ >= ERROR. The compiler must not be confident enough to optimise out the if condition, even though elevel_ is not changed after it is set from the constant.
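
For reference, the macro in question has roughly this shape (paraphrased from
elog.h, slightly simplified):

    #define elog(elevel, ...) \
        do { \
            int elevel_; \
            elog_start(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
            elevel_ = (elevel); \
            elog_finish(elevel_, __VA_ARGS__); \
            if (elevel_ >= ERROR) \
                pg_unreachable(); \
        } while(0)

So whether a call site is seen as noreturn depends on the compiler propagating
the constant through elevel_ into the if test.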

Regards

David Rowley
 
Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Wed, Dec 11, 2013 at 7:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think there's much point in including "remapping" in all of
>> the error messages.  It adds burden for translators and users won't
>> know what a remapping file is anyway.
>
> It helps in locating wich part of the code caused a problem. I utterly
> hate to get reports with error messages that I can't correlate to a
> sourcefile. Yes, I know that verbose error output exists, but usually
> you don't get it that way... That said, I'll try to make the messages
> simpler.

Well, we could adopt a policy of not making messages originating from
different locations in the code the same.  However, it looks to me
like that's NOT the current policy, because some care has been taken
to reuse messages rather than distinguish them.  There's no hard and
fast rule here, because some cases are distinguished, but my gut
feeling is that all of the errors your patch introduces are
sufficiently obscure cases that separate messages with separate
translations are not warranted.  The time when this is likely to fail
is when someone borks the permissions on the data directory, and the
user probably won't need to care exactly which file we can't write.

>> +       if (!TransactionIdIsNormal(xmax))
>> +       {
>> +               /*
>> +                * no xmax is set, can't have any permanent ones, so
>> this check is
>> +                * sufficient
>> +                */
>> +       }
>> +       else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple->t_data->t_infomask))
>> +       {
>> +               /* only locked, we don't care */
>> +       }
>> +       else if (!TransactionIdPrecedes(xmax, cutoff))
>> +       {
>> +               /* tuple has been deleted recently, log */
>> +               do_log_xmax = true;
>> +       }
>>
>> Obfuscated.  Rewrite without empty blocks.
>
> I don't understand why having an empty block is less clear than
> including a condition in several branches? Especially if the individual
> conditions might need to be documented?

It's not a coding style we typically use, to my knowledge.  Of course,
what is actually clearest is a matter of opinion, but my own
experience is that such blocks typically end up seeming clearer to me
when the individual comments are joined together into a paragraph that
explains, in full sentences, what's going on.
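
Something along these lines, say (a sketch of the suggested style, not
committed code):

    /*
     * Log the xmax if it is a normal, permanent transaction id, is not
     * merely a locker, and is recent enough that decoding may still need
     * to see it; otherwise there is nothing to record.
     */
    if (TransactionIdIsNormal(xmax) &&
        !HEAP_XMAX_IS_LOCKED_ONLY(new_tuple->t_data->t_infomask) &&
        !TransactionIdPrecedes(xmax, cutoff))
        do_log_xmax = true;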

>> +       /*
>> +        * Write out buffer everytime we've too many in-memory entries.
>> +        */
>> +       if (state->rs_num_rewrite_mappings >= 1000 /* arbitrary number */)
>> +               logical_heap_rewrite_flush_mappings(state);
>>
>> What happens if the number of items per hash table entry is small but
>> the number of entries is large?
>
> rs_num_rewrite_mappings is the overall number of in-memory mappings, not
> the number of per-entry mappings. That means we flush, even if all
> entries have only a couple of mappings, as soon as 1000 in memory
> entries have been collected. A bit simplistic, but seems okay enough?

Possibly, not sure yet.  I need to examine that logic in more detail,
but I had trouble following it and was hoping the next version would
be better-commented.

>> I'm still unhappy that we're introducing logical decoding slots but no
>> analogue for physical replication.  If we had the latter, would it
>> share meaningful amounts of structure with this?
>
> Yes, I think we could mostly reuse it, we'd probably want to add a field
> or two more (application_name, sync_prio?). I have been wondering
> whether some of the code in replication/logical/logical.c shouldn't be
> in replication/slot.c or similar. So far I've opted for leaving it in
> its current place since it would have to change a bit for a more general
> role.

I strongly favor moving the slot-related code to someplace with "slot"
in the name, and replication/slot.c seems about right.  Even if we
don't extend them to cover non-logical replication in this release,
we'll probably do it eventually, and it'd be better if that didn't
require moving large amounts of code between files.

> I still think that the technical parts of properly supporting persistent
> slots for physical rep aren't that hard, but that the behavioural
> decisions are. I think there are primarily two things for SR that we
> want to prevent using slots:
> a) removal of still used WAL (i.e. maintain knowledge about a standby's
>    last required REDO location)

Check.

> b) make hot_standby_feedback work across disconnections of the walsender
>    connection (i.e peg xmin, not just for catalogs though)

Check; might need to be optional.

> c) Make sure we can transport those across cascading
>    replication.

Not sure I follow.

> once those are there it's also useful to keep a bit more information
> about the state of replicas:
> * write, flush, apply lsn

Check.

> The hard questions that I see are like:
> * How do we manage standby registration? Does the admin have to do that
>   manually? Or does a standby register itself automatically if some config
>   paramter is set?
> * If automatically, how do we deal with the situation that registrant
>   dies before noting his own identifier somewhere persistent? My best idea
>   is a two phase registration process where registration in phase 1 are
>   thrown away after a restart, but yuck.

If you don't know the answers to these questions for the kind of
replication that we have now, then how do you know the answers for
logical replication?  Conversely, what makes the answers that you've
selected for logical replication unsuitable for our existing
replication?

I have to admit that before I saw your design for the logical
replication slots, the problem of making this work for our existing
replication stuff seemed almost intractable to me; I had no idea how
that was going to work. GUCs didn't seem suitable because I thought we
might need some data that is tabular in nature - i.e. configuration
specific to each slot.  And GUC has problems with that sort of thing.
And a new system catalog didn't seem good either, because you are
highly likely to want different configuration on the standby vs. on
the master.  But after I saw your design for the logical slots I said,
dude, get me some of that.  Keeping the data in shared memory,
persisting them across shutdowns, and managing them via either
function calls or the replication command language seems perfect.

Now, in terms of how registration works, whether for physical
replication or logical, it seems to me that the DBA will have to be
responsible for telling each client the name of the slot to which that
client should connect (omitting it at said DBA's option if, as for
physical replication, the slot is not mandatory).  Assuming such a
design, clients could elect for one of two possible behaviors when the
anticipated slot doesn't exist: either they error out, or they create
it.  Either is reasonable; we could even support both.  A third
alternative is to make each client responsible for generating a name,
but I think that's probably not as good.  If we went that route, the
name would presumably be some kind of random string, which will
probably be a lot less usable than a DBA-assigned name.  The client
would first generate it, second save it somewhere persistent (like a
file in the data directory), and third create a slot by that name.  If
the client crashes before creating the slot, it will find the name in
its persistent store after restart and, upon discovering that no such
slot exists, try again to create it.

But note that no matter which of those three options we pick, the
server support really need not work any differently.  I can imagine
any of them being useful and I don't care all that much which one we
end up with.  My personal preference is probably for manual
registration: if the DBA wants to use slots, said DBA will need to set
them up.  This also mitigates the issue you raise in your next point:
what is manually created will naturally also need to be manually
destroyed.

> * How do we deal with the usability wart that people *will* forget to
>   delete a slot when removing a node?

Aside from the point mentioned in the previous paragraph, we don't.
The fact that a standby which gets too far behind can't recover is a
usability wart, too.  I think this is a bit like asking what happens
if a user keeps inserting data of only ephemeral value into a table
and never deletes any of it as they ought to have done.  Well, they'll
be wasting disk space; potentially, they will fill the disk.  Sucks to
be them.  The solution is not to prohibit inserting data.

> * What requirements do we have for transporting knowlede about this
>   across a failover?
>
> I have little hope that we can resolve all that for 9.4.

I wouldn't consider this a requirement for a useful feature, and if it
is a requirement, then how are you solving it for logical replication?For physical replication, there's basically only
oneevent that needs
 
to be considered: master dead, promote standby.  That scenario can
also happen in logical replication... but there's a whole host of
other things that can happen, too.  An individual replication solution
may be single-writer but may, like Slony, allow that write authority
to be moved around.  Or it may have multiple writers.

Suppose A, B, and C replicate between themselves using mult-master
replication for write availability across geographies.  Within each
geo, A physically replicates to A', B to B', and C to C'.   There is
also a reporting server D which replicates selected tables from each
of A, B, and C.  On a particularly bad day, D is switched to replicate
a particular table from B rather than A while at the same time B is
failed over to B' in the midst of a network outage that renders B and
B' only intermittently network-accessible, leading to a high rate of
multi-master conflicts with C that require manual resolution, which
doesn't happen until the following week.  In the meantime, B' is
failed back to B and C is temporarily removed from the multi-master
cluster and allowed to run standalone.  If you can develop a
replication solution that leaves D with the correct data at the end of
all of that, my hat is off to you ... and the problem of getting all
this to work when only physical replication is in use and then number
of possible scenarios is much less ought to seem simple by comparison.

But whether it does or not, I don't see why we have to solve it in
this release.  Following standby promotions, etc. is a whole feature,
or several, unto itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Andres Freund
Date:
Hi Robert,

On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
> On Wed, Dec 11, 2013 at 7:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think there's much point in including "remapping" in all of
> >> the error messages.  It adds burden for translators and users won't
> >> know what a remapping file is anyway.
> >
> > It helps in locating wich part of the code caused a problem. I utterly
> > hate to get reports with error messages that I can't correlate to a
> > sourcefile. Yes, I know that verbose error output exists, but usually
> > you don't get it that way... That said, I'll try to make the messages
> > simpler.
> 
> Well, we could adopt a policy of not making messages originating from
> different locations in the code the same.  However, it looks to me
> like that's NOT the current policy, because some care has been taken
> to reuse messages rather than distinguish them.

To me that mostly looks like cases where we either don't want to tell
more for various security-ish purposes or where they have been copy and
pasted...

> There's no hard and
> fast rule here, because some cases are distinguished, but my gut
> feeling is that all of the errors your patch introduces are
> sufficiently obscure cases that separate messages with separate
> translations are not warranted.

Perhaps we should just introduce a marker that some such strings are not
to be translated if they are of the unexpected kind. That would probably
make debugging easier too ;)

> I need to examine that logic in more detail,
> but I had trouble following it and was hoping the next version would
> be better-commented.

Yea, I've started expanding the comments about the higher level concerns
- I've been so knee deep in this that I didn't realize they weren't
there.

> > b) make hot_standby_feedback work across disconnections of the walsender
> >    connection (i.e peg xmin, not just for catalogs though)
> 
> Check; might need to be optional.

Yea, I am pretty sure it will. It'd probably be pretty nasty to set
min_recovery_apply_delay=7d and force xmin to be kept back to that...

> > c) Make sure we can transport those across cascading
> >    replication.
> 
> Not sure I follow.

Consider a replication scenario like primary <-> standby-1 <->
standby-2. The primary may not only not remove data that standby-1
requires, but also not data that standby-2 needs. Not really necessary
for WAL since that will also reside on standby-1 but definitely for the
xmin horizon.
So standby-1 will need to signal not only his own needs, but also of the
nodes below.

> > The hard questions that I see are like:
> > * How do we manage standby registration? Does the admin have to do that
> >   manually? Or does a standby register itself automatically if some config
> >   paramter is set?
> > * If automatically, how do we deal with the situation that registrant
> >   dies before noting his own identifier somewhere persistent? My best idea
> >   is a two phase registration process where registration in phase 1 are
> >   thrown away after a restart, but yuck.
>
> If you don't know the answers to these questions for the kind of
> replication that we have now, then how do you know the answers for
> logical replication?  Conversely, what makes the answers that you've
> selected for logical replication unsuitable for our existing
> replication?

There's a pretty fundamental difference imo - with the logical decoding
stuff we only supply support for change producing nodes, with physical
rep we supply both.
There's no need to decide about the way node ids are stored in in-core logical
rep. consumers since there are no in-core ones. Yet. Also, physical rep
by now is a pretty established thing, we need to be much more careful
about compatibility there.

> I have to admit that before I saw your design for the logical
> replication slots, the problem of making this work for our existing
> replication stuff seemed almost intractable to me; I had no idea how
> that was going to work.

Believe me, it caused me some headaches to conceive it for decoding
too. Oh, and I think I watched just about all episodes of some stupid TV
show during it ;)

> Keeping the data in shared memory,
> persisting them across shutdowns, and managing them via either
> function calls or the replication command language seems perfect.

Thanks. I think the concept has quite some merits. The implementation is
a bit simplistic atm, we e.g. might want to work harder at coalescing
fsync()s and such, but that's a further step when we see whether it's
worthwhile in the real world.

> Now, in terms of how registration works, whether for physical
> replication or logical, it seems to me that the DBA will have to be
> responsible for telling each client the name of the slot to which that
> client should connect (omitting it at said DBA's option if, as for
> physical replication, the slot is not mandatory).

It seems reasonable to me to reuse the application_name for the slot's
name, similar to the way it's used for synchronous rep. It seems odd to
use two different ways to identify nodes. It should probably only be part
of the final slot name though, with the rest being autogenerated.

> Assuming such a design, clients could elect for one of two possible
> behaviors when the anticipated slot doesn't exist: either they error
> out, or they create it.  Either is reasonable; we could even support
> both.

For physical rep, I don't see too much argument for not autogenerating
it. The one downside I can see is that it makes accidentally changing a
slot's name easier, with the consequence of leaving an unused slot around.

>  A third
> alternative is to make each client responsible for generating a name,
> but I think that's probably not as good.  If we went that route, the
> name would presumably be some kind of random string, which will
> probably be a lot less usable than a DBA-assigned name.  The client
> would first generate it, second save it somewhere persistent (like a
> file in the data directory), and third create a slot by that name.  If
> the client crashes before creating the slot, it will find the name in
> its persistent store after restart and, upon discovering that no such
> slot exists, try again to create it.

I thought we'd need to go that route for a while, but I struggled
exactly with the kind of races you describe. I'd already toyed with ideas
of making slots "ephemeral" initially, until they get confirmed by the
standby. Essentially reinventing 2PC...
Not needing to solve those problems sounds like a good idea.

> But note that no matter which of those three options we pick, the
> server support really need not work any differently.

Yea, that part isn't worrying me overly much - there's really not much
beside passing the slot name before/in START_REPLICATION that needs to
be done.

> > * How do we deal with the usability wart that people *will* forget to
> >   delete a slot when removing a node?
> 
> Aside from the point mentioned in the previous paragraph, we don't.
> The fact that a standby which gets too far behind can't recover is a
> usability wart, too.  I think this is a bit like asking what happens
> if a user keeps inserting data of only ephemeral value into a table
> and never deletes any of it as they ought to have done.  Well, they'll
> be wasting disk space; potentially, they will fill the disk.  Sucks to
> be them.  The solution is not to prohibit inserting data.

I am perfectly happy with taking that stance - in previous discussions
some (most notably Peter G.) argued ardently against it though.

I think we need to improve the monitoring facilities a bit, and that
should be it. Like
* expose xmin in pg_stat_activity, pg_prepared_xacts, pg_replication_slots (or whatever it's going to be called)
* expose the last restartpoint's redo pointer in pg_stat_replication, pg_replication_slots

That said, the consequences can be a bit harsher than a full disk - the
anti-wraparound security measures might kick in requiring a restart into
single user mode. That's way more confusing than cleaning up a bit of
space on the disk.

> > * What requirements do we have for transporting knowledge about this
> >   across a failover?

> I wouldn't consider this a requirement for a useful feature, and if it
> is a requirement, then how are you solving it for logical replication?

Failover for physical rep and for logical rep are pretty different
beasts imo - I don't think they can easily be compared. Especially since
we won't deliver a full logical rep solution and as you say it will look
pretty different depending on the individual solution.

Consider what happens though, if you promote a node for physical rep. As
soon as it's promoted, it will accept writes and then start a
checkpoint. Unless other standbys have started to follow that node
before either that checkpoint happens (removing WAL) or
autovacuuming/hot-pruning is performed (creating recovery conflicts),
we'll possibly lose the data required to let the standbys follow the
promotion. Note that wal_keep_segments and vacuum_defer_cleanup_age both
sorta work for that...

Could somebody please deliver me a time dilation device?

Regards,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> There's no hard and
>> fast rule here, because some cases are distinguished, but my gut
>> feeling is that all of the errors your patch introduces are
>> sufficiently obscure cases that separate messages with separate
>> translations are not warranted.
>
> Perhaps we should just introduce a marker that some such strings are not
> to be translated if they are of the unexpected kind. That would probably
> make debugging easier too ;)

Well, we have that: it's called elog.  But that doesn't seem like the
right thing here.
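
For reference, the distinction at issue here, sketched with made-up
message text and placeholder variables (state, name):

    /* elog(): message is never extracted for translation; internal cases */
    elog(ERROR, "unexpected state %d in logical decoding", state);

    /* ereport() + errmsg(): message is extracted for translation; user-facing */
    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_OBJECT),
             errmsg("replication slot \"%s\" does not exist", name)));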

>> > b) make hot_standby_feedback work across disconnections of the walsender
>> >    connection (i.e peg xmin, not just for catalogs though)
>>
>> Check; might need to be optional.
>
> Yea, I am pretty sure it will. It'd probably be pretty nasty to set
> min_recovery_apply_delay=7d and force xmin to be kept back that far...

Yes, that would be... unfortunate.

>> > c) Make sure we can transport those across cascading
>> >    replication.
>>
>> Not sure I follow.
>
> Consider a replication scenario like primary <-> standby-1 <->
> standby-2. The primary must retain not only the data that standby-1
> requires, but also the data that standby-2 needs. Not really necessary
> for WAL, since that will also reside on standby-1, but definitely for
> the xmin horizon.
> So standby-1 will need to signal not only its own needs, but also those
> of the nodes below.

True.

>> > The hard questions that I see are like:
>> > * How do we manage standby registration? Does the admin have to do that
>> >   manually? Or does a standby register itself automatically if some config
>> >   paramter is set?
>> > * If automatically, how do we deal with the situation that registrant
>> >   dies before noting his own identifier somewhere persistent? My best idea
>> >   is a two phase registration process where registration in phase 1 are
>> >   thrown away after a restart, but yuck.
>>
>> If you don't know the answers to these questions for the kind of
>> replication that we have now, then how do you know the answers for
>> logical replication?  Conversely, what makes the answers that you've
>> selected for logical replication unsuitable for our existing
>> replication?
>
> There's a pretty fundamental difference imo - with the logical decoding
> stuff we only supply support for change producing nodes, with physical
> rep we supply both.

I'm not sure I follow this.  "Both" what and what?

> There's no need to decide about the way node ids are stored in in-core logical
> rep. consumers since there are no in-core ones. Yet.

I don't know that we have or need to make any judgements about how to
store node IDs.  You have decided that slots have names, and I see no
problem there.

> Also, physical rep
> by now is a pretty established thing, we need to be much more careful
> about compatibility there.

I don't think we should change anything in backward-incompatible
fashion.  If we add any new behavior, it'd surely be optional.

> I think we need to improve the monitoring facilities a bit, and that
> should be it. Like
> * expose xmin in pg_stat_activity, pg_prepared_xacts,
>   pg_replication_slots (or whatever it's going to be called)
> * expose the last restartpoint's redo pointer in pg_stat_replication, pg_replication_slots

+1.

> That said, the consequences can be a bit harsher than a full disk - the
> anti-wraparound security measures might kick in requiring a restart into
> single user mode. That's way more confusing than cleaning up a bit of
> space on the disk.

Yes, true.  I'm not sure what approach to that problem is best.  It's
long seemed to me that well before we get to the point of shutting
down the whole cluster we ought to just start killing sessions with
old xmins.  But that doesn't generalize well to prepared transactions,
which can't just be rolled back or committed without guidance; and
killing slots seems a bit dicey too.

> Consider what happens though, if you promote a node for physical rep. As
> soon as it's promoted, it will accept writes and then start a
> checkpoint. Unless other standbys have started to follow that node
> before either that checkpoint happens (removing WAL) or
> autovacuuming/hot-pruning is performed (creating recovery conflicts),
> we'll possibly lose the data required to let the standbys follow the
> promotion. Note that wal_keep_segments and vacuum_defer_cleanup_age both
> sorta work for that...

True.

> Could somebody please deliver me a time dilation device?

Upon reflection, I am less concerned with actually having physical
slots in this release than I am with making sure we're not boxing
ourselves into a corner that will make them hard to add later.  If
we've got a clear design that can be generalized to that case, but the
SMOP required exceeds what can be done in the time available, I am OK
to punt it.  But I am not sure we're at that point yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Perhaps we should just introduce a marker that some such strings are not
>> to be translated if they are of the unexpected kind. That would probably
>> make debugging easier too ;)

> Well, we have that: it's called elog.  But that doesn't seem like the
> right thing here.

errmsg_internal?
        regards, tom lane



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Mon, Dec 16, 2013 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Perhaps we should just introduce a marker that some such strings are not
>>> to be translated if they are of the unexpected kind. That would probably
>>> make debugging easier too ;)
>
>> Well, we have that: it's called elog.  But that doesn't seem like the
>> right thing here.
>
> errmsg_internal?

There's that, too.  But again, these messages are not can't-happen
scenarios.  The argument is just whether to reuse existing error
message text (like "could not write file") or invent a new variation
(like "could not write remapping file").  Andres' argument (which is
valid) is that distinguished messages make it easier to troubleshoot
without needing to turn on verbose error messages.  My argument (which
I think is also valid) is that a user isn't likely to know what a
remapping file is, and having more messages increases the translation
burden.  Is there a project policy on this topic?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6.8

From
Alvaro Herrera
Date:
Robert Haas wrote:

> There's that, too.  But again, these messages are not can't-happen
> scenarios.  The argument is just whether to reuse existing error
> message text (like "could not write file") or invent a new variation
> (like "could not write remapping file").  Andres' argument (which is
> valid) is that distinguished messages make it easier to troubleshoot
> without needing to turn on verbose error messages.  My argument (which
> I think is also valid) is that a user isn't likely to know what a
> remapping file is, and having more messages increases the translation
> burden.  Is there a project policy on this topic?

I would vote for a generic "could not write file %s" where the %s lets
the troubleshooter know the path of the file, and thus in what context
it is being accessed.  We already have a similar case where slru.c reports
errors as pertaining to "transaction 12345" but the path is
"pg_subtrans/xyz" or multixact etc; while it doesn't explicitly say
what module is raising the error, it's pretty clear from the path.
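
Concretely, something along these lines -- a sketch only, where path is a
placeholder for whatever the caller already has at hand:

    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not write file \"%s\": %m", path)));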

Would that not work here?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> There's that, too.  But again, these messages are not can't-happen
> scenarios.  The argument is just whether to reuse existing error
> message text (like "could not write file") or invent a new variation
> (like "could not write remapping file").

As long as the message includes the file name, which it surely oughta,
I don't see that we need any explanation of what Postgres thinks the
file is for.  If someone cares about that they can reverse-engineer
it from the file name; while as you said upthread, most of the time
the directory path is going to be the key piece of useful information.

So +1 for "could not write file".

> Andres' argument (which is
> valid) is that distinguished messages make it easier to troubleshoot
> without needing to turn on verbose error messages.  My argument (which
> I think is also valid) is that a user isn't likely to know what a
> remapping file is, and having more messages increases the translation
> burden.  Is there a project policy on this topic?

I think Andres' argument is a thinly veiled version of "let's put the
routine name into the message text", which there definitely is project
policy against (see 49.3.13 in the message style guide).  If you want to
know the code location where the error was thrown, the answer is to get
a verbose log, not to put identifying information into the user-facing
message text.  And this is only partially-identifying information,
which seems like the worst of both worlds: you've got confused users and
overworked translators, and you still don't know exactly where it was
thrown from.
        regards, tom lane



Re: logical changeset generation v6.8

From
Andres Freund
Date:
On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
> > Yes, I think we could mostly reuse it, we'd probably want to add a field
> > or two more (application_name, sync_prio?). I have been wondering
> > whether some of the code in replication/logical/logical.c shouldn't be
> > in replication/slot.c or similar. So far I've opted for leaving it in
> > its current place since it would have to change a bit for a more general
> > role.
> 
> I strongly favor moving the slot-related code to someplace with "slot"
> in the name, and replication/slot.c seems about right.  Even if we
> don't extend them to cover non-logical replication in this release,
> we'll probably do it eventually, and it'd be better if that didn't
> require moving large amounts of code between files.

Any opinion on the storage location of the slot files? It's currently
pg_llog/$slotname/state[.tmp]. It's a directory so we have a location
during logical decoding to spill data to...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6.8

From
Robert Haas
Date:
On Tue, Dec 17, 2013 at 7:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
>> > Yes, I think we could mostly reuse it, we'd probably want to add a field
>> > or two more (application_name, sync_prio?). I have been wondering
>> > whether some of the code in replication/logical/logical.c shouldn't be
>> > in replication/slot.c or similar. So far I've opted for leaving it in
>> > its current place since it would have to change a bit for a more general
>> > role.
>>
>> I strongly favor moving the slot-related code to someplace with "slot"
>> in the name, and replication/slot.c seems about right.  Even if we
>> don't extend them to cover non-logical replication in this release,
>> we'll probably do it eventually, and it'd be better if that didn't
>> require moving large amounts of code between files.
>
> Any opinion on the storage location of the slot files? It's currently
> pg_llog/$slotname/state[.tmp]. It's a directory so we have a location
> during logical decoding to spill data to...

pg_replslot?  pg_replication_slot?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v6

From
Magnus Hagander
Date:
On Mon, Sep 23, 2013 at 7:03 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Sep 23, 2013 at 9:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I still find it weird/inconsistent to have:
> * pg_receivexlog
> * pg_recvlogical
> binaries, even from the same source directory. Why once "pg_recv" and
> once "pg_receive"?

+1

Digging up a really old thread since I just got annoyed by the inconsistent naming the first time myself :)

I can't find that this discussion actually came to a proper consensus, but I may be missing something. Did we go with pg_recvlogical just because we couldn't decide on a better name, or did we intentionally decide it was the best?

I definitely think pg_receivelogical would be a better name, for consistency (because it's way too late to rename pg_receivexlog of course - once released that can't really change. Which is why *if* we want to change the name of pg_recvlogical we have a few more days to make a decision..)


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: logical changeset generation v6

From
Andres Freund
Date:
On 2014-04-24 09:39:21 +0200, Magnus Hagander wrote:
> I can't find that this discussion actually came to a proper consensus, but
> I may be missing something. Did we go with pg_recvlogical just because we
> couldn't decide on a better name, or did we intentionally decide it was the
> best?

I went with pg_recvlogical because that's where the (small) majority
seemed to be. Even if I was unconvinced. There were so many outstanding
big fights at that point that I didn't want to spend my time on this ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v6

From
Magnus Hagander
Date:

On Thu, Apr 24, 2014 at 9:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-04-24 09:39:21 +0200, Magnus Hagander wrote:
> I can't find that this discussion actually came to a proper consensus, but
> I may be missing something. Did we go with pg_recvlogical just because we
> couldn't decide on a better name, or did we intentionally decide it was the
> best?

I went with pg_recvlogical because that's where the (small) majority
seemed to be. Even if I was unconvinced. There were so many outstanding
big fights at that point that I didn't want to spend my time on this ;)


I was guessing something like the second part there, which is why I figured this would be a good time to bring this fight back up to the surface ;) 

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: logical changeset generation v6

From
Andres Freund
Date:
On 2014-04-24 09:46:07 +0200, Magnus Hagander wrote:
> On Thu, Apr 24, 2014 at 9:43 AM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2014-04-24 09:39:21 +0200, Magnus Hagander wrote:
> > > I can't find that this discussion actually came to a proper consensus,
> > but
> > > I may be missing something. Did we go with pg_recvlogical just because we
> > > couldn't decide on a better name, or did we intentionally decide it was
> > the
> > > best?
> >
> > I went with pg_recvlogical because that's where the (small) majority
> > seemed to be. Even if I was unconvinced. There were so many outstanding
> > big fights at that point that I didn't want to spend my time on this ;)
> >
> >
> I was guessing something like the second part there, which is why I figured
> this would be a good time to bring this fight back up to the surface ;)

I have to admit that I still don't care too much. By now I'd tentatively
want to stay with the current name because that's what I got used to,
but if somebody else has strong opinions and finds consensus...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services