Thread: logical changeset generation v5

logical changeset generation v5

From
Andres Freund
Date:
Hi!

I am rather pleased to announce the next version of the changeset
extraction patchset. Thanks to help from a large number of people, I
think we are slowly getting to the point where it is committable.

Since the last submitted version
(20121115002746.GA7692@awork2.anarazel.de), a large number of fixes and
the results of a good amount of review have been added to the tree. All
bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones.
* arbitrarily large transaction support by spilling large transactions to disk
* spill snapshots to disk, so we can restart without waiting for a new snapshot to be built
* Don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the "logical slot" interface into being "replication slots" that can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely duplicated switch
* do some more header/comment cleanup & clarification
* move pg_receivellog into its own directory in src/bin or contrib/.
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One -
which is primarily useful for pg_regress style tests and playing around
- is SQL based, the other one uses a walsender replication connection.

A quick demonstration of the SQL interface (server needs to be started
with wal_level = logical and max_logical_slots > 0):
=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
    slotname     | xlog_position
-----------------+---------------
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);

=# INSERT INTO foo(data) VALUES(1);

=# UPDATE foo SET id = -id, data = ':'||data;

=# DELETE FROM foo;

=# DROP TABLE foo;

=# SELECT * FROM start_logical_replication('regression_slot', 'now', 'hide-xids', '0');
 location  | xid |                                      data
-----------+-----+--------------------------------------------------------------------------------
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table "foo": INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table "foo": UPDATE: old-pkey: id[int4]:1 new-tuple: id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table "foo": DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding;
    slot_name    |    plugin     | database | active | xmin | restart_decoding_lsn
-----------------+---------------+----------+--------+------+----------------------
 regression_slot | test_decoding |    12042 | f      |  695 | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--------------------------
                        0
(1 row)

The walsender interface has the same calls:
INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes
and supports synchronous replication.

The output seen in the 'data' column is produced by a so-called 'output
plugin', which users of the facility can write to suit their needs. A
plugin is written by implementing 5 functions in the shared object that's
passed to init_logical_replication() above:
* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function, pg_decode_change, gets passed a structure
containing old/new versions of the row, the 'struct Relation' belonging
to it, and meta-information about the transaction.

The output plugin can rely on syscache lookups et al. to decode the
changed tuple in whatever fashion it wants.
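
To make the shape of those callbacks concrete, here is a minimal sketch
of such a plugin. Only the entry point names come from the list above;
every parameter type is an illustrative assumption of mine, not the
patchset's actual API:

/*
 * Minimal output plugin sketch. The function names follow the list
 * above; the parameter types below are hypothetical stand-ins.
 */
#include "postgres.h"
#include "access/htup.h"
#include "lib/stringinfo.h"
#include "utils/rel.h"

typedef struct DecodedChange        /* hypothetical change descriptor */
{
    char        action;             /* 'I', 'U' or 'D' */
    HeapTuple   oldtuple;           /* old row version, if any */
    HeapTuple   newtuple;           /* new row version, if any */
} DecodedChange;

void
pg_decode_begin_txn(StringInfo out, TransactionId xid)
{
    appendStringInfoString(out, "BEGIN");
}

void
pg_decode_change(StringInfo out, TransactionId xid,
                 Relation rel, DecodedChange *change)
{
    /* The Relation provides a TupleDesc, so the old/new tuples can be
     * rendered with the ordinary type output functions. */
    appendStringInfo(out, "table \"%s\": change",
                     RelationGetRelationName(rel));
}

void
pg_decode_commit_txn(StringInfo out, TransactionId xid)
{
    appendStringInfoString(out, "COMMIT");
}

pg_decode_init and pg_decode_cleanup would bracket these for per-slot
setup and teardown.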

I'd like to invite reviewers to first look at:
* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be
interesting:
---------------
         +---------------+
         | XLogReader    |
         +---------------+
                 |
            XLOG Records
                 |
                 v
         +---------------+
         | decode.c      |
         +---------------+
            |        |
            |        |
            v        |
+---------------+    |
| snapbuild.c   | HeapTupleData
+---------------+    |
        |            |
catalog snapshots    |
        |            |
        v            v
         +---------------+
         |reorderbuffer.c|
         +---------------+
                 |
        HeapTuple & Metadata
                 |
                 v
         +---------------+
         | Output Plugin |
         +---------------+
                 |
         Whatever you want
                 |
                 v
         +---------------+
         | Output Handler|
         | WalSnd or SRF |
         +---------------+
---------------


Overview of the attached patches:
0001: indirect toast tuples; required but submitted independently
0002: functions for testing; not required
0003: (tablespace, filenode) syscache; required
0004: RelationMapFilenodeToOid: required, simple
0005: pg_relation_by_filenode() function; not required but useful
0006: Introduce InvalidCommandId: required, simple
0007: Adjust Satisfies* interface: required, mechanical
0008: Allow walsender to attach to a database: required, needs review
0009: New GetOldestXmin() parameter; required, pretty boring
0010: Log xl_running_xact regularly in the bgwriter: required
0011: make fsync_fname() public; required, needs to be in a different file
0012: Relcache support for a Relation's primary key: required
0013: Actual changeset extraction; required
0014: Output plugin demo; not required (except for testing) but useful
0015: Add pg_receivellog program: not required but useful
0016: Add test_logical_decoding extension; not required, but contains the tests for the feature. Uses 0014
0017: Snapshot building docs; not required

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
The git tree is at:
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
> Overview of the attached patches:
> 0001: indirect toast tuples; required but submitted independently
> 0002: functions for testing; not required,
> 0003: (tablespace, filenode) syscache; required
> 0004: RelationMapFilenodeToOid: required, simple
> 0005: pg_relation_by_filenode() function; not required but useful
> 0006: Introduce InvalidCommandId: required, simple
> 0007: Adjust Satisfies* interface: required, mechanical,
> 0008: Allow walsender to attach to a database: required, needs review
> 0009: New GetOldestXmin() parameter; required, pretty boring
> 0010: Log xl_running_xact regularly in the bgwriter: required
> 0011: make fsync_fname() public; required, needs to be in a different file
> 0012: Relcache support for an Relation's primary key: required
> 0013: Actual changeset extraction; required
> 0014: Output plugin demo; not required (except for testing) but useful
> 0015: Add pg_receivellog program: not required but useful
> 0016: Add test_logical_decoding extension; not required, but contains
>       the tests for the feature. Uses 0014
> 0017: Snapshot building docs; not required

Version v5-01 attached

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

>> 0007: Adjust Satisfies* interface: required, mechanical,

> Version v5-01 attached

I'm still working on a review and hope to post something more
substantive by this weekend, but when applying patches in numeric
order, this one did not compile cleanly.

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o allpaths.o allpaths.c -MMD -MP -MF .deps/allpaths.Po
vacuumlazy.c: In function ‘heap_page_is_all_visible’:
vacuumlazy.c:1725:3: warning: passing argument 1 of ‘HeapTupleSatisfiesVacuum’ from incompatible pointer type [enabled by default]
In file included from vacuumlazy.c:61:0:
../../../src/include/utils/tqual.h:84:20: note: expected ‘HeapTuple’ but argument is of type ‘HeapTupleHeader’

Could you post a new version of that?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
Hi Kevin!

On 2013-06-20 15:57:07 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
> >> 0007: Adjust Satisfies* interface: required, mechanical,
>
> > Version v5-01 attached
>
> I'm still working on a review and hope to post something more
> substantive by this weekend

Cool!

>, but when applying patches in numeric
> order, this one did not compile cleanly.
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o allpaths.o allpaths.c -MMD -MP -MF .deps/allpaths.Po
> vacuumlazy.c: In function ‘heap_page_is_all_visible’:
> vacuumlazy.c:1725:3: warning: passing argument 1 of ‘HeapTupleSatisfiesVacuum’ from incompatible pointer type [enabled by default]
> In file included from vacuumlazy.c:61:0:
> ../../../src/include/utils/tqual.h:84:20: note: expected ‘HeapTuple’ but argument is of type ‘HeapTupleHeader’
>
> Could you post a new version of that?

Hrmpf. There was one hunk in 0013 instead of 0007.

I made sure that every commit again applies and compiles cleanly. git
rebase -i --exec to the rescue.

Found two other issues:
* recptr not assigned in 0010
* unsafe use of a non-volatile variable across longjmp() in 0013

Pushed and attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch
> xlog-decoding-rebasing-cf4
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
>
> On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
>> Overview of the attached patches:
>> 0001: indirect toast tuples; required but submitted independently
>> 0002: functions for testing; not required,
>> 0003: (tablespace, filenode) syscache; required
>> 0004: RelationMapFilenodeToOid: required, simple
>> 0005: pg_relation_by_filenode() function; not required but useful
>> 0006: Introduce InvalidCommandId: required, simple
>> 0007: Adjust Satisfies* interface: required, mechanical,
>> 0008: Allow walsender to attach to a database: required, needs review
>> 0009: New GetOldestXmin() parameter; required, pretty boring
>> 0010: Log xl_running_xact regularly in the bgwriter: required
>> 0011: make fsync_fname() public; required, needs to be in a different file
>> 0012: Relcache support for an Relation's primary key: required
>> 0013: Actual changeset extraction; required
>> 0014: Output plugin demo; not required (except for testing) but useful
>> 0015: Add pg_receivellog program: not required but useful
>> 0016: Add test_logical_decoding extension; not required, but contains
>>       the tests for the feature. Uses 0014
>> 0017: Snapshot building docs; not required
>
> Version v5-01 attached

Confirmed that all 17 patch files now apply cleanly, and that `make
check-world` builds cleanly after each patch in turn.

Reviewing and testing the final result now.  If that all looks
good, will submit a separate review of each patch.

Simon, do you want to do the final review and commit after I do
each piece?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:

> Confirmed that all 17 patch files now apply cleanly, and that `make
> check-world` builds cleanly after each patch in turn.

Just to be paranoid, I did one last build with all 17 patch files
applied to 7dfd5cd21c0091e467b16b31a10e20bbedd0a836 using this
line:

make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl --with-perl --with-python && make -j4 world

and it died with this:

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../src/interfaces/libpq -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_receivexlog.o pg_receivexlog.c -MMD -MP -MF .deps/pg_receivexlog.Po
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I. -I. -I../../../src/interfaces/libpq -I../../../src/bin/pg_dump -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mainloop.o mainloop.c -MMD -MP -MF .deps/mainloop.Po
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g pg_receivellog.o receivelog.o streamutil.o -L../../../src/port -lpgport -L../../../src/common -lpgcommon -L../../../src/interfaces/libpq -lpq -L../../../src/port -L../../../src/common -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/kgrittn/pg/master/Debug/lib',--enable-new-dtags -lpgport -lpgcommon -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lcrypt -ldl -lm -o pg_receivellog
gcc: error: pg_receivellog.o: No such file or directory
make[3]: *** [pg_receivellog] Error 1
make[3]: Leaving directory `/home/kgrittn/pg/master/src/bin/pg_basebackup'
make[2]: *** [all-pg_basebackup-recurse] Error 2
make[2]: *** Waiting for unfinished jobs....

It works with this patch-on-patch:

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index a41b73c..18d02f3 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -42,6 +42,7 @@ installdirs:
 uninstall:
    rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
    rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+   rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
 
 clean distclean maintainer-clean:
-   rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
+   rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o

It appears to be an omission from file 0015.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:

>  uninstall:

>     rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
>     rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +   rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

Oops.  That part is not needed.

>  clean distclean maintainer-clean:
> -   rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
> +   rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o

Just that part.


--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> Pushed and attached.

The contrib/test_logical_decoding/sql/ddl.sql script is generating
unexpected results.  For both table_with_pkey and
table_with_unique_not_null, updates of the primary key column are
showing:

old-pkey: id[int4]:0

... instead of the expected value of 2 or -2.

See attached.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>
> > Confirmed that all 17 patch files now apply cleanly, and that `make
> > check-world` builds cleanly after each patch in turn.
>
> Just to be paranoid, I did one last build with all 17 patch files
> applied to 7dfd5cd21c0091e467b16b31a10e20bbedd0a836 using this
> line:
>
> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl --with-perl --with-python && make -j4 world
>
> and it died with this:
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../src/interfaces/libpq -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_receivexlog.o pg_receivexlog.c -MMD -MP -MF .deps/pg_receivexlog.Po
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I. -I. -I../../../src/interfaces/libpq -I../../../src/bin/pg_dump -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mainloop.o mainloop.c -MMD -MP -MF .deps/mainloop.Po
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g pg_receivellog.o receivelog.o streamutil.o -L../../../src/port -lpgport -L../../../src/common -lpgcommon -L../../../src/interfaces/libpq -lpq -L../../../src/port -L../../../src/common -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/kgrittn/pg/master/Debug/lib',--enable-new-dtags -lpgport -lpgcommon -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lcrypt -ldl -lm -o pg_receivellog
> gcc: error: pg_receivellog.o: No such file or directory
> make[3]: *** [pg_receivellog] Error 1
> make[3]: Leaving directory `/home/kgrittn/pg/master/src/bin/pg_basebackup'
> make[2]: *** [all-pg_basebackup-recurse] Error 2
> make[2]: *** Waiting for unfinished jobs....

I have seen that once as well. It's really rather strange since
pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
reproduce it reliably though, even after a dozen rebuilds or so.

> It works with this patch-on-patch:

> diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
> index a41b73c..18d02f3 100644
> --- a/src/bin/pg_basebackup/Makefile
> +++ b/src/bin/pg_basebackup/Makefile
> @@ -42,6 +42,7 @@ installdirs:
>  uninstall:
>     rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
>     rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +   rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>  
>  clean distclean maintainer-clean:
> -   rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
> +   rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
>
> It appears to be an omission from file 0015.

Yes, both are missing.

> > +   rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
> Oops.  That part is not needed.

Hm. Why not?

I don't think either hunk has anything to do with that build failure
though - can you reproduce the error without?

Thanks,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Pushed and attached.
>
> The contrib/test_logical_decoding/sql/ddl.sql script is generating
> unexpected results.  For both table_with_pkey and
> table_with_unique_not_null, updates of the primary key column are
> showing:
>
> old-pkey: id[int4]:0
>
> ... instead of the expected value of 2 or -2.
>
> See attached.

Hm. Any chance this was an incomplete rebuild? I seem to remember having
seen that once because some header dependency wasn't recognized
correctly after applying some patch.

Otherwise, could you give me:
* the version you applied the patch on
* os/compiler

Because I can't reproduce it, despite some playing around...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>> gcc: error: pg_receivellog.o: No such file or directory
>> make[3]: *** [pg_receivellog] Error 1

> I have seen that once as well. It's really rather strange since
> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> reproduce it reliably though, even after doing some dozen rebuilds or so.

What versions of gmake are you guys using?  It wouldn't be the first
time we've tripped over bugs in parallel make.  See for instance
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-23 16:48:41 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
> >> gcc: error: pg_receivellog.o: No such file or directory
> >> make[3]: *** [pg_receivellog] Error 1
> 
> > I have seen that once as well. It's really rather strange since
> > pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> > reproduce it reliably though, even after doing some dozen rebuilds or so.
> 
> What versions of gmake are you guys using?  It wouldn't be the first
> time we've tripped over bugs in parallel make.  See for instance
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447

3.81 here. That was supposed to be the "safe" one, right? At least with
respect to the bugs seen/fixed recently.

Kevin, any chance you still have more log than in the upthread mail
available?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:

>> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug
>> --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl
>> --with-perl --with-python && make -j4 world

>> [ build failure referencing pg_receivellog.o ]

> I have seen that once as well. It's really rather strange since
> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> reproduce it reliably though, even after doing some dozen rebuilds or so.
>
>> It works with this patch-on-patch:

>>  clean distclean maintainer-clean:
>> -   rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o
>> pg_receivexlog.o pg_receivellog.o
>> +   rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS)
>> pg_basebackup.o pg_receivexlog.o pg_receivellog.o

>> > +  rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>> Oops.  That part is not needed.
>
> Hm. Why not?

Well, I could easily be wrong on just about anything to do with
make files, but on a second look that appeared to be dealing with
eliminating an installed pg_receivellog binary, which is not
created.

> I don't think either hunk has anything to do with that buildfailure
> though - can you reproduce the error without?

I tried that scenario three times and it failed three times.  Then
I made the above changes and it worked.  Then I eliminated the one
on the uninstall target and tried a couple more times and it worked
on both attempts.  The scenario is to have a `make world` build in
the source tree, and run the above line starting with `make
maintainer-clean` and going to `make -j4 world`.

I did notice that without that change to the maintainer-clean
target I did not get a pg_receivellog.Po file in
src/bin/pg_basebackup/.deps/ -- and with it I do.  I admit to being
at about a 1.5 on a 10 point scale of make file competence -- I
just look for patterns used for things similar to what I want to do
and copy without much understanding of what it all means.  :-(  So
when I got an error on pg_receivellog which didn't happen on
pg_receivexlog, I looked for differences -- my suggestion has no
more basis than that and the fact that empirical testing seemed to
show that it worked.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 16:48:41 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>>>> gcc: error: pg_receivellog.o: No such file or directory
>>>> make[3]: *** [pg_receivellog] Error 1
>>
>>> I have seen that once as well. It's really rather strange since
>>> pg_receivellog.o is a clear prerequisite for pg_receivellog. I
>>> couldn't reproduce it reliably though, even after doing some
>>> dozen rebuilds or so.
>>
>> What versions of gmake are you guys using?  It wouldn't be the
>> first time we've tripped over bugs in parallel make.  See for
>> instance
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447
>
> 3.81 here. That was supposed to be the "safe" one, right? At
> least to the bugs seen/fixed recently.

There is no executable named gmake in my distro, but...

kgrittn@Kevin-Desktop:~/pg/master$ make --version
GNU Make 3.81

Which is what I'm using.

> Kevin, any chance you still have more log than in the upthread
> mail available?

Well, I just copied from the console, and that was gone; but
reverting my change I get the same thing.  All console output
attached.  Let me know if you need something else.

Note that the dependency file disappeared:

kgrittn@Kevin-Desktop:~/pg/master$ ll src/bin/pg_basebackup/.deps/
total 24
drwxrwxr-x 2 kgrittn kgrittn 4096 Jun 24 08:57 ./
drwxrwxr-x 4 kgrittn kgrittn 4096 Jun 24 08:57 ../
-rw-rw-r-- 1 kgrittn kgrittn 1298 Jun 24 08:57 pg_basebackup.Po
-rw-rw-r-- 1 kgrittn kgrittn 1729 Jun 24 08:57 pg_receivexlog.Po
-rw-rw-r-- 1 kgrittn kgrittn 1646 Jun 24 08:57 receivelog.Po
-rw-rw-r-- 1 kgrittn kgrittn  953 Jun 24 08:57 streamutil.Po

It was there from the build with the change I made to the
maintainer-clean target, and went away when I built without it.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:

>> The contrib/test_logical_decoding/sql/ddl.sql script is generating
>> unexpected results.  For both table_with_pkey and
>> table_with_unique_not_null, updates of the primary key column are
>> showing:
>>
>> old-pkey: id[int4]:0
>>
>> ... instead of the expected value of 2 or -2.
>>
>> See attached.
>
> Hm. Any chance this was an incomplete rebuild?

With my hack on the pg_basebackup Makefile, `make -j4 world` is
finishing with no errors and:

PostgreSQL, contrib, and documentation successfully made. Ready to install.

> I seem to remember having seen that once because some header
> dependency wasn't recognized correctly after applying some patch.

I wonder whether this is related to the build problems we've been
discussing on the other fork of this thread.  I was surprised to
see this error when I got past the maintainer-clean full build
problems, because I thought I had seen clean `make check-world`
regression tests after applying each incremental patch file.  Until
I read this I had been assuming that somehow I missed the error on
the 16th and 17th iterations; but now I'm suspecting that I didn't
miss anything after all -- it may just be another symptom of the
build problems.

> Otherwise, could you give me:
> * the version you aplied the patch on

7dfd5cd21c0091e467b16b31a10e20bbedd0a836

> * os/compiler

Linux Kevin-Desktop 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2

> Because I can't reproduce it, despite some playing around...

Maybe if you can reproduce the build problems I'm seeing....

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-24 06:44:53 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>
> >> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug
> >> --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl
> >> --with-perl --with-python && make -j4 world
>
> >> [ build failure referencing pg_receivellog.o ]
>
> > I have seen that once as well. It's really rather strange since
> > pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> > reproduce it reliably though, even after doing some dozen rebuilds or so.
> >
> >> It works with this patch-on-patch:
>
> >>  clean distclean maintainer-clean:
> >> -   rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o
> >> pg_receivexlog.o pg_receivellog.o
> >> +   rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS)
> >> pg_basebackup.o pg_receivexlog.o pg_receivellog.o
>
> >> > +  rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
> >> Oops.  That part is not needed.
> >
> > Hm. Why not?
>
> Well, I could easily be wrong on just about anything to do with
> make files, but on a second look that appeared to be dealing with
> eliminating an installed pg_receivellog binary, which is not
> created.

I think it actually is?

install: all installdirs
    $(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
    $(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
    $(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

> > I don't think either hunk has anything to do with that buildfailure
> > though - can you reproduce the error without?
>
> I tried that scenario three times and it failed three times.  Then
> I made the above changes and it worked.  Then I eliminated the one
> on the uninstall target and tried a couple more times and it worked
> on both attempts.  The scenario is to have a `make world` build in
> the source tree, and run the above line starting with `make
> maintainer-clean` and going to `make -j4 world`.

Hm. I think it might be something in make's intermediate target logic
biting us. Anyway, if the patch fixes that: Great ;). Merged it locally
since it's obviously missing.

> I did notice that without that change to the maintainer-clean
> target I did not get a pg_receivellog.Po file in
> src/bin/pg_basebackup/.deps/ -- and with it I do.

Yea, according to your log it's not even built before pg_receivellog is
linked.

Thanks,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-24 06:44:53 -0700, Kevin Grittner wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:

>>>>> +  rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>>>> Oops.  That part is not needed.
>>>
>>> Hm. Why not?
>>
>> Well, I could easily be wrong on just about anything to do with
>> make files, but on a second look that appeared to be dealing with
>> eliminating an installed pg_receivellog binary, which is not
>> created.
>
> I think it actually is?

Oh, yeah....  I see it now.  I warned you I could be wrong.  :-/

I just had a thought -- perhaps the dependency information
is being calculated incorrectly.  Attached is the dependency file
from the successful build (with the adjusted Makefile), which still
fails the test_logical_decoding regression test, with the same diff.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-24 07:29:43 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:
>
> >> The contrib/test_logical_decoding/sql/ddl.sql script is generating
> >> unexpected results.  For both table_with_pkey and
> >> table_with_unique_not_null, updates of the primary key column are
> >> showing:
> >>
> >> old-pkey: id[int4]:0
> >>
> >> ... instead of the expected value of 2 or -2.
> >>
> >> See attached.
> >
> > Hm. Any chance this was an incomplete rebuild?
>
> With my hack on the pg_basebackup Makefile, `make -j4 world` is
> finishing with no errors and:

Hm. There were some issues with the test_logical_decoding Makefile not
cleaning up the regression installation properly, which might have
caused the issue.

Could you try after applying the patches and executing a clean and then
rebuild?

Otherwise, could you try applying my git tree so we are sure we test the
same thing?

$ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
$ git fetch af
$ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
$ ./configure ...
$ make

> > Because I can't reproduce it, despite some playing around...
>
> Maybe if you can reproduce the build problems I'm seeing....

Tried your recipe but still couldn't...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
I'm looking at the combined patches 0003-0005, which are essentially all
about adding a function to obtain relation OID from (tablespace,
filenode).  It takes care to look through the relation mapper, and uses
a new syscache underneath for performance.

One question about this patch, originally, was about the usage of
that relfilenode syscache.  It is questionable because it would be the
only syscache to apply on top of a non-unique index.  It is said that
this doesn't matter because the only non-unique values that can exist
would reference entries that have relfilenode = 0; and in turn this
doesn't matter because those values would be queried through the
relation mapper anyway, not from the syscache.  (This is implemented in
the higher-level function.)

This means that there would be one syscache that is damn easy to misuse
... and we've set up things so that syscaches are very easy to use in the
first place.  From that perspective, this doesn't look good.  However,
it's an easy mistake to notice and fix, so perhaps this is not a serious
problem.  (I would much prefer for there to be a way to define partial
indexes in BKI.)

I'm not sure about the placing of the new SQL-callable function in
dbsize.c either.  It is certainly not a function that has anything to do
with object sizes.  The insides of it would belong more in lsyscache.c,
I think, except then that file does not otherwise concern itself with
the relation mapper so its scope would have to expand a bit.  But this
is no place for the SQL-callable portion, so that would have to find a
different home as well.

The other option, of course, is to provide a separate caching layer for
these objects altogether, but given how concise this implementation is,
that doesn't sound too palatable.

Thoughts?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
Hi,

On 2013-06-27 17:33:04 -0400, Alvaro Herrera wrote:
> One question about this patch, originally, was about the usage of
> that relfilenode syscache.  It is questionable because it would be the
> only syscache to apply on top of a non-unique index.  It is said that
> this doesn't matter because the only non-unique values that can exist
> would reference entries that have relfilenode = 0; and in turn this
> doesn't matter because those values would be queried through the
> relation mapper anyway, not from the syscache.  (This is implemented in
> the higher-level function.)

Well, you can even query the syscache without harm for mapped relations,
you just won't get an answer. The only thing you may not do, because it
would yield multiple results, is to query the syscache with
(tablespace, InvalidOid/0). Which is still not nice, although it doesn't
make much sense to query with InvalidOid anyway.

> I'm not sure about the placing of the new SQL-callable function in
> dbsize.c either.  It is certainly not a function that has anything to do
> with object sizes.

Not happy with that myself. I only placed the function there because
pg_relation_filenode() already was in it. Happy to change if somebody
has a good idea.

> (I would much prefer for there to be a way to define partial
> indexes in BKI.)

I don't think that's the hard part, it's that we don't use the full
machinery for updating indexes but rather the relatively simplistic
CatalogUpdateIndexes(). I am not sure we can guarantee that the required
infrastructure is available in all the cases to support doing generic
predicate evaluation.

Should BKI really be the problem, we could probably create the index
after BKI-based bootstrapping has finished.

Thanks,

Andres Freund
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
Andres Freund wrote:

> On 2013-06-27 17:33:04 -0400, Alvaro Herrera wrote:
> > One question about this patch, originally, was about the usage of
> > that relfilenode syscache.  It is questionable because it would be the
> > only syscache to apply on top of a non-unique index.  It is said that
> > this doesn't matter because the only non-unique values that can exist
> > would reference entries that have relfilenode = 0; and in turn this
> > doesn't matter because those values would be queried through the
> > relation mapper anyway, not from the syscache.  (This is implemented in
> > the higher-level function.)
> 
> Well, you can even query the syscache without hurt for mapped relations,
> you just won't get an answer. The only thing you may not do because it
> would yield multiple results is to query the syscache with
> (tablespace, InvalidOid/0). Which is still not nice although it doesn't
> make much sense to query with InvalidOid.

Yeah, I agree that it doesn't make sense to query for that.  The problem
is that something could reasonably be developed that uses the syscache
directly without checking whether the relfilenode is 0.

> > (I would much prefer for there to be a way to define partial
> > indexes in BKI.)
> 
> I don't think that's the hard part, it's that we don't use the full
> machinery for updating indexes but rather the relatively simplistic
> CatalogUpdateIndexes(). I am not sure we can guarantee that the required
> infrastructure is available in all the cases to support doing generic
> predicate evaluation.

You're right, CatalogIndexInsert() doesn't allow for predicates, so
fixing BKI would not help.

I still wonder about having a separate cache.  Right now pg_class has
two indexes; adding this new one would mean a rather large decrease in
insert performance (50% more indexes to update than previously), which
is not good considering that it's inserted into for each and every temp
table creation -- that would become slower.  This would be a net loss
for every user, even those that don't want logical replication.  On the
other hand, table creation also has to add tuples to pg_attribute,
pg_depend, pg_shdepend and maybe other catalogs, so perhaps the
difference is negligible.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I'm looking at the combined patches 0003-0005, which are essentially all
> about adding a function to obtain relation OID from (tablespace,
> filenode).  It takes care to look through the relation mapper, and uses
> a new syscache underneath for performance.

> One question about this patch, originally, was about the usage of
> that relfilenode syscache.  It is questionable because it would be the
> only syscache to apply on top of a non-unique index.

... which, I assume, is on top of a pg_class index that doesn't exist
today.  Exactly what is the argument that says performance of this
function is sufficiently critical to justify adding both the maintenance
overhead of a new pg_class index, *and* a broken-by-design syscache?

Lose the cache and this probably gets a lot easier to justify.  As is,
I think I'd vote to reject altogether.
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Thu, Jun 27, 2013 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I'm looking at the combined patches 0003-0005, which are essentially all
>> about adding a function to obtain relation OID from (tablespace,
>> filenode).  It takes care to look through the relation mapper, and uses
>> a new syscache underneath for performance.
>
>> One question about this patch, originally, was about the usage of
>> that relfilenode syscache.  It is questionable because it would be the
>> only syscache to apply on top of a non-unique index.
>
> ... which, I assume, is on top of a pg_class index that doesn't exist
> today.  Exactly what is the argument that says performance of this
> function is sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?
>
> Lose the cache and this probably gets a lot easier to justify.  As is,
> I think I'd vote to reject altogether.

I already voted that way, and nothing's happened since to change my mind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> Hm. There were some issues with the test_logical_decoding
> Makefile not cleaning up the regression installation properly.
> Which might have caused the issue.
>
> Could you try after applying the patches and executing a clean
> and then rebuild?

Tried, and problem persists.

> Otherwise, could you try applying my git tree so we are sure we
> test the same thing?
>
> $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> $ git fetch af
> $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> $ ./configure ...
> $ make

Tried that, too, and problem persists.  The log shows the last
commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

Because you mention possible problems with the regression test
cleanup for test_logical_decoding I also tried:

rm -fr contrib/test_logical_decoding/
git reset --hard HEAD
make world
make check-world

I get the same failure, with primary key or unique index column
showing as 0 in results.

I am off on vacation tomorrow and next week.  Will dig into this
with gdb if not solved when I get back -- unless you have a better
suggestion for how to figure it out.

Once this is solved, I will be working with testing the final
result of all these layers, including creating a second output
plugin.  I want to confirm that multiple plugins play well
together.  I'm glad to see other eyes also on this patch set.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-27 18:18:50 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I'm looking at the combined patches 0003-0005, which are essentially all
> > about adding a function to obtain relation OID from (tablespace,
> > filenode).  It takes care to look through the relation mapper, and uses
> > a new syscache underneath for performance.
> 
> > One question about this patch, originally, was about the usage of
> > that relfilenode syscache.  It is questionable because it would be the
> > only syscache to apply on top of a non-unique index.
> 
> ... which, I assume, is on top of a pg_class index that doesn't exist
> today.  Exactly what is the argument that says performance of this
> function is sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?

Ok, so this requires some context. When we do the changeset extraction
we build an mvcc snapshot that, for every heap wal record, is consistent
with one made at the time the record was inserted. Then, once we've
built that snapshot, we can use it to turn heap wal records into the
representation the user wants:

For that we first need to know which table a change comes from, since
otherwise we obviously cannot interpret the HeapTuple that's essentially
contained in the wal record. Since we have a correct mvcc
snapshot we can query pg_class for (tablespace, relfilenode) to get back
the relation. When we know the relation, the user (i.e. the output
plugin) can use normal backend code to transform the HeapTuple into the
target representation, e.g. SQL, since we can build a TupleDesc. Since
the syscaches are synchronized with the built snapshot, normal output
functions can be used.

What that means is that for every heap record in the target database in
the WAL we need to query pg_class to turn the relfilenode into a
pg_class.oid. So, we can easily replace syscache.c with some custom
caching code, but I don't think it's realistic to get rid of that
index. Otherwise we need to cache the entire pg_class in memory which
doesn't sound enticing.
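
As a concrete sketch of that lookup path (RelationMapFilenodeToOid()
follows patch 0004's title; the RELFILENODE syscache id stands in for
whatever patch 0003 actually names it, so treat both as illustrative):

/*
 * Sketch of the relfilenode -> pg_class.oid lookup described above,
 * as it might look inside the backend. Names are illustrative.
 */
static Oid
LookupRelidByFilenode(Oid tablespace, Oid filenode)
{
    Oid         relid;

    /* Mapped relations (e.g. nailed catalogs) have relfilenode = 0 in
     * pg_class, so they have to be resolved via the relation mapper. */
    relid = RelationMapFilenodeToOid(filenode,
                                     tablespace == GLOBALTABLESPACE_OID);
    if (OidIsValid(relid))
        return relid;

    /* Everything else can be found through the new (tablespace,
     * filenode) syscache over pg_class. */
    return GetSysCacheOid2(RELFILENODE,
                           ObjectIdGetDatum(tablespace),
                           ObjectIdGetDatum(filenode));
}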

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Fri, Jun 28, 2013 at 3:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> What that means is that for every heap record in the target database in
> the WAL we need to query pg_class to turn the relfilenode into a
> pg_class.oid. So, we can easily replace syscache.c with some custom
> caching code, but I don't think it's realistic to get rid of that
> index. Otherwise we need to cache the entire pg_class in memory which
> doesn't sound enticing.

The alternative I previously proposed was to make the WAL records
carry the relation OID.  There are a few problems with that: one is
that it's a waste of space when logical replication is turned off, and
it might not be easy to only do it when logical replication is on.
Also, even when logical replication is turned on, things that make WAL
bigger aren't wonderful.  On the other hand, it does avoid the
overhead of another index on pg_class.
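
For scale, the proposal amounts to something like the following, based
on the xl_heap_insert definition quoted later in this thread (this is an
illustration, not code from any patch):

typedef struct xl_heap_insert
{
    xl_heaptid  target;                 /* inserted tuple id */
    Oid         relid;                  /* NEW: pg_class OID of the table */
    bool        all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
    /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_insert;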

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
> On Fri, Jun 28, 2013 at 3:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > What that means is that for every heap record in the target database in
> > the WAL we need to query pg_class to turn the relfilenode into a
> > pg_class.oid. So, we can easily replace syscache.c with some custom
> > caching code, but I don't think it's realistic to get rid of that
> > index. Otherwise we need to cache the entire pg_class in memory which
> > doesn't sound enticing.
> 
> The alternative I previously proposed was to make the WAL records
> carry the relation OID.  There are a few problems with that: one is
> that it's a waste of space when logical replication is turned off, and
> it might not be easy to only do it when logical replication is on.
> > Also, even when logical replication is turned on, things that make WAL
> bigger aren't wonderful.  On the other hand, it does avoid the
> overhead of another index on pg_class.

I personally favor making catalog modifications a bit more
expensive instead of increasing the WAL volume during routine
operations. I don't think index maintenance itself comes close to the
biggest cost for DDL we have atm.
It also increases the modifications needed to important heap_*
functions, which doesn't make me happy.

How do others see this tradeoff?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> Tried that, too, and problem persists.  The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

> I get the same failure, with primary key or unique index column
> showing as 0 in results.

I have run enough iterations of the test suite locally now that I am
confident it's not just happenstance that I don't see this :/. I am
going to clone your environment as closely as I can to see where the
issue might be as well as going over those codepaths...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Peter Eisentraut
Date:
On 6/28/13 8:46 AM, Andres Freund wrote:
> I personally favor making catalog modifications a bit more
> expensive instead of increasing the WAL volume during routine
> operations. I don't think index maintenance itself comes close to the
> biggest cost for DDL we have atm.

That makes sense to me in principle.



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
>> The alternative I previously proposed was to make the WAL records
>> carry the relation OID.  There are a few problems with that: one is
>> that it's a waste of space when logical replication is turned off, and
>> it might not be easy to only do it when logical replication is on.
>> Also, even when logical replication is turned on, things that make WAL
>> bigger aren't wonderful.  On the other hand, it does avoid the
>> overhead of another index on pg_class.

> I personally favor making catalog modifications a bit more
> expensive instead of increasing the WAL volume during routine
> operations.

This argument is nonsense, since it conveniently ignores the added WAL
entries created as a result of additional pg_class index manipulations.

Robert's idea sounds fairly reasonable to me; another 4 bytes per
insert/update/delete WAL entry isn't that big a deal, and it would
probably ease many debugging tasks as well as what you want to do.
So I'd vote for including the rel OID all the time, not conditionally.

The real performance argument against the patch as you have it is that
it saddles every PG installation with extra overhead for pg_class
updates whether or not that installation ever has or ever will make use
of changeset generation --- unlike including rel OIDs in WAL entries,
which might be merely difficult to handle conditionally, it's flat-out
impossible to turn such an index on or off.  Moreover, even if one is
using changeset generation, the overhead is being imposed at the wrong
place, ie the master not the slave doing changeset extraction.

But that's not the only problem, nor even the worst one IMO.  I said
before that a syscache with a non-unique key is broken by design, and
I stand by that estimate.  Even assuming that this usage doesn't create
bugs in the code as it stands, it might well foreclose future changes or
optimizations that we'd like to make in the catcache code.

If you don't want to change WAL contents, what I think you should do
is create a new cache mechanism (perhaps by extending the relmapper)
that caches relfilenode to OID lookups and acts entirely inside the
changeset-generating slave.  Hacking up the catcache instead of doing
that is an expedient kluge that will come back to bite us.
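
(To illustrate, a minimal sketch of such a backend-local cache using the
existing dynahash API; all names below are made up, and invalidation,
e.g. via relcache inval callbacks, is omitted:)

/*
 * Illustrative only: a backend-local (tablespace, filenode) -> oid
 * cache built on dynahash.
 */
#include "postgres.h"
#include "utils/hsearch.h"

typedef struct FilenodeKey
{
    Oid         tablespace;
    Oid         filenode;
} FilenodeKey;

typedef struct FilenodeEntry
{
    FilenodeKey key;                /* hash key; must be first */
    Oid         relid;
} FilenodeEntry;

static HTAB *FilenodeMap = NULL;

static void
InitFilenodeMap(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(FilenodeKey);
    ctl.entrysize = sizeof(FilenodeEntry);
    ctl.hash = tag_hash;

    FilenodeMap = hash_create("filenode to oid map", 1024, &ctl,
                              HASH_ELEM | HASH_FUNCTION);
}

static void
FilenodeMapRemember(Oid tablespace, Oid filenode, Oid relid)
{
    FilenodeKey key;
    bool        found;
    FilenodeEntry *entry;

    key.tablespace = tablespace;
    key.filenode = filenode;
    entry = (FilenodeEntry *) hash_search(FilenodeMap, &key,
                                          HASH_ENTER, &found);
    entry->relid = relid;
}
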
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-28 10:49:26 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
> >> The alternative I previously proposed was to make the WAL records
> >> carry the relation OID.  There are a few problems with that: one is
> >> that it's a waste of space when logical replication is turned off, and
> >> it might not be easy to only do it when logical replication is on.
> >> Also, even when logical replication is turned on, things that make WAL
> >> bigger aren't wonderful.  On the other hand, it does avoid the
> >> overhead of another index on pg_class.
> 
> > I personally favor making catalog modifications a bit more
> > expensive instead of increasing the WAL volume during routine
> > operations.
> 
> This argument is nonsense, since it conveniently ignores the added WAL
> entries created as a result of additional pg_class index manipulations.

Huh? Sure, pg_class manipulations get more expensive. But in most
clusters pg_class modifications are by far the minority compared to the
rest of the updates performed.

> Robert's idea sounds fairly reasonable to me; another 4 bytes per
> insert/update/delete WAL entry isn't that big a deal, and it would
> probably ease many debugging tasks as well as what you want to do.
> So I'd vote for including the rel OID all the time, not conditionally.

Ok, I can sure live with that. I don't think it's a problem to make it
conditional if we want to. Making it unconditional would sure make WAL
debugging in general more pleasant though.

> The real performance argument against the patch as you have it is that
> it saddles every PG installation with extra overhead for pg_class
> updates whether or not that installation ever has or ever will make use
> of changeset generation --- unlike including rel OIDs in WAL entries,
> which might be merely difficult to handle conditionally, it's flat-out
> impossible to turn such an index on or off.  Moreover, even if one is
> using changeset generation, the overhead is being imposed at the wrong
> place, ie the master not the slave doing changeset extraction.

There are no required slaves for doing changeset extraction
anymore. Various people opposed that pretty violently, so it's now all
happening on the master. Which IMHO turned out to be the right decision.

We can do it on Hot Standby nodes, but its absolutely not required.

> But that's not the only problem, nor even the worst one IMO.  I said
> before that a syscache with a non-unique key is broken by design, and
> I stand by that estimate.  Even assuming that this usage doesn't create
> bugs in the code as it stands, it might well foreclose future changes or
> optimizations that we'd like to make in the catcache code.

Since the only duplicate key that can possibly occur in that cache is
InvalidOid, I wondered whether we could define a 'filter' that prevents
those from ending up in the cache? Then the cache would be unique.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Fri, Jun 28, 2013 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert's idea sounds fairly reasonable to me; another 4 bytes per
> insert/update/delete WAL entry isn't that big a deal, ...

How big a deal is it?  This is a serious question, because I don't
know.  Let's suppose that the average size of an XLOG_HEAP_INSERT
record is 100 bytes.  Then if we add 4 bytes, isn't that a 4%
overhead?  And doesn't that seem significant?

I'm just talking out of my rear end here because I don't know what the
real numbers are, but it's far from obvious to me that there's any
free lunch here.  That having been said, just because indexing
relfilenode or adding relfilenodes to WAL records is expensive doesn't
mean we shouldn't do it.  But I think we need to know the price tag
before we can judge whether to make the purchase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Fri, Jun 28, 2013 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert's idea sounds fairly reasonable to me; another 4 bytes per
> > insert/update/delete WAL entry isn't that big a deal, ...
> 
> How big a deal is it?  This is a serious question, because I don't
> know.  Let's suppose that the average size of an XLOG_HEAP_INSERT
> record is 100 bytes.  Then if we add 4 bytes, isn't that a 4%
> overhead?  And doesn't that seem significant?

An INSERT wal record is:

typedef struct xl_heap_insert
{
	xl_heaptid	target;			/* inserted tuple id */
	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
	/* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_insert;

typedef struct xl_heap_header
{
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		t_hoff;
} xl_heap_header;

So the fixed part is just 7 bytes + 5 bytes; tuple data follows that.
So adding four more bytes could indeed be significant (but by how much,
depends on the size of the tuple data).  Adding a new pg_class index
would cost more in the sense that there are more WAL records, and
there's the extra vacuuming traffic; but on the other hand that would
only happen when tables are created.  It seems safe to assume that in
normal use cases the ratio of tuple insertions vs. table creations is
large.

The only idea that springs to mind is to have the new pg_class index be
created conditionally, i.e. only when logical replication is going to be
used.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> An INSERT wal record is:
> 
> typedef struct xl_heap_insert
> {
>     xl_heaptid    target;            /* inserted tuple id */
>     bool        all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
>     /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
> } xl_heap_insert;

Oops.  xl_heaptid is not 6 bytes, but instead:

typedef struct xl_heaptid
{
	RelFileNode node;
	ItemPointerData tid;
} xl_heaptid;

typedef struct RelFileNode
{
	Oid			spcNode;
	Oid			dbNode;
	Oid			relNode;
} RelFileNode;		/* 12 bytes */

typedef struct ItemPointerData
{
	BlockIdData ip_blkid;
	OffsetNumber ip_posid;
} ItemPointerData;	/* 6 bytes */

typedef struct BlockIdData
{
	uint16		bi_hi;
	uint16		bi_lo;
} BlockIdData;		/* 4 bytes */

typedef uint16 OffsetNumber;

There's purposely no alignment padding anywhere, so xl_heaptid totals 18 bytes.


Therefore,

> So the fixed part is just 19 bytes + 5 bytes; tuple data follows that.
> So adding four more bytes could indeed be significant (but by how much,
> depends on the size of the tuple data).

4 extra bytes on top of 24 is about 17% of added overhead (considering
only the xlog header.)
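
For anyone who wants to redo the arithmetic, here's a trivial standalone
check (plain C; the sizes are hard-coded from the structs above rather
than pulled from the server headers, so treat it as an illustration only):

#include <stdio.h>

int
main(void)
{
	/* sizes hard-coded from the structs quoted above */
	int		heaptid = 12 + 6;		/* RelFileNode + ItemPointerData */
	int		insert = heaptid + 1;		/* + all_visible_cleared */
	int		header = 2 + 2 + 1;		/* xl_heap_header */
	int		fixed = insert + header;	/* 24 bytes */

	printf("fixed part: %d bytes\n", fixed);
	printf("with an extra Oid: +%.1f%%\n", 100.0 * 4 / fixed);
	return 0;
}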

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm just talking out of my rear end here because I don't know what the
> real numbers are, but it's far from obvious to me that there's any
> free lunch here.  That having been said, just because indexing
> relfilenode or adding relfilenodes to WAL records is expensive doesn't
> mean we shouldn't do it.  But I think we need to know the price tag
> before we can judge whether to make the purchase.

Certainly, any of these solutions are going to cost us somewhere ---
either up-front cost or more expensive (and less reliable?) changeset
extraction, take your choice.  I will note that somehow tablespaces got
put in despite having to add 4 bytes to every WAL record for that
feature, which was probably of less use than logical changeset
extraction will be.

But to tell the truth, I'm mostly exercised about the non-unique
syscache.  I think that's simply a *bad* idea.
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Fri, Jun 28, 2013 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm just talking out of my rear end here because I don't know what the
>> real numbers are, but it's far from obvious to me that there's any
>> free lunch here.  That having been said, just because indexing
>> relfilenode or adding relfilenodes to WAL records is expensive doesn't
>> mean we shouldn't do it.  But I think we need to know the price tag
>> before we can judge whether to make the purchase.
>
> Certainly, any of these solutions are going to cost us somewhere ---
> either up-front cost or more expensive (and less reliable?) changeset
> extraction, take your choice.  I will note that somehow tablespaces got
> put in despite having to add 4 bytes to every WAL record for that
> feature, which was probably of less use than logical changeset
> extraction will be.

Right.  I actually think we booted that one.  The database ID is a
constant for most people.  The tablespace ID is not technically
redundant, but in 99.99% of cases you could figure it out from the
database ID + relation ID.  The relation ID is where 99% of the
entropy is, but it probably only has 8-16 bits of entropy in most
real-world use cases.  If we were doing this over we might want to
think about storing a proxy for the relfilenode rather than the
relfilenode itself, but there's not much good crying over it now.

> But to tell the truth, I'm mostly exercised about the non-unique
> syscache.  I think that's simply a *bad* idea.

+1.

I don't think the extra index on pg_class is going to hurt that much,
even if we create it always, as long as we use a purpose-built caching
mechanism for it rather than forcing it through catcache.  The people
who are going to suffer are the ones who create and drop a lot of
temporary tables, but even there I'm not sure how visible the overhead
will be on real-world workloads, and maybe the solution is to work
towards not having permanent catalog entries for temporary tables in
the first place.  In any case, hurting people who use temporary tables
heavily seems better than adding overhead to every
insert/update/delete operation, which will hit all users who are not
read-only.

On the other hand, I can't entirely shake the feeling that adding the
information into WAL would be more reliable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On the other hand, I can't entirely shake the feeling that adding the
> information into WAL would be more reliable.

That feeling has been nagging at me too.  I can't demonstrate that
there's a problem when an ALTER TABLE is in process of rewriting a table
into a new relfilenode number, but I don't have a warm fuzzy feeling
about the reliability of reverse lookups for this.  At the very least
it's going to require some hard-to-verify restriction about how we
can't start doing changeset reconstruction in the middle of a
transaction that's doing DDL.
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Simon Riggs
Date:
On 28 June 2013 17:10, Robert Haas <robertmhaas@gmail.com> wrote:
 
>> But to tell the truth, I'm mostly exercised about the non-unique
>> syscache.  I think that's simply a *bad* idea.
>
> +1.
>
> I don't think the extra index on pg_class is going to hurt that much,
> even if we create it always, as long as we use a purpose-built caching
> mechanism for it rather than forcing it through catcache.

Hmm, does seem like that would be better.
 
> The people
> who are going to suffer are the ones who create and drop a lot of
> temporary tables, but even there I'm not sure how visible the overhead
> will be on real-world workloads, and maybe the solution is to work
> towards not having permanent catalog entries for temporary tables in
> the first place.  In any case, hurting people who use temporary tables
> heavily seems better than adding overhead to every
> insert/update/delete operation, which will hit all users who are not
> read-only.

Thinks...

If we added a trigger that fired a NOTIFY for any new rows in pg_class that relate to non-temporary relations, that would optimise away any overhead for temporary tables, or when no changeset extraction is in progress.

The changeset extraction could build a private hash table to perform the lookup and then LISTEN on a specific channel for changes.

That might work better than an index-plus-syscache.
 
> On the other hand, I can't entirely shake the feeling that adding the
> information into WAL would be more reliable.

I don't really like the idea of requiring the relid on the WAL record. WAL is big enough already and we want people to turn this on, not avoid it.

This is just an index lookup. We do them all the time without any fear of reliability issues.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-28 12:26:52 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On the other hand, I can't entirely shake the feeling that adding the
> > information into WAL would be more reliable.
> 
> That feeling has been nagging at me too.  I can't demonstrate that
> there's a problem when an ALTER TABLE is in process of rewriting a table
> into a new relfilenode number, but I don't have a warm fuzzy feeling
> about the reliability of reverse lookups for this.

I am pretty sure the mapping thing works, but it indeed requires some
complexity. And it's harder to debug, because when you want to understand
what's going on, the relfilenodes involved aren't in the catalog anymore.

> At the very least
> it's going to require some hard-to-verify restriction about how we
> can't start doing changeset reconstruction in the middle of a
> transaction that's doing DDL.

Currently changeset extraction needs to wait (and does so) till it finds
a point where it has seen the start of all in-progress transactions. All
transactions that *commit* after the last partially observed in-progress
transaction finished can be decoded.
To make that point visible for external tools to synchronize -
e.g. pg_dump - it exports the snapshot of exactly the moment when that
last in-progress transaction committed.

So, from what I gather there's a slight leaning towards *not* storing
the relation's oid in the WAL. Which means the concerns about the
uniqueness issues with the syscaches need to be addressed. So far I know
of three solutions:
1) develop a custom caching/mapping module
2) Make sure InvalidOid's (the only possible duplicate) can't end up in
   the syscache by adding a hook that prevents that on the catcache level
3) Make sure that there can't be any duplicates by storing the oid of
   the relation in a mapped relation's relfilenode
Opinions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
Since this discussion seems to have stalled, let me do a quick summary.
The goal of this subset of patches is to allow retroactive lookup of
relations starting from a WAL record.  Currently, the WAL record only
tracks the relfilenode that it affects, so there are two possibilities:

1. we add some way to find out the relation OID from the relfilenode,
2. we augment the WAL record with the relation OID.

Each solution has its drawbacks.  For the former,
* we need a new cache
* we need a new pg_class index
* looking up the relation OID still requires some CPU time, and memory
  to keep the caches in; invalidations need to be run, etc.

For the latter,
* each WAL record would become somewhat bigger.  For WAL records with a
  payload of 25 bytes (say an insert of a tuple which is 25 bytes long)
  this means about 7% overhead.

There are some other issues, but these can be solved.  For instance Tom
doesn't want a syscache on top of a non-unique index, and I agree on
that.  But if we agree on this way forward, we can just go a different
route by keeping a separate cache layer.

So the question is, do we take the overhead of the new index (which
means overhead on DDL operations -- supposedly rare) or do we take the
overhead of larger WAL records (which means overhead on all DML
operations)?

Note we can make either thing apply to only people running logical
replication.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So the question is, do we take the overhead of the new index (which
> means overhead on DDL operations -- supposedly rare) or do we take the
> overhead of larger WAL records (which means overhead on all DML
> operations)?

> Note we can make either thing apply to only people running logical
> replication.

I don't believe you can have or not have an index on pg_class as easily
as all that.  The choice would have to be frozen at initdb time, so
people would have to pay the overhead if they thought there was even a
small possibility that they'd want logical replication later.

Flipping the content of WAL records might not be a terribly simple thing
to do either, but at least in principle it could be done during a
postmaster restart, without initdb.
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-01 14:16:55 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > So the question is, do we take the overhead of the new index (which
> > means overhead on DDL operations -- supposedly rare) or do we take the
> > overhead of larger WAL records (which means overhead on all DML
> > operations)?
> 
> > Note we can make either thing apply to only people running logical
> > replication.
> 
> I don't believe you can have or not have an index on pg_class as easily
> as all that.  The choice would have to be frozen at initdb time, so
> people would have to pay the overhead if they thought there was even a
> small possibility that they'd want logical replication later.

It should be possible to create the index in a single database when we
start logical replication in that database? Running the index creation
with a fixed oid shouldn't require too much code. The oid won't be
reused by other pg_class entries since it would be a system one.
Alternatively we could always create the index's pg_class/index entry
but mark it as !indislive when logical replication isn't active for that
database. Then activating it would just require rebuilding that
index.

But then, I am not fully convinced that's worth the trouble since I
don't think pg_class index maintenance is the pain spot in DDL atm.

> Flipping the content of WAL records might not be a terribly simple thing
> to do either, but at least in principle it could be done during a
> postmaster restart, without initdb.

The main patch combines various booleans in the heap wal records into a
flags variable, so there should be enough space to keep track of it
without increasing size. Makes size calculations a bit more annoying
though as we use the xlog record length to calculate the heap tuple's
length, but that's not a large problem.
So we could just set the XLOG_HEAP_CONTAINS_CLASSOID flag if
wal_level >= WAL_LEVEL_LOGICAL. Wal decoding can then throw a tantrum
if it finds a record without it, and we're done.
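
To illustrate, here's a standalone sketch of that scheme -- note that the
names (XLH_CONTAINS_CLASSOID etc.) are invented for the example, not
taken from the patch:

#include <stdint.h>
#include <stdio.h>

typedef uint8_t uint8;
typedef uint32_t Oid;

/* hypothetical flag bits -- not the committed record format */
#define XLH_ALL_VISIBLE_CLEARED	0x01
#define XLH_CONTAINS_CLASSOID	0x02

typedef struct
{
	uint8		flags;	/* replaces the old individual booleans */
	/* tid, tuple header etc. elided; if XLH_CONTAINS_CLASSOID is set,
	 * the relation's pg_class OID is appended after the fixed part */
} xl_heap_insert_sketch;

static size_t
record_size(uint8 flags)
{
	size_t		sz = sizeof(xl_heap_insert_sketch);

	if (flags & XLH_CONTAINS_CLASSOID)
		sz += sizeof(Oid);	/* only paid at wal_level >= logical */
	return sz;
}

int
main(void)
{
	printf("without OID: %zu bytes, with OID: %zu bytes\n",
		   record_size(0), record_size(XLH_CONTAINS_CLASSOID));
	return 0;
}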

We could even make that per database, but that seems to be something for
the future.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Simon Riggs
Date:
On 27 June 2013 23:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
> Exactly what is the argument that says performance of this
> function is sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?

I think we all agree on changing the syscache.

I'm not clear why adding a new permanent index to pg_class is such a problem. It's going to be a very thin index. I'm trying to imagine a use case that has pg_class index maintenance as a major part of its workload and I can't. An extra index on pg_attribute and I might agree with you. The pg_class index would only be a noticeable % of catalog rows for very thin temp tables, but would still even then be small; that isn't even necessary work since we all agree that temp table overheads could and should be optimised away somewhere. So blocking a new index because of that sounds strange.

What issues do you foresee? How can we test them?

Or perhaps we should just add the index and see if we later discover a measurable problem workload?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Hm. There were some issues with the test_logical_decoding
> > Makefile not cleaning up the regression installation properly.
> > Which might have caused the issue.
> >
> > Could you try after applying the patches and executing a clean
> > and then rebuild?
>
> Tried, and problem persists.
>
> > Otherwise, could you try applying my git tree so we are sure we
> > test the same thing?
> >
> > $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> > $ git fetch af
> > $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> > $ ./configure ...
> > $ make
>
> Tried that, too, and problem persists.  The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

Ok. I think I have a slight idea what's going on. Could you check
whether recompiling with -O0 "fixes" the issue?

There's something strange going on here; not sure whether it's just a
bug that's hidden by either not doing optimizations or by adding more
elog()s, or whether it's a compiler bug.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-05 14:03:56 +0200, Andres Freund wrote:
> On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> > Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > > Hm. There were some issues with the test_logical_decoding
> > > Makefile not cleaning up the regression installation properly.
> > > Which might have caused the issue.
> > >
> > > Could you try after applying the patches and executing a clean
> > > and then rebuild?
> >
> > Tried, and problem persists.
> >
> > > Otherwise, could you try applying my git tree so we are sure we
> > > test the same thing?
> > >
> > > $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> > > $ git fetch af
> > > $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> > > $ ./configure ...
> > > $ make
> >
> > Tried that, too, and problem persists.  The log shows the last
> > commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
>
> Ok. I think I have a slight idea what's going on. Could you check
> whether recompiling with -O0 "fixes" the issue?
>
> There's something strange going on here; not sure whether it's just a
> bug that's hidden by either not doing optimizations or by adding more
> elog()s, or whether it's a compiler bug.

Ok. It was supreme stupidity on my end. Sorry for the time you spent on
it.

Some versions of gcc (and probably other compilers) were removing
sections of code when optimizing because the code was doing undefined
things. Parts of the rdata chain were allocated locally inside an
if (needs_key) block. Which obviously is utterly bogus... A warning
would have been nice though.
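
Boiled down to a standalone (deliberately broken) example -- the names
are invented, but the scoping mistake is the same one:

#include <stdio.h>

/* simplified stand-in for the XLogRecData chain used when assembling
 * a WAL record */
typedef struct RecData
{
	const char *data;
	int			len;
	struct RecData *next;
} RecData;

static void
build_chain(RecData *head, int needs_key)
{
	if (needs_key)
	{
		RecData		key;	/* lifetime ends at the closing brace! */

		key.data = "key";
		key.len = 3;
		key.next = NULL;
		head->next = &key;	/* dangling pointer once the block exits */
	}
	/* Using head->next from here on is undefined behaviour, so the
	 * optimizer may legitimately drop the stores above.  The fix is to
	 * declare 'key' at function scope instead. */
}

int
main(void)
{
	RecData		head = {"hdr", 3, NULL};

	build_chain(&head, 1);
	printf("head: %.*s\n", head.len, head.data);	/* head itself is fine */
	return 0;
}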

Fix pushed and attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Steve Singer
Date:
On 07/05/2013 08:03 AM, Andres Freund wrote:
> On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
>> Tried that, too, and problem persists.  The log shows the last commit 
>> on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb. 
> Ok. I think I have a slight idea what's going on. Could you check
> whether recompiling with -O0 "fixes" the issue?
>
> There's something strange going on here; not sure whether it's just a
> bug that's hidden by either not doing optimizations or by adding more
> elog()s, or whether it's a compiler bug.

I am getting the same test failure Kevin is seeing.
This is on a x64 Debian wheezy machine with
gcc (Debian 4.7.2-5) 4.7.2

Building with -O0 results in passing tests.



> Greetings,
>
> Andres Freund
>




Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-05 09:28:45 -0400, Steve Singer wrote:
> On 07/05/2013 08:03 AM, Andres Freund wrote:
> >On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> >>Tried that, too, and problem persists.  The log shows the last commit on
> >>your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
> >Ok. I think I have a slight idea what's going on. Could you check
> >whether recompiling with -O0 "fixes" the issue?
> >
> >There's something strange going on here; not sure whether it's just a
> >bug that's hidden by either not doing optimizations or by adding more
> >elog()s, or whether it's a compiler bug.
> 
> I am getting the same test failure Kevin is seeing.
> This is on a x64 Debian wheezy machine with
> gcc (Debian 4.7.2-5) 4.7.2
> 
> Building with -O0 results in passing tests.

Does the patch from
http://archives.postgresql.org/message-id/20130705132513.GB11640%40awork2.anarazel.de
or the git tree (which is rebased ontop of the mvcc catalog commit from
robert which needs some changes) fix it, even with optimizations?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Steve Singer
Date:
On 07/05/2013 09:34 AM, Andres Freund wrote:
> On 2013-07-05 09:28:45 -0400, Steve Singer wrote:
>> On 07/05/2013 08:03 AM, Andres Freund wrote:
>>> On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
>>>> Tried that, too, and problem persists.  The log shows the last commit on
>>>> your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
>>> Ok. I think I have a slight idea what's going on. Could you check
>>> whether recompiling with -O0 "fixes" the issue?
>>>
>>> There's something strange going on here; not sure whether it's just a
>>> bug that's hidden by either not doing optimizations or by adding more
>>> elog()s, or whether it's a compiler bug.
>> I am getting the same test failure Kevin is seeing.
>> This is on a x64 Debian wheezy machine with
>> gcc (Debian 4.7.2-5) 4.7.2
>>
>> Building with -O0 results in passing tests.
> Does the patch from
> http://archives.postgresql.org/message-id/20130705132513.GB11640%40awork2.anarazel.de
> or the git tree (which is rebased ontop of the mvcc catalog commit from
> robert which needs some changes) fix it, even with optimizations?


Yes, with your latest git tree the tests pass with -O2.


> Greetings,
>
> Andres Freund
>




Re: changeset generation v5-01 - Patches & git tree

From
Steve Singer
Date:
On 06/14/2013 06:51 PM, Andres Freund wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4


We discussed issues related to passing options to the plugins a number 
of months ago ( 
http://www.postgresql.org/message-id/20130129015732.GA24238@awork2.anarazel.de)

I'm still having issues with the syntax you describe there.


START_LOGICAL_REPLICATION "1" 0/0 ("foo","bar") unexpected termination of replication stream: ERROR:  foo requires a 
parameter

START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")
"START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")": ERROR:  syntax error

START_LOGICAL_REPLICATION "1" 0/0 ("foo")
works okay



Steve




> On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
>> Overview of the attached patches:
>> 0001: indirect toast tuples; required but submitted independently
>> 0002: functions for testing; not required,
>> 0003: (tablespace, filenode) syscache; required
>> 0004: RelationMapFilenodeToOid: required, simple
>> 0005: pg_relation_by_filenode() function; not required but useful
>> 0006: Introduce InvalidCommandId: required, simple
>> 0007: Adjust Satisfies* interface: required, mechanical,
>> 0008: Allow walsender to attach to a database: required, needs review
>> 0009: New GetOldestXmin() parameter; required, pretty boring
>> 0010: Log xl_running_xact regularly in the bgwriter: required
>> 0011: make fsync_fname() public; required, needs to be in a different file
>> 0012: Relcache support for a Relation's primary key: required
>> 0013: Actual changeset extraction; required
>> 0014: Output plugin demo; not required (except for testing) but useful
>> 0015: Add pg_receivellog program: not required but useful
>> 0016: Add test_logical_decoding extension; not required, but contains
>>        the tests for the feature. Uses 0014
>> 0017: Snapshot building docs; not required
> Version v5-01 attached
>
> Greetings,
>
> Andres Freund
>
>
>




Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-05 11:33:20 -0400, Steve Singer wrote:
> On 06/14/2013 06:51 PM, Andres Freund wrote:
> >The git tree is at:
> >git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
>
>http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
> 
> 
> We discussed issues related to passing options to the plugins a number of
> months ago ( http://www.postgresql.org/message-id/20130129015732.GA24238@awork2.anarazel.de)
> 
> I'm still having issues with the syntax you describe there.
> 
> 
> START_LOGICAL_REPLICATION "1" 0/0 ("foo","bar")
>  unexpected termination of replication stream: ERROR:  foo requires a
> parameter

I'd guess that's coming from your output plugin? You're using
defGetString() on DefElem without a value?

> START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")

Yes, the option *names* are identifiers, together with plugin & slot
names. The passed values need to be SCONSTs atm
(src/backend/replication/repl_gram.y):

plugin_opt_elem:
		IDENT plugin_opt_arg
			{
				$$ = makeDefElem($1, $2);
			}
	;

plugin_opt_arg:
		SCONST						{ $$ = (Node *) makeString($1); }
		| /* EMPTY */				{ $$ = NULL; }
	;
 

So, it would have to be:
START_LOGICAL_REPLICATION "1" 0/0 ("foo" 'bar blub frob', "sup" 'star', "noarg")

Now that's not completely obvious, I admit :/. Better suggestions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-06-28 21:47:47 +0200, Andres Freund wrote:
> So, from what I gather there's a slight leaning towards *not* storing
> the relation's oid in the WAL. Which means the concerns about the
> uniqueness issues with the syscaches need to be addressed. So far I know
> of three solutions:
> 1) develop a custom caching/mapping module
> 2) Make sure InvalidOid's (the only possible duplicate) can't end up in
>    the syscache by adding a hook that prevents that on the catcache level
> 3) Make sure that there can't be any duplicates by storing the oid of
>    the relation in a mapped relation's relfilenode

So, here's 4 patches:
1) add RelationMapFilenodeToOid()
2) Add pg_class index on (reltablespace, relfilenode)
3a) Add custom cache that maps from filenode to oid
3b) Add catcache 'filter' that ensures the cache stays unique and use
    that for the mapping
4) Add pg_relation_by_filenode() and use it in a regression test

3b) adds an optional 'filter' attribute to struct cachedesc in
    syscache.c which is then passed to catcache.c. If it's present,
    catcache.c uses it - after checking for a match in the cache - to
    check whether the queried-for value should possibly end up in the
    cache. If not, it stores a whiteout entry as currently already done
    for nonexistent entries.
    It also reorders some catcache.h struct attributes to make sure
    we're not growing them. Might make sense to apply that
    independently, those are rather heavily used.
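
To illustrate the idea behind 3b), a small standalone sketch -- the names
here are hypothetical, the real filter is wired through syscache.c's
cachedesc:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
#define InvalidOid	((Oid) 0)

/* hypothetical shape of such a filter hook */
typedef bool (*CatCacheFilter) (Oid key);

/* Mapped relations store relfilenode = InvalidOid in pg_class, so
 * keeping those out of the cache makes the remaining keys unique. */
static bool
relfilenode_is_cacheable(Oid relfilenode)
{
	return relfilenode != InvalidOid;
}

int
main(void)
{
	CatCacheFilter filter = relfilenode_is_cacheable;

	printf("1259 cacheable: %d\n", filter(1259));
	printf("InvalidOid cacheable: %d\n", filter(InvalidOid));
	return 0;
}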

I slightly prefer 3b) because it's smaller, what's your opinions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: changeset generation v5-01 - Patches & git tree

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> 3b) Add catcache 'filter' that ensures the cache stays unique and use
>     that for the mapping

> I slightly prefer 3b) because it's smaller, what's your opinions?

This is just another variation on the theme of kluging the catcache to
do something it shouldn't.  You're still building a catcache on a
non-unique index, and that is going to lead to trouble.

(I'm a bit surprised that there is no Assert in catcache.c checking
that the index nominated to support a catcache is unique ...)
        regards, tom lane



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-07 15:43:17 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > 3b) Add catcache 'filter' that ensures the cache stays unique and use
> >     that for the mapping
> 
> > I slightly prefer 3b) because it's smaller, what's your opinions?
> 
> This is just another variation on the theme of kluging the catcache to
> do something it shouldn't.  You're still building a catcache on a
> non-unique index, and that is going to lead to trouble.

I don't think the lurking dangers really are present. The index
essentially *is* unique since we filter away anything non-unique. The
catcache code hardly can be confused by tuples it never sees. That would
even work if we started preloading catcaches by doing scans of the
entire underlying relation or by caching all of a page when reading one
of its tuples.

I can definitely see that there are "aesthetical" reasons against doing
3b), that's why I've also done 3a). So I'll chalk you up to voting for
that...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Sorry for the delay in reviewing this.  I must make sure never to
take another vacation during a commitfest -- the backlog upon
return is a killer....

Kevin Grittner <kgrittn@ymail.com> wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:

>> Otherwise, could you try applying my git tree so we are sure we
>> test the same thing?
>>
>> $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
>> $ git fetch af
>> $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
>> $ ./configure ...
>> $ make
>
> Tried that, too, and problem persists.  The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

The good news: the regression tests now work for me, and I'm back
on testing this at a high level.

The bad news:

(1)  The code checked out from that branch does not merge with
master.  Not surprisingly, given the recent commits, xlog.c is a
problem.  Is there another branch I should now be using?  If not,
please let me know when I can test with something that applies on
top of the master branch.

(2)  An initial performance test didn't look very good.  I will be
running a more controlled test to confirm but the logical
replication of a benchmark with a lot of UPDATEs of compressed text
values seemed to suffer with the logical replication turned on.
Any suggestions or comments on that front, before I run the more
controlled benchmarks?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-10 12:21:23 -0700, Kevin Grittner wrote:
> Sorry for the delay in reviewing this.  I must make sure never to
> take another vacation during a commitfest -- the backlog upon
> return is a killer....

Heh. Yes. Been through it before...

> Kevin Grittner <kgrittn@ymail.com> wrote:
> > Andres Freund <andres@2ndquadrant.com> wrote:
> 
> >> Otherwise, could you try applying my git tree so we are sure we
> >> test the same thing?
> >>
> >> $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> >> $ git fetch af
> >> $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> >> $ ./configure ...
> >> $ make
> >
> > Tried that, too, and problem persists.  The log shows the last
> > commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
> 
> The good news: the regression tests now work for me, and I'm back
> on testing this at a high level.

> The bad news:
> 
> (1)  The code checked out from that branch does not merge with
> master.  Not surprisingly, given the recent commits, xlog.c is a
> problem.  Is there another branch I should now be using?  If not,
> please let me know when I can test with something that applies on
> top of the master branch.

That one is actually relatively easy to resolve. The mvcc catalog scan
patch is slightly harder. I've pushed an updated patch that fixes the
latter in a slightly not-so-nice way. I am not sure yet what the final
fix for that is going to look like; it depends on whether we will get rid
of SnapshotNow altogether...

I'll push my local tree with that fixed in a sec.

> (2)  An initial performance test didn't look very good.  I will be
> running a more controlled test to confirm but the logical
> replication of a benchmark with a lot of UPDATEs of compressed text
> values seemed to suffer with the logical replication turned on.
> Any suggestions or comments on that front, before I run the more
> controlled benchmarks?

Hm. There theoretically shouldn't actually be anything added in that
path. Could you roughly sketch what that test is doing? Do you actually
stream those changes out or did you just turn on wal_level=logical?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:

>> (2)  An initial performance test didn't look very good.  I will be
>> running a more controlled test to confirm but the logical
>> replication of a benchmark with a lot of UPDATEs of compressed text
>> values seemed to suffer with the logical replication turned on.
>> Any suggestions or comments on that front, before I run the more
>> controlled benchmarks?
>
> Hm. There theoretically shouldn't actually be anything added in that
> path. Could you roughly sketch what that test is doing? Do you actually
> stream those changes out or did you just turn on wal_level=logical?

It was an update of every row in a table of 720000 rows, with
each row updated by primary key using a separate UPDATE statement,
modifying a large text column with a lot of repeating characters
(so compressed well).  I got a timing on a master build and I got a
timing with the patch in the environment used by
test_logical_decoding.  It took several times as long in the latter
run, but it was very much a preliminary test in preparation for
getting real numbers.  (I'm sure you know how much work it is to
set up for a good run of tests.)  I'm not sure that (for example)
the synchronous_commit setting was the same, which could matter a
lot.  I wouldn't put a lot of stock in it until I can re-create it
under a much more controlled test.

The one thing about the whole episode that gave me pause was that
the compression and decompression routines were very high on the
`perf top` output in the patched run and way down the list on the
run based on master.  I don't have a ready explanation for that,
unless your branch was missing a recent commit for speeding
compression which was present on master.  It might be worth
checking that you're not detoasting more often than you need.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-10 15:14:58 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > Kevin Grittner <kgrittn@ymail.com> wrote:
> 
> >> (2)  An initial performance test didn't look very good.  I will be
> >> running a more controlled test to confirm but the logical
> >> replication of a benchmark with a lot of UPDATEs of compressed text
> >> values seemed to suffer with the logical replication turned on.
> >> Any suggestions or comments on that front, before I run the more
> >> controlled benchmarks?
> >
> > Hm. There theoretically shouldn't actually be anything added in that
> > path. Could you roughly sketch what that test is doing? Do you actually
> > stream those changes out or did you just turn on wal_level=logical?
> 
> It was an update of every row in a table of 720000 rows, with
> each row updated by primary key using a separate UPDATE statement,
> modifying a large text column with a lot of repeating characters
> (so compressed well).  I got a timing on a master build and I got a
> timing with the patch in the environment used by
> test_logical_decoding.  It took several times as long in the latter
> run, but it was very much a preliminary test in preparation for
> getting real numbers.  (I'm sure you know how much work it is to
> set up for a good run of tests.)  I'm not sure that (for example)
> the synchronous_commit setting was the same, which could matter a
> lot.  I wouldn't put a lot of stock in it until I can re-create it
> under a much more controlled test.

So you didn't explicitly start anything to consume those changes?
I.e. using pg_receivellog or SELECT * FROM
start/init_logical_replication(...)?

Any chance there still was an old replication slot around?
SELECT * FROM pg_stat_logical_decoding;
should show them. But theoretically the make check in
test_logical_decoding should finish without one active...

> The one thing about the whole episode that gave me pause was that
> the compression and decompression routines were very high on the
> `perf top` output in the patched run and way down the list on the
> run based on master.

That's interesting. Unless there's something consuming the changestream
and the output plugin does something that actually requests
decompression of the Datums there shouldn't be *any* added/removed calls
to toast (de-)compression...
While consuming the changes there could be ReorderBufferToast* calls in
the profile. I haven't yet seem them in profiles, but that's not saying
all that much.

So:
> I don't have a ready explanation for that, unless your branch was
> missing a recent commit for speeding compression which was present on
> master.

It didn't have 031cc55bbea6b3a6b67c700498a78fb1d4399476 - but I can't
really imagine that making *such* a big difference. But maybe you hit
some sweet spot with the data?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> Any chance there still was an old replication slot around?

It is quite likely that there was.



--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Sun, Jul 7, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-07-07 15:43:17 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>> > 3b) Add catcache 'filter' that ensures the cache stays unique and use
>> >     that for the mapping
>>
>> > I slightly prefer 3b) because it's smaller, what's your opinions?
>>
>> This is just another variation on the theme of kluging the catcache to
>> do something it shouldn't.  You're still building a catcache on a
>> non-unique index, and that is going to lead to trouble.
>
> I don't think the lurking dangers really are present. The index
> essentially *is* unique since we filter away anything non-unique. The
> catcache code hardly can be confused by tuples it never sees. That would
> even work if we started preloading catcaches by doing scans of the
> entire underlying relation or by caching all of a page when reading one
> of its tuples.
>
> I can definitely see that there are "aesthetical" reasons against doing
> 3b), that's why I've also done 3a). So I'll chalk you up to voting for
> that...

I also vote for (3a).  I did a quick once over of 1, 2, and 3a and
they look reasonable.  Barring strenuous objections, I'd like to go
ahead and commit these, or perhaps an updated version of them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Tue, Jul 16, 2013 at 9:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jul 7, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-07-07 15:43:17 -0400, Tom Lane wrote:
>>> Andres Freund <andres@2ndquadrant.com> writes:
>>> > 3b) Add catcache 'filter' that ensures the cache stays unique and use
>>> >     that for the mapping
>>>
>>> > I slightly prefer 3b) because it's smaller, what's your opinions?
>>>
>>> This is just another variation on the theme of kluging the catcache to
>>> do something it shouldn't.  You're still building a catcache on a
>>> non-unique index, and that is going to lead to trouble.
>>
>> I don't think the lurking dangers really are present. The index
>> essentially *is* unique since we filter away anything non-unique. The
>> catcache code hardly can be confused by tuples it never sees. That would
>> even work if we started preloading catcaches by doing scans of the
>> entire underlying relation or by caching all of a page when reading one
>> of its tuples.
>>
>> I can definitely see that there are "aesthetical" reasons against doing
>> 3b), that's why I've also done 3a). So I'll chalk you up to voting for
>> that...
>
> I also vote for (3a).  I did a quick once over of 1, 2, and 3a and
> they look reasonable.  Barring strenuous objections, I'd like to go
> ahead and commit these, or perhaps an updated version of them.

Hearing no objections, I have done this.  Per off-list discussion with
Andres, I also included patch 4, which gives us regression test
coverage for this code, and have fixed a few bugs and a bunch of
stylistic things that bugged me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Robert Haas
Date:
On Fri, Jun 14, 2013 at 6:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
>
> On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
>> Overview of the attached patches:
>> 0001: indirect toast tuples; required but submitted independently
>> 0002: functions for testing; not required,
>> 0003: (tablespace, filenode) syscache; required
>> 0004: RelationMapFilenodeToOid: required, simple
>> 0005: pg_relation_by_filenode() function; not required but useful
>> 0006: Introduce InvalidCommandId: required, simple
>> 0007: Adjust Satisfies* interface: required, mechanical,
>> 0008: Allow walsender to attach to a database: required, needs review
>> 0009: New GetOldestXmin() parameter; required, pretty boring
>> 0010: Log xl_running_xact regularly in the bgwriter: required
>> 0011: make fsync_fname() public; required, needs to be in a different file
>> 0012: Relcache support for a Relation's primary key: required
>> 0013: Actual changeset extraction; required
>> 0014: Output plugin demo; not required (except for testing) but useful
>> 0015: Add pg_receivellog program: not required but useful
>> 0016: Add test_logical_decoding extension; not required, but contains
>>       the tests for the feature. Uses 0014
>> 0017: Snapshot building docs; not required

I've now also committed patch #7 from this series.  My earlier commit
fulfilled the needs of patches #3, #4, and #5; and somewhat longer ago
I committed #1.  I am not entirely convinced of the necessity or
desirability of patch #6, but as of now I haven't studied the issues
closely.  Patch #2 does not seem useful in isolation; it adds new
regression-testing stuff but doesn't use it anywhere.

I doubt that any of the remaining patches (#8-#17) can be applied
separately without understanding the shape of the whole patch set, so
I think I, or someone else, will need to set aside more time for
detailed review before proceeding further with this patch set.  I
suggest that we close out the CommitFest entry for this patch set one
way or another, as there is no way we're going to get the whole thing
done under the auspices of CF1.

I'll try to find some more time to spend on this relatively soon, but
I think this is about as far as I can take this today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-07-22 13:50:08 -0400, Robert Haas wrote:
> On Fri, Jun 14, 2013 at 6:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The git tree is at:
> > git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
> >
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
> >
> > On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
> >> Overview of the attached patches:
> >> 0001: indirect toast tuples; required but submitted independently
> >> 0002: functions for testing; not required,
> >> 0003: (tablespace, filenode) syscache; required
> >> 0004: RelationMapFilenodeToOid: required, simple
> >> 0005: pg_relation_by_filenode() function; not required but useful
> >> 0006: Introduce InvalidCommandId: required, simple
> >> 0007: Adjust Satisfies* interface: required, mechanical,
> >> 0008: Allow walsender to attach to a database: required, needs review
> >> 0009: New GetOldestXmin() parameter; required, pretty boring
> >> 0010: Log xl_running_xact regularly in the bgwriter: required
> >> 0011: make fsync_fname() public; required, needs to be in a different file
> >> 0012: Relcache support for a Relation's primary key: required
> >> 0013: Actual changeset extraction; required
> >> 0014: Output plugin demo; not required (except for testing) but useful
> >> 0015: Add pg_receivellog program: not required but useful
> >> 0016: Add test_logical_decoding extension; not required, but contains
> >>       the tests for the feature. Uses 0014
> >> 0017: Snapshot building docs; not required
> 
> I've now also committed patch #7 from this series.  My earlier commit
> fulfilled the needs of patches #3, #4, and #5; and somewhat longer ago
> I committed #1.

Thanks!

> I am not entirely convinced of the necessity or
> desirability of patch #6, but as of now I haven't studied the issues
> closely.

Fair enough. It's certainly possible to work around not having it, but
it seems cleaner to introduce the notion of an invalid CommandId like we
have for transaction ids et al.
Allowing 2^32-2 instead of 2^32-1 subtransactions doesn't seem like a
problem to me ;)

> Patch #2 does not seem useful in isolation; it adds new
> regression-testing stuff but doesn't use it anywhere.

Yes. I found it useful to test stuff around making replication
synchronous or such, but while I think we should have a facility like it
in core for both logical and physical replication, I don't think this
patch is ready for prime time due to its busy looping. I've even marked
it as such above ;)
My first idea to properly implement that seems to be to reuse the
syncrep infrastructure but that doesn't look trivial.

> I doubt that any of the remaining patches (#8-#17) can be applied
> separately without understanding the shape of the whole patch set, so
> I think I, or someone else, will need to set aside more time for
> detailed review before proceeding further with this patch set.  I
> suggest that we close out the CommitFest entry for this patch set one
> way or another, as there is no way we're going to get the whole thing
> done under the auspices of CF1.

Generally agreed. The biggest chunk of the code is in #13 anyway...

Some may be applied independently:

> 0010: Log xl_running_xact regularly in the bgwriter: required
  Should be useful independently since it can significantly speed up
  startup of physical replicas. On many systems checkpoint_timeout will
  be set to an hour, which can make the time till a standby gets
  consistent quite high, since that will be the first time it sees an
  xl_running_xacts again.

> 0011: make fsync_fname() public; required, needs to be in a different file
  Isn't in the shape for it atm, but could be applied as an independent
  infrastructure patch. And it should be easy enough to clean it up.

> 0012: Relcache support for a Relation's primary key: required
  Might actually be a good idea independently as well. E.g. the
  materialized key patch could use the information that there's a
  candidate key around to avoid a good bit of useless work.
 


> I'll try to find some more time to spend on this relatively soon, but
> I think this is about as far as I can take this today.

Was pretty helpful already, so ... ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Alvaro Herrera
Date:
Andres Freund wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

I gave this recently rebased branch a skim.  In general, the separation 
between decode.c/reorderbuffer.c/snapbuild.c seems a lot nicer now than
on previous iterations -- good job there.

Here are some quick notes I took while reading the patch itself.  I
haven't gone through it really carefully, yet.


- I wonder whether DecodeCommit and DecodeAbort should really be a single
  routine.  Right now, the former might call the latter; and the latter is
  aware of this.  Seems awkward.

- We skip insert/update/delete if not my database Id; however, we don't
  skip commit in the same case.  If there are two walrecvrs on a cluster,
  on different databases, does this lead to us trying to remove files
  twice, if a xact commits which deleted some files?  Is this a problem?
  Should we try to skip such database-specific actions in global WAL
  records?

- There's rmgr-specific knowledge in decode.c.  I wonder if, similar to
  redo and desc routines, that shouldn't instead be pluggable functions
  for each rmgr.

- What's with ReorderBufferRestoreCleanup()?  Shouldn't it be in logical.c?

- reorderbuffer.c does several different things.  Can it be split?
  Perhaps in pieces such as
  * stuff to manage memory (slab cache thingies)
  * TXN iterator
  * other logically separate parts?
  * the rest

- Having to expose LocalExecuteInvalidationMessage() looks awkward.  Is
  there another way?

- I think we need a better name for "treat_as_catalog_table" (and
  RelationIsTreatedAsCatalogTable).  Maybe replication_catalog or
  something similar?

- Don't do this:
  + * RecentGlobal(Data)?Xmin is initialized to InvalidTransactionId, to ensure that no
  because later greps for RecentGlobalDataXmin and RecentGlobalXmin will
  fail to find it.  It seems better to spell out both names, so
  "RecentGlobalDataXmin and RecentGlobalXmin are initialized to ..."

- the pg_receivellog command line is strange.  Apparently I need one or
  more of --start, --init, --stop; but if stop, then the other two must
  not be present; and if startpos, then init and stop cannot be
  specified.  (There's a typo there that says "cannot combine with
  --start" when it really means "cannot combine with --stop", BTW.)  I
  think it would make more sense to have init, start, stop be commands,
  in pg_ctl's spirit, so there would be no double-dash.  IOW
  SOMEPATH/pg_receivellog --startpos=123 start and so on.  Also, we need
  SGML docs for this new utility.
 


Any particular reason for removing this line:
-/* Get a new XLogReader */
+extern XLogReaderState *XLogReaderAllocate(XLogPageReadCB pagereadfunc,
+                                           void *private_data);


Typo here (2n*d*Quadrant):
+= Snapshot Building =
+:author: Andres Freund, 2nQuadrant Ltd



I don't see the point of XLogRecordBuffer.record_data; we already have a
pointer to the XLogRecord, and the data can readily be obtained using
XLogRecGetData.  So why provide the same thing twice?  It seems to me
that if instead of passing the XLogRecordBuffer we just provide the
XLogRecord, and separately the "origptr" where needed, we could avoid
having to expose the XLogRecordBuffer stuff unnecessarily.
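
For reference, the macro is nothing more than pointer arithmetic past the
fixed record header (quoting from the current xlog.h -- double-check the
exact definition):

#define XLogRecGetData(record)	((char *) (record) + SizeOfXLogRecord)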


In this comment:
+ * FIXME: We need something resembling the real SnapshotNow to handle things
+ * like enum lookups from indices correctly.
what do we need to consider in light of the new comment proposed by Robert in
CA+TgmobvTjRj_doXxQ0wgA1a1JLYPVYqtR3m+Cou_ousabnmXg@mail.gmail.com?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: changeset generation v5-01 - Patches & git tree

From
Andres Freund
Date:
On 2013-08-27 11:32:30 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > The git tree is at:
> > git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
> >
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
> 
> I gave this recently rebased branch a skim.  In general, the separation 
> between decode.c/reorderbuffer.c/snapbuild.c seems a lot nicer now than
> on previous iterations -- good job there.

Thanks for having a look!

> Here are some quick notes I took while reading the patch itself.  I
> haven't gone through it really carefully, yet.
> 
> 
> - I wonder whether DecodeCommit and DecodeAbort should really be a single
>   routine.  Right now, the former might call the latter; and the latter is
>   aware of this.  Seems awkward.

Yes, I am not happy with that either. I'll play with combining them and
check whether that looks better.

> - We skip insert/update/delete if not my database Id; however, we don't skip
>   commit in the same case.  If there are two walrecvrs on a cluster, on
>   different databases, does this lead to us trying to remove files
>   twice, if a xact commits which deleted some files?  Is this a problem?
>   Should we try to skip such database-specific actions in global
>   WAL records?

Hm. We should be able to skip it for long commit records at least. I
think I lost that code during the unification.

There's no danger of removing anything global afaics since we're not
replaying using the original replay routines and all the slot/sender
specific stuff has unique names.

> - There's rmgr-specific knowledge in decode.c.  I wonder if, similar to
>   redo and desc routines, that shouldn't instead be pluggable functions
>   for each rmgr.

I don't think that's a good idea. I've quickly played with it before and
it doesn't seem to end happy. It would require opening up more
semi-public interfaces and in the end, we're only interested in in-core
stuff. Even if it were possible to add new indexes by plugging new
rmgrs, we wouldn't care.

> - What's with ReorderBufferRestoreCleanup()?  Shouldn't it be in logical.c?

No, that's just for removing ondisk data at the end of a
transaction. I'll improve the comment.

> - reorderbuffer.c does several different things.  Can it be split?
>   Perhaps in pieces such as
>   * stuff to manage memory (slab cache thingies)
>   * TXN iterator
>   * other logically separate parts?
>   * the rest

Hm. I don't really see much point in splitting it along those
lines. None of those really makes sense without the other parts and the
file isn't *that* huge.

> - Having to expose LocalExecuteInvalidationMessage() looks awkward.  Is there
>   another way?

Hm. I don't immediately see any way. We need to execute invalidation
messages just within one backend. There just is no exposed functionality
for that yet since it wasn't needed so far. We could expose something
like LocalExecuteInvalidationMessage*s*() instead of doing the loop in
reorderbuffer.c, but that's about it.

> - I think we need a better name for "treat_as_catalog_table" (and
>   RelationIsTreatedAsCatalogTable).  Maybe replication_catalog or
>   something similar?

I think we're going to end up needing that for more than just
replication, so I'd like to keep replication out of the name. I don't
like the current name either though, so any other ideas?

> - Don't do this:
>   + * RecentGlobal(Data)?Xmin is initialized to InvalidTransactionId, to ensure that no
>   because later greps for RecentGlobalDataXmin and RecentGlobalXmin will
>   fail to find it.  It seems better to spell both names, so
>   "RecentGlobalDataXmin and RecentGlobalXmin are initialized to ..."

Ok.

> - the pg_receivellog command line is strange.  Apparently I need one or
>   more of --start,--init,--stop, but if stop, then the other two must
>   not be present; and if startpos, then init and stop cannot be
>   specified.  (There's a typo there that says "cannot combine with
>   --start" when it really means "cannot combine with --stop", BTW).  I
>   think this would make more sense to have init,start,stop be commands,
>   in pg_ctl's spirit; so there would be no double-dash.  IOW
>     SOMEPATH/pg_receivellog --startpos=123 start
>   and so on.

The reasoning here is somewhat complex and I am not happy with the
status quo, so I like getting input here.

The individual verbs mean:
* init: create a replication slot
* start: continue streaming in an existing replication slot
* stop: remove replication slot

The reason you cannot specify anything with --stop is that a) --start
streams until you abort the utility. So there's no chance of running
--stop after it. b) --init and --stop seems like a pointless combination
since you cannot actually do anything with the slot.
--init and --start combined, on the other hand, are useful for testing,
which is why I allow them so far, but I wouldn't have problems removing
that capability.

The reason you cannot combine --init or --init --start with --startpos
is that --startpos has to refer to a location that could actually have
been streamed to the client. Before a replication slot is established the
client doesn't know anything about such an address, so --init --start
cannot know any useful --startpos, that's why it's forbidden to pass
one.

The idea behind startpos is that you can tell the server "I have
replayed transactions up to this LSN" and the server will give you
only transactions that have committed after this.
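
To make the allowed combinations concrete, a quick sketch (slot name
and connection options elided; only the options discussed above):

  pg_receivellog --init --start                  # create a slot, then stream (testing)
  pg_receivellog --start --startpos=0/17D59B8    # resume an existing slot
  pg_receivellog --stop                          # drop the slot; combines with nothing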

>  Also, we need SGML docs for this new utility.

And a lot more than only for this utility :(

> Any particular reason for removing this line:
> -/* Get a new XLogReader */
> +
>  extern XLogReaderState *XLogReaderAllocate(XLogPageReadCB pagereadfunc,
>                    void *private_data);

Hrmpf. Merge error. I've integrated too many different versions of too
different xlogreaders ;)

> I don't see the point of XLogRecordBuffer.record_data; we already have a
> pointer to the XLogRecord, and the data can readily be obtained using
> XLogRecGetData.  So why provide the same thing twice?  It seems to me
> that if instead of passing the XLogRecordBuffer we just provide the
> XLogRecord, and separately the "origptr" where needed, we could avoid
> having to expose the XLogRecordBuffer stuff unnecessarily.

By now we also need the end location of a wal record. So we have to pass
three addresses around for everything, which isn't very convenient. If
you vastly prefer passing around three parameters I can do that, but I'd
rather not.
The original reason for doing so was, to be honest, that my own
xlogreader's API was different...
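
(For reference, roughly the struct in question; a sketch from memory,
the exact layout in this version of the patch may differ:

    typedef struct XLogRecordBuffer
    {
        XLogRecPtr  origptr;        /* start LSN of the record */
        XLogRecPtr  endptr;         /* end LSN of the record */
        XLogRecord  record;         /* the record header */
        char       *record_data;    /* payload, == XLogRecGetData() */
    } XLogRecordBuffer;

so the start location, end location and record travel as one unit.)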

> In this comment:
> + * FIXME: We need something resembling the real SnapshotNow to handle things
> + * like enum lookups from indices correctly.
> what do we need to consider in light of the new comment proposed by Robert in
> CA+TgmobvTjRj_doXxQ0wgA1a1JLYPVYqtR3m+Cou_ousabnmXg@mail.gmail.com

I did most of the code changes for this, but this made me realize that
there are quite some more comments and even a function name to be
adapted. Will work on that.

Thanks!

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v5

From
Andres Freund
Date:
Hi,

I've attached a couple of the preliminary patches to $subject which I've
recently cleaned up in the hope that we can continue improving on those
in a piecemeal fashion.
I am preparing submission of a newer version of the major patch but
unfortunately progress on that is slower than I'd like...

In order of their chance of being applied individually, they are:

0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
* benefits hot standby startup
0003 wal_decoding: Allow walsender's to connect to a specific database
* biggest problem is how to specify the connection we connect
  to. Currently with the patch walsender connects to a database if it's
  not named "replication" (via dbname). Perhaps it's better to invent a
  replication_dbname parameter?
0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
* Pretty trivial and boring.
0007 wal_decoding: Add information about a tables primary key to struct RelationData
* Could be used in the matview refresh code
0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v5

From
Robert Haas
Date:
On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> * benefits hot standby startup

Review:

1. I think more comments are needed here to explain why we need this.
I don't know if the comments should go into the functions modified by
this patch or in some other location, but I don't find what's here now
adequate for understanding.

2. I think the variable naming could be better.  If nothing else, I'd
spell out "snapshot" rather than abbreviating it to "snap".  I'd also
add comments explaining what each of those variables does.  And why
isn't log_snap_interval_ms a #define rather than a variable?  (Don't
even talk to me about using gdb on a running instance.  If you're even
thinking about that, this needs to be a GUC.)

3. Why does LogCurrentRunningXacts() need to call
XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
sync'd in a reasonably timely fashion; I can't see off-hand why this
one should need special handling.

> 0003 wal_decoding: Allow walsender's to connect to a specific database
> * biggest problem is how to specify the connection we connect
>   to. Currently with the patch walsender connects to a database if it's
>   not named "replication" (via dbname). Perhaps it's better to invent a
>   replication_dbname parameter?

I understand why logical replication needs to connect to a database,
but I don't understand why any other walsender would need to connect
to a database.  Absent a clear use case for such a thing, I don't
think we should allow it.  Ignorant suggestion: perhaps the database
name could be stored in the logical replication slot.

> 0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
> * Pretty trivial and boring.

Seems fine.

> 0007 wal_decoding: Add information about a tables primary key to struct RelationData
> * Could be used in the matview refresh code

I think you and Kevin should discuss whether this is actually the
right way to do this.  ISTM that if logical replication and
materialized views end up selecting different approaches to this
problem, everybody loses.

> 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

I'm still unconvinced we want this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v5

From
Andres Freund
Date:
On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> > * benefits hot standby startup
> 
> Review:
> 
> 1. I think more comments are needed here to explain why we need this.
> I don't know if the comments should go into the functions modified by
> this patch or in some other location, but I don't find what's here now
> adequate for understanding.

Hm. What information are you actually missing? I guess the
XLogSetAsyncXactLSN() needs a bit more context based on your question,
what else?
Not sure if it makes sense to explain in detail why it helps us to get
into a consistent state faster?

> 2. I think the variable naming could be better.  If nothing else, I'd
> spell out "snapshot" rather than abbreviating it to "snap".  I'd also
> add comments explaining what each of those variables does.

Ok.

> And why
> isn't log_snap_interval_ms a #define rather than a variable?  (Don't
> even talk to me about using gdb on a running instance.  If you're even
> thinking about that, this needs to be a GUC.)

Ugh. It certainly doesn't have anything to do with wanting to change it
on a running system using gdb. Brrr.

I think I wanted it to be a constant variable but forgot the const. I
personally prefer 'static const' to #define's if it's legal C, but I
guess the project's style differs, so I'll change that.

> 3. Why does LogCurrentRunningXacts() need to call
> XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
> sync'd in a reasonably timely fashion; I can't see off-hand why this
> one should need special handling.

No, we don't force writing out wal records in a timely fashion if
there's no pressure in wal_buffers, basically only on commits and
various XLogFlush()es. It doesn't make much of a difference if the
entire system is busy, but if it's not the wal writer will sleep. The
alternative would be to XLogFlush() the record, but that would actually
block, which isn't really what we want/need.
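
For illustration, a minimal sketch of the pattern in question (not the
patch verbatim; rdata stands for the record's data chain built earlier):

    /* insert the running-xacts record ... */
    recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);

    /*
     * ... and nudge the wal writer so the record reaches disk even on an
     * otherwise idle system. XLogFlush(recptr) would also work, but blocks.
     */
    XLogSetAsyncXactLSN(recptr);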

> > 0003 wal_decoding: Allow walsender's to connect to a specific database
> > * biggest problem is how to specify the connection we connect
> >   to. Currently with the patch walsender connects to a database if it's
> >   not named "replication" (via dbname). Perhaps it's better to invent a
> >   replication_dbname parameter?

> I understand why logical replication needs to connect to a database,
> but I don't understand why any other walsender would need to connect
> to a database.

Well, logical replication actually streams out data using the walsender,
so that's the reason why I want to add it there. But there have been
cases in the past where we wanted to do stuff in the walsender that need
database access, but we couldn't do so because you cannot connect to
one.

>  Absent a clear use case for such a thing, I don't
> think we should allow it.  Ignorant suggestion: perhaps the database
> name could be stored in the logical replication slot.

The problem is that you need to InitPostgres() with a database. You
cannot do that again, after connecting with an empty database which we
do in a plain walsender.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical changeset generation v5

From
Robert Haas
Date:
On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> 1. I think more comments are needed here to explain why we need this.
>> I don't know if the comments should go into the functions modified by
>> this patch or in some other location, but I don't find what's here now
>> adequate for understanding.
>
> Hm. What information are you actually missing? I guess the
> XLogSetAsyncXactLSN() needs a bit more context based on your question,
> what else?
> Not sure if it makes sense to explain in detail why it helps us to get
> into a consistent state faster?

Well, we must have had some idea in mind when the original Hot Standby
patch went in that doing this once per checkpoint was good enough.
Now we think we need it every 15 seconds, but not more or less often.
So, why the change of heart?  To my way of thinking, it seems as
though we ought to always begin replay at a checkpoint, so the standby
ought always to see one of these records immediately.  Obviously
that's not good enough, but why not?  And why is every 15 seconds good
enough?

>> 3. Why does LogCurrentRunningXacts() need to call
>> XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
>> sync'd in a reasonably timely fashion; I can't see off-hand why this
>> one should need special handling.
>
> No, we don't force writing out wal records in a timely fashion if
> there's no pressure in wal_buffers, basically only on commits and
> various XLogFlush()es. It doesn't make much of a difference if the
> entire system is busy, but if it's not the wal writer will sleep. The
> alternative would be to XLogFlush() the record, but that would actually
> block, which isn't really what we want/need.

The WAL writer is supposed to call XLogBackgroundFlush() every time
WalWriterDelay expires.  Yeah, it can hibernate, but if it's
hibernating, then we should respect that decision for this WAL record
type also.

>> > 0003 wal_decoding: Allow walsender's to connect to a specific database
>> > * biggest problem is how to specify the connection we connect
>> >   to. Currently with the patch walsender connects to a database if it's
>> >   not named "replication" (via dbname). Perhaps it's better to invent a
>> >   replication_dbname parameter?
>
>> I understand why logical replication needs to connect to a database,
>> but I don't understand why any other walsender would need to connect
>> to a database.
>
> Well, logical replication actually streams out data using the walsender,
> so that's the reason why I want to add it there. But there have been
> cases in the past where we wanted to do stuff in the walsender that need
> database access, but we couldn't do so because you cannot connect to
> one.

Could you be more specific?

>>  Absent a clear use case for such a thing, I don't
>> think we should allow it.  Ignorant suggestion: perhaps the database
>> name could be stored in the logical replication slot.
>
> The problem is that you need to InitPostgres() with a database. You
> cannot do that again, after connecting with an empty database which we
> do in a plain walsender.

Are you saying that the logical replication slot can't be read before
calling InitPostgres()?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v5

From
Andres Freund
Date:
On 2013-09-03 12:22:22 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> 1. I think more comments are needed here to explain why we need this.
> >> I don't know if the comments should go into the functions modified by
> >> this patch or in some other location, but I don't find what's here now
> >> adequate for understanding.
> >
> > Hm. What information are you actually missing? I guess the
> > XLogSetAsyncXactLSN() needs a bit more context based on your question,
> > what else?
> > Not sure if it makes sense to explain in detail why it helps us to get
> > into a consistent state faster?

> Well, we must have had some idea in mind when the original Hot Standby
> patch went in that doing this once per checkpoint was good enough.
> Now we think we need it every 15 seconds, but not more or less often.
> So, why the change of heart?

I think the primary reason for that was that it was a pretty
complicated patchset and we needed to start somewhere. By now we do have
reports of standbys taking their time to get consistent.

> To my way of thinking, it seems as though we ought to always begin
> replay at a checkpoint, so the standby ought always to see one of
> these records immediately.  Obviously that's not good enough, but why
> not?

We always see one after the checkpoint (well, actually before the
checkpoint record, but ...), correct. The problem is just that reading a
single xact_running record doesn't automatically make you consistent. If
there's a single suboverflowed transaction running on the primary when
the xl_running_xacts is logged we won't be able to switch to
consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun
and some optimizations.
Since the only place where we currently have the information to
potentially become consistent is ProcArrayApplyRecoveryInfo() we will
have to wait checkpoint_timeout time till we get consistent. Which
sucks as there are good arguments to set that to 1h.
That especially sucks as you lose consistency every time you restart
the standby...

> And why is every 15 seconds good enough?

Waiting 15s to become consistent instead of checkpoint_timeout seems to
be ok to me and to be a good tradeoff between overhead and waiting. We
can certainly discuss other values or making it configurable. The latter
seemed to be unnecessary to me, but I don't have a problem
implementing it. I just don't want to document it :P

> >> 3. Why does LogCurrentRunningXacts() need to call
> >> XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
> >> sync'd in a reasonably timely fashion; I can't see off-hand why this
> >> one should need special handling.
> >
> > No, we don't force writing out wal records in a timely fashion if
> > there's no pressure in wal_buffers, basically only on commits and
> > various XLogFlush()es. It doesn't make much of a difference if the
> > entire system is busy, but if it's not the wal writer will sleep. The
> > alternative would be to XLogFlush() the record, but that would actually
> > block, which isn't really what we want/need.

> The WAL writer is supposed to call XLogBackgroundFlush() every time
> WalWriterDelay expires.  Yeah, it can hibernate, but if it's
> hibernating, then we should respect that decision for this WAL record
> type also.

Why should we respect it? There is work to be done and the wal writer
has no way of knowing that without us telling it. Normally we rely on
commit records and XLogFlush()es to trigger the wal writer.
Alternatively we can start a transaction and set synchronous_commit =
off, but that seems like a complication to me.

> >> I understand why logical replication needs to connect to a database,
> >> but I don't understand why any other walsender would need to connect
> >> to a database.
> >
> > Well, logical replication actually streams out data using the walsender,
> > so that's the reason why I want to add it there. But there have been
> > cases in the past where we wanted to do stuff in the walsender that need
> > database access, but we couldn't do so because you cannot connect to
> > one.

> Could you be more specific?

I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
before.

> >>  Absent a clear use case for such a thing, I don't
> >> think we should allow it.  Ignorant suggestion: perhaps the database
> >> name could be stored in the logical replication slot.
> >
> > The problem is that you need to InitPostgres() with a database. You
> > cannot do that again, after connecting with an empty database which we
> > do in a plain walsender.
> 
> Are you saying that the logical replication slot can't be read before
> calling InitPostgres()?

The slot can be read just fine, but we won't know that we should do
that. Walsender accepts commands via PostgresMain()'s mainloop which has
done an InitPostgres(dbname) before. Which we need to do because we need
the environment it sets up.

The database is stored in the slots btw (as oid, not as a name though) ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v5

From
Robert Haas
Date:
On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> To my way of thinking, it seems as though we ought to always begin
>> replay at a checkpoint, so the standby ought always to see one of
>> these records immediately.  Obviously that's not good enough, but why
>> not?
>
> We always see one after the checkpoint (well, actually before the
> checkpoint record, but ...), correct. The problem is just that reading a
> single xact_running record doesn't automatically make you consistent. If
> there's a single suboverflowed transaction running on the primary when
> the xl_running_xacts is logged we won't be able to switch to
> consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun
> and some optimizations.
> Since the only place where we currently have the information to
> potentially become consistent is ProcArrayApplyRecoveryInfo() we will
> have to wait checkpoint_timeout time till we get consistent. Which
> sucks as there are good arguments to set that to 1h.
> That especially sucks as you lose consistency every time you restart the
> standby...

Right, OK.

>> And why is every 15 seconds good enough?
>
> Waiting 15s to become consistent instead of checkpoint_timeout seems to
> be ok to me and to be a good tradeoff between overhead and waiting. We
> can certainly discuss other values or making it configurable. The latter
> seemed to be unnecessary to me, but I don't have a problem
> implementing it. I just don't want to document it :P

I don't think it particularly needs to be configurable, but I wonder
if we can't be a bit smarter about when we do it.  For example,
suppose we logged it every 15 s but only until we log a non-overflowed
snapshot.  I realize that the overhead of a WAL record every 15
seconds is fairly small, but the load on some systems is all but
nonexistent.  It would be nice not to wake up the HD unnecessarily.

>> The WAL writer is supposed to call XLogBackgroundFlush() every time
>> WalWriterDelay expires.  Yeah, it can hibernate, but if it's
>> hibernating, then we should respect that decision for this WAL record
>> type also.
>
> Why should we respect it?

Because I don't see any reason to believe that this WAL record is any
more important or urgent than any other WAL record that might get
logged.

>> >> I understand why logical replication needs to connect to a database,
>> >> but I don't understand why any other walsender would need to connect
>> >> to a database.
>> >
>> > Well, logical replication actually streams out data using the walsender,
>> > so that's the reason why I want to add it there. But there have been
>> > cases in the past where we wanted to do stuff in the walsender that need
>> > database access, but we couldn't do so because you cannot connect to
>> > one.
>
>> Could you be more specific?
>
> I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
> before.

It seems we need some more design there.  Perhaps entering replication
mode could be triggered by writing either dbname=replication or
replication=yes.  But then, do the replication commands simply become
SQL commands?  I've certainly seen hackers use them that way.  And I
can imagine that being a sensible approach, but this patch seems like
it's only covering a fairly small fraction of what really ought to be
a single commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v5

From
Andres Freund
Date:
On 2013-09-03 15:56:15 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> And why is every 15 seconds good enough?
> >
> > Waiting 15s to become consistent instead of checkpoint_timeout seems to
> > be ok to me and to be a good tradeoff between overhead and waiting. We
> > can certainly discuss other values or making it configurable. The latter
> > seemed to be unnecessary to me, but I don't have a problem
> > implementing it. I just don't want to document it :P
> 
> I don't think it particularly needs to be configurable, but I wonder
> if we can't be a bit smarter about when we do it.  For example,
> suppose we logged it every 15 s but only until we log a non-overflowed
> snapshot.

There are actually more benefits than just overflowed snapshots (pruning
of the known xids machinery, exclusive lock cleanup).

> I realize that the overhead of a WAL record every 15
> seconds is fairly small, but the load on some systems is all but
> nonexistent.  It would be nice not to wake up the HD unnecessarily.

The patch as-is only writes if there has been WAL written since the last
time it logged a running_xacts. I think it's not worth building more
smarts than that?
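
Roughly this (a sketch; last_snapshot_lsn is an assumed name for
wherever the patch remembers the previously logged position):

    /* skip if no WAL has been written since the last running_xacts record */
    if (GetInsertRecPtr() == last_snapshot_lsn)
        return;

    LogStandbySnapshot();
    last_snapshot_lsn = GetInsertRecPtr();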

> >> The WAL writer is supposed to call XLogBackgroundFlush() every time
> >> WalWriterDelay expires.  Yeah, it can hibernate, but if it's
> >> hibernating, then we should respect that decision for this WAL record
> >> type also.
> >
> > Why should we respect it?
> 
> Because I don't see any reason to believe that this WAL record is any
> more important or urgent than any other WAL record that might get
> logged.

I can't follow the logic behind that statement. Just about all WAL
records are either pretty immediately flushed afterwards or are done in
the context of a transaction where we flush (or do a
XLogSetAsyncXactLSN) at transaction commit.

XLogBackgroundFlush() won't necessarily flush the running_xacts
record. Unless you've set the async xact lsn it will only flush complete
blocks. So what can happen (I've seen it more than once in testing;
it took me a while to debug) is that a checkpoint is started in a busy
period but nothing happens after it has finished. Since the
checkpoint-triggered running_xacts record is logged *before* we do the
smgr flush, it is still overflowed. Then, after activity has died down,
the bgwriter issues the running_xacts record, but it sits in a
partially filled block and thus never gets flushed.

To me the alternatives are to do an XLogSetAsyncXactLSN() or an
XLogFlush(). The latter is more aggressive and can block for a
measurable amount of time, which is why I don't want to do it in the
bgwriter.

> >> >> I understand why logical replication needs to connect to a database,
> >> >> but I don't understand why any other walsender would need to connect
> >> >> to a database.
> >> >
> >> > Well, logical replication actually streams out data using the walsender,
> >> > so that's the reason why I want to add it there. But there have been
> >> > cases in the past where we wanted to do stuff in the walsender that need
> >> > database access, but we couldn't do so because you cannot connect to
> >> > one.
> >
> >> Could you be more specific?
> >
> > I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
> > before.
> 
> It seems we need some more design there.  Perhaps entering replication
> mode could be triggered by writing either dbname=replication or
> replication=yes.  But then, do the replication commands simply become
> SQL commands?  I've certainly seen hackers use them that way.  And I
> can imagine that being a sensible approach, but this patch seems like
> it's only covering a fairly small fraction of what really ought to be
> a single commit.

Yes. I think you're right that we need more input/design here. I've
previously started threads about it, but nobody replied :(.

The problem with using dbname=replication as a trigger for anything is
that we actually allow databases to be created with that name. Perhaps
that was a design mistake.

I wondered about turning replication from a boolean into something like
off|0, on|1, database. dbname= gets only used in the latter
variant. That would be compatible with previous versions and would even
support using old tools (since all of them seem to do replication=1).
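
As a conninfo sketch of that proposal (nothing released implements it
yet):

  replication=1                        # physical walsender, as today; dbname ignored
  replication=database dbname=mydb     # walsender connected to mydb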

> But then, do the replication commands simply become
> SQL commands?  I've certainly seen hackers use them that way.

I don't think that it's a good way at this point to make them to plain
SQL. There is more infrastructure (signal handlers, permissions,
different timeouts) & memory required for walsenders, so using plain SQL
there seems beyond the scope of this.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v5

From
Robert Haas
Date:
On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think it particularly needs to be configurable, but I wonder
>> if we can't be a bit smarter about when we do it.  For example,
>> suppose we logged it every 15 s but only until we log a non-overflowed
>> snapshot.
>
> There are actually more benefits than just overflowed snapshots (pruning
> of the known xids machinery, exclusive lock cleanup).

I know that, but I thought the master and slave could only lose sync
on those things after a master crash and that once per checkpoint
cycle was enough for those other benefits.  Am I wrong?

> The patch as-is only writes if there has been WAL written since the last
> time it logged a running_xacts. I think it's not worth building more
> smarts than that?

Hmm, maybe.

>> Because I don't see any reason to believe that this WAL record is any
>> more important or urgent than any other WAL record that might get
>> logged.
>
> I can't follow the logic behind that statement. Just about all WAL
> records are either pretty immediately flushed afterwards or are done in
> the context of a transaction where we flush (or do a
> XLogSetAsyncXactLSN) at transaction commit.
>
> XLogBackgroundFlush() won't necessarily flush the running_xacts
> record.

OK, this was the key point I was missing.

>> It seems we need some more design there.  Perhaps entering replication
>> mode could be triggered by writing either dbname=replication or
>> replication=yes.  But then, do the replication commands simply become
>> SQL commands?  I've certainly seen hackers use them that way.  And I
>> can imagine that being a sensible approach, but this patch seems like
>> it's only covering a fairly small fraction of what really ought to be
>> a single commit.
>
> Yes. I think you're right that we need more input/design here. I've
> previously started threads about it, but nobody replied :(.
>
> The problem with using dbname=replication as a trigger for anything is
> that we actually allow databases to be created with that name. Perhaps
> that was a design mistake.

It seemed like a good idea at the time, but maybe it wasn't.  I'm not
sure where to go with it at this point; a forcible backward
compatibility break would probably screw things up for a lot of
people.

> I wondered about turning replication from a boolean into something like
> off|0, on|1, database. dbname= gets only used in the latter
> variant. That would be compatible with previous versions and would even
> support using old tools (since all of them seem to do replication=1).

I don't love that, but I don't hate it, either.  But it still doesn't
answer the following question, which I think is important: if I (or
someone else) commits this patch, how will that make things better for
users?  At the moment it's just adding a knob that doesn't do anything
for you when you twist it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v5

From
Andres Freund
Date:
On 2013-09-04 10:02:05 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think it particularly needs to be configurable, but I wonder
> >> if we can't be a bit smarter about when we do it.  For example,
> >> suppose we logged it every 15 s but only until we log a non-overflowed
> >> snapshot.
> >
> > There are actually more benefits than just overflowed snapshots (pruning
> > of the known xids machinery, exclusive lock cleanup).

> I know that, but I thought the master and slave could only lose sync
> on those things after a master crash and that once per checkpoint
> cycle was enough for those other benefits.  Am I wrong?

The xid tracking can keep track without the additional records, but it
sometimes needs a good bit more memory to do so if the primary burns
through xids quite fast.  Every time we see a running_xacts record we
can do cleanup (that's the ExpireOldKnownAssignedTransactionIds() in
ProcArrayApplyRecoveryInfo()).

> > The problem with using dbname=replication as a trigger for anything is
> > that we actually allow databases to be created with that name. Perhaps
> > that was a design mistake.
> 
> It seemed like a good idea at the time, but maybe it wasn't.  I'm not
> sure where to go with it at this point; a forcible backward
> compatibility break would probably screw things up for a lot of
> people.

Yes, breaking things now doesn't seem like a good idea.

> > I wondered about turning replication from a boolean into something like
> > off|0, on|1, database. dbname= gets only used in the latter
> > variant. That would be compatible with previous versions and would even
> > support using old tools (since all of them seem to do replication=1).
> 
> I don't love that, but I don't hate it, either.

Ok. Will update the patch that way. Seems better than its current state.

> But it still doesn't
> answer the following question, which I think is important: if I (or
> someone else) commits this patch, how will that make things better for
> users?  At the moment it's just adding a knob that doesn't do anything
> for you when you twist it.

I am not sure it's a good idea to commit it before we're sure we're going
to commit the changeset extraction. It's an independently reviewable and
testable piece of code that's simple enough to understand quickly in
contrast to the large changeset extraction patch. That's why I kept it
separate.
On the other hand, as you know, it's not without precedent to commit
pieces of infrastructure that aren't really useful for the end user
(think FDW).

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
> 
> I'm still unconvinced we want this.

Ok, so the reason for the existence of this patch is that currently
there is no way to represent an "unset" CommandId. This is a problem for
the following patches because we need to log the cmin, cmax of catalog
rows and obviously there can be rows where cmax is unset.
The reason I chose to change the definition of CommandIds is that the
other ondisk types we use like TransactionIds, XLogRecPtrs and such have
an "invalid" type, CommandIds don't. Changing their definition to have 0
- analogous to the previous examples - as their invalid value is not a
problem because CommandIds from pg_upgraded clusters may never be used
for anything. Going from 2^32 to 2^32-1 possible CommandIds doesn't seem
like a problem to me. Imo the CommandIds should have been defined that
way from the start.
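
To illustrate what that buys us, a sketch of the overflow check in
CommandCounterIncrement() with the invalid value at the top of the
range (the variant the thread settles on below):

    currentCommandId += 1;
    if (currentCommandId == InvalidCommandId)
    {
        currentCommandId -= 1;      /* back off to a sane value */
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot have more than 2^32-2 commands in a transaction")));
    }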

Makes some sense?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - introduction of InvalidCommandId

From
Robert Haas
Date:
On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
>> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
>>
>> I'm still unconvinced we want this.
>
> Ok, so the reason for the existence of this patch is that currently
> there is no way to represent an "unset" CommandId. This is a problem for
> the following patches because we need to log the cmin, cmax of catalog
> rows and obviously there can be rows where cmax is unset.

For heap tuples, we solve this problem by using flag bits.  Why not
adopt the same approach?

> The reason I chose to change the definition of CommandIds is that the
> other ondisk types we use like TransactionIds, XLogRecPtrs and such have
> an "invalid" type, CommandIds don't. Changing their definition to have 0
> - analogous to the previous examples - as their invalid value is not a
> problem because CommandIds from pg_upgraded clusters may never be used
> for anything. Going from 2^32 to 2^32-1 possible CommandIds doesn't seem
> like a problem to me. Imo the CommandIds should have been defined that
> way from the start.
>
> Makes some sense?

I don't have a problem with this if other people think it's a good
idea.  But I think it needs a few +1s and not too many -1s first, and
so far (AFAIK) no one else has weighed in with an opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v5

From
Andres Freund
Date:
Hi,

On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> > * benefits hot standby startup

I tried to update the patch to address the comments you made.

> > 0003 wal_decoding: Allow walsender's to connect to a specific database
> > * biggest problem is how to specify the connection we connect
> >   to. Currently with the patch walsender connects to a database if it's
> >   not named "replication" (via dbname). Perhaps it's better to invent a
> >   replication_dbname parameter?

I've updated the patch so it extends the "replication" startup parameter
to not only specify a boolean but also "database". In the latter case it
will connect to the database specified in "dbname".
As discussed downthread, this patch doesn't have an immediate advantage
for users until the changeset extraction patch itself is
applied. Whether or not it should be applied separately is unclear.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
On 2013-09-05 12:44:18 -0400, Robert Haas wrote:
> On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> >> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
> >>
> >> I'm still unconvinced we want this.
> >
> > Ok, so the reason for the existence of this patch is that currently
> > there is no way to represent an "unset" CommandId. This is a problem for
> > the following patches because we need to log the cmin, cmax of catalog
> > rows and obviously there can be rows where cmax is unset.
> 
> For heap tuples, we solve this problem by using flag bits.  Why not
> adopt the same approach?

We can, though it makes the amount of data stored/logged slightly larger
and seems to lead to less idiomatic code to me. So if there's another
-1 I'll go that way.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - primary/candidate key in relcache

From
Andres Freund
Date:
Hi Kevin,

On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0007 wal_decoding: Add information about a tables primary key to struct RelationData
> > * Could be used in the matview refresh code

> I think you and Kevin should discuss whether this is actually the
> right way to do this.  ISTM that if logical replication and
> materialized views end up selecting different approaches to this
> problem, everybody loses.

The patch we're discussing here adds a new struct RelationData field
called 'rd_primary' (should possibly be renamed) which contains
information about the "best" candidate key available for a table.

From the header comments:
    /*
     * The 'best' primary or candidate key that has been found, only set
     * correctly if RelationGetIndexList has been called/rd_indexvalid > 0.
     *
     * Indexes are chosen in the following order:
     * * Primary Key
     * * oid index
     * * the first (OID order) unique, immediate, non-partial and
     *   non-expression index over one or more NOT NULL'ed columns
     */
    Oid rd_primary;

I thought we could use that in matview.c:refresh_by_match_merge() to
select a more efficient diff if rd_primary has a valid index. In that
case you'd only need to compare that index's fields, which should result
in a more efficient plan.

Maybe it's also useful in other cases for you?

If it's relevant at all, would you like to have a different priority
list than the one above?

Regards,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: lcr v5 - introduction of InvalidCommandId

From
Robert Haas
Date:
On Thu, Sep 5, 2013 at 12:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-05 12:44:18 -0400, Robert Haas wrote:
>> On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
>> >> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
>> >>
>> >> I'm still unconvinced we want this.
>> >
>> > Ok, so the reason for the existence of this patch is that currently
>> > there is no way to represent an "unset" CommandId. This is a problem for
>> > the following patches because we need to log the cmin, cmax of catalog
>> > rows and obviously there can be rows where cmax is unset.
>>
>> For heap tuples, we solve this problem by using flag bits.  Why not
>> adopt the same approach?
>
> We can, though it makes the amount of data stored/logged slightly larger
> and seems to lead to less idiomatic code to me. So if there's another
> -1 I'll go that way.

OK.  Consider me more of a -0 than a -1.  Like I say, I don't really
want to block it; I just don't feel comfortable committing it unless a
few other people say something like "I don't see a problem with that".
Or maybe point me to relevant changeset extraction code that's going
to get messier.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lcr v5 - introduction of InvalidCommandId

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> OK.  Consider me more of a -0 than a -1.  Like I say, I don't really
> want to block it; I just don't feel comfortable committing it unless a
> few other people say something like "I don't see a problem with that".

FWIW, I've always thought it was a wart that there wasn't a recognized
InvalidCommandId value.  It was never pressing to fix it before, but
if LCR needs it, let's do so.  I definitely *don't* find it cleaner to
eat up another flag bit to avoid that.  We don't have many to spare.

Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
but I suppose we can't have that without an on-disk compatibility break.
        regards, tom lane



Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
Hi,

Thanks for weighing in.

On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > OK.  Consider me more of a -0 than a -1.  Like I say, I don't really
> > want to block it; I just don't feel comfortable committing it unless a
> > few other people say something like "I don't see a problem with that".
> 
> FWIW, I've always thought it was a wart that there wasn't a recognized
> InvalidCommandId value.  It was never pressing to fix it before, but
> if LCR needs it, let's do so.

Yes, it's a bit anomalous compared to the other types.

>  I definitely *don't* find it cleaner to
> eat up another flag bit to avoid that.  We don't have many to spare.

It wouldn't need to be a flag bit in any existing struct, so that's not
a problem.

> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> but I suppose we can't have that without an on-disk compatibility break.

The patch actually does change it exactly that way. My argument for that
being valid is that CommandIds don't play any role outside of their own
transaction. Now, somebody could argue that SELECT cmin, cmax can be
done outside the transaction, but: Those values are already pretty much
meaningless today since cmin/cmax have been merged. They also don't
check whether the field is initialized at all.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - introduction of InvalidCommandId

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
>> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
>> but I suppose we can't have that without an on-disk compatibility break.

> The patch actually does change it exactly that way.

Oh.  I hadn't looked at the patch, but I had (mis)read what Robert said
to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
while leaving FirstCommandId alone.  That would make more sense to me as
(1) it doesn't change the interpretation of anything that's (likely to be)
on disk; (2) it allows the check for overflow in CommandCounterIncrement
to not involve recovering from an *actual* overflow.  With the horsing
around we've been seeing from the gcc boys lately, I don't have a warm
feeling about whether they won't break that test someday on the grounds
that "overflow is undefined behavior".
        regards, tom lane



Re: lcr v5 - introduction of InvalidCommandId

From
Peter Geoghegan
Date:
On Thu, Sep 5, 2013 at 11:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
>> but I suppose we can't have that without an on-disk compatibility break.
>
> The patch actually does change it exactly that way. My argument for that
> being valid is that CommandIds don't play any role outside of their own
> transaction.

Right. It seems like this should probably be noted in the
documentation under "5.4. System Columns" -- I just realized that it
isn't.

-- 
Peter Geoghegan



Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
On 2013-09-05 14:37:01 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> >> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> >> but I suppose we can't have that without an on-disk compatibility break.
> 
> > The patch actually does change it exactly that way.
> 
> Oh.  I hadn't looked at the patch, but I had (mis)read what Robert said
> to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
> while leaving FirstCommandId alone.  That would make more sense to me as
> (1) it doesn't change the interpretation of anything that's (likely to be)
> on disk; (2) it allows the check for overflow in CommandCounterIncrement
> to not involve recovering from an *actual* overflow.  With the horsing
> around we've been seeing from the gcc boys lately

Ok, I can do it that way. LCR obviously shouldn't care.

> I don't have a warm
> feeling about whether they won't break that test someday on the grounds
> that "overflow is undefined behavior".

Unsigned overflow is pretty strictly defined, so I don't see much danger
there. Also, we'd feel the pain pretty definitely with xids...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
On 2013-09-05 21:02:44 +0200, Andres Freund wrote:
> On 2013-09-05 14:37:01 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> > >> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> > >> but I suppose we can't have that without an on-disk compatibility break.
> >
> > > The patch actually does change it exactly that way.
> >
> > Oh.  I hadn't looked at the patch, but I had (mis)read what Robert said
> > to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
> > while leaving FirstCommandId alone.  That would make more sense to me as
> > (1) it doesn't change the interpretation of anything that's (likely to be)
> > on disk; (2) it allows the check for overflow in CommandCounterIncrement
> > to not involve recovering from an *actual* overflow.  With the horsing
> > around we've been seeing from the gcc boys lately
>
> Ok, I can do it that way. LCR obviously shouldn't care.

It doesn't care to the point that the patch already does exactly what
you propose. It's just my memory that remembered things differently.

So, a very slightly updated patch attached.
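
For the record, the definitions this settles on (as they ended up in
src/include/c.h):

    typedef uint32 CommandId;

    #define FirstCommandId    ((CommandId) 0)
    #define InvalidCommandId  (~(CommandId) 0)    /* i.e. 0xFFFFFFFF */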

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: lcr v5 - introduction of InvalidCommandId

From
Robert Haas
Date:
On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Oh.  I hadn't looked at the patch, but I had (mis)read what Robert said
>> > to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
>> > while leaving FirstCommandId alone.  That would make more sense to me as
>> > (1) it doesn't change the interpretation of anything that's (likely to be)
>> > on disk; (2) it allows the check for overflow in CommandCounterIncrement
>> > to not involve recovering from an *actual* overflow.  With the horsing
>> > around we've been seeing from the gcc boys lately
>>
>> Ok, I can do it that way. LCR obviously shouldn't care.
>
> It doesn't care to the point that the patch already does exactly what
> you propose. It's just my memory that remembered things differently.
>
> So, a very slightly updated patch attached.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lcr v5 - introduction of InvalidCommandId

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> So, a very slightly updated patch attached.

> Committed.

Hmm ... shouldn't this patch adjust the error messages in
CommandCounterIncrement?  We just took away one possible command.
It's pretty nitpicky, especially since many utility commands do
more than one CommandCounterIncrement, but still ...
        regards, tom lane



Re: lcr v5 - introduction of InvalidCommandId

From
Andres Freund
Date:
On 2013-09-09 18:43:51 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> So, a very slightly updated patch attached.
> 
> > Committed.
> 
> Hmm ... shouldn't this patch adjust the error messages in
> CommandCounterIncrement?  We just took away one possible command.
> It's pretty nitpicky, especially since many utility commands do
> more than one CommandCounterIncrement, but still ...

Hm. You're talking about "cannot have more than 2^32-2 commands in a
transaction"? If so, the patch and the commit seem to have adjusted that?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr v5 - introduction of InvalidCommandId

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-09-09 18:43:51 -0400, Tom Lane wrote:
>> Hmm ... shouldn't this patch adjust the error messages in
>> CommandCounterIncrement?

> Hm. You're talking about "cannot have more than 2^32-2 commands in a
> transaction"? If so, the patch and the commit seem to have adjusted that?

Oh!  That's what I get for going on memory instead of re-reading the
commit.  Sorry, never mind the noise.
        regards, tom lane



Re: lcr v5 - primary/candidate key in relcache

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>>> 0007 wal_decoding: Add information about a tables primary key to
>>>  struct RelationData
>>> * Could be used in the matview refresh code
>
>> I think you and Kevin should discuss whether this is actually the
>> right way to do this.  ISTM that if logical replication and
>> materialized views end up selecting different approaches to this
>> problem, everybody loses.
>
> The patch we're discussing here adds a new struct RelationData field
> called 'rd_primary' (should possibly be renamed) which contains
> information about the "best" candidate key available for a table.
>
> From the header comments:
>     /*
>     * The 'best' primary or candidate key that has been found, only set
>     * correctly if RelationGetIndexList has been called/rd_indexvalid > 0.
>     *
>     * Indexes are chosen in the following order:
>     * * Primary Key
>     * * oid index
>     * * the first (OID order) unique, immediate, non-partial and
>     *  non-expression index over one or more NOT NULL'ed columns
>     */
>     Oid rd_primary;
>
> I thought we could use that in matview.c:refresh_by_match_merge() to
> select a more efficient diff if rd_primary has a valid index. In that
> case you'd only need to compare that index's fields, which should result
> in a more efficient plan.
>
> Maybe it's also useful in other cases for you?
>
> If it's relevant at all, would you like to have a different priority
> list than the one above?

My first thought was that it was necessary to use all unique,
immediate, non-partial, non-expression indexes to avoid getting
errors on the UPDATE phase of the concurrent refresh for transient
duplicates; but then I remembered that I had to give up on that and
do it all with DELETE followed by INSERT, which eliminates that
risk.  As things now stand the *existence* of any unique,
non-partial, non-expression index (note that immediate is not
needed) is sufficient for correctness.  We could now even drop that,
I think, if we added a duplicate check at the end in the absence of
such an index.

The reason I left it comparing columns from *all* such indexes is
that it gives the optimizer the chance to pick the one that looks
fastest.  With the upcoming patch that can add some extra
"equality" comparisons in addition to the "identical" comparisons
the patch uses, so the mechanism you propose might be a worthwhile
optimization for some cases as long as it does a good job of
picking *the fastest* such index.  The above method of choosing an
index doesn't seem to necessarily ensure that.

Also, if you need to include the "immediate" test, it could not be
used for RMVC without "fallback" code if this mechanism didn't find
an appropriate index.  Of course, that would satisfy those who
would like to relax the requirement for a unique index on the MV to
be able to use RMVC.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company