Thread: logical changeset generation v5
Hi!

I am rather pleased to announce the next version of the changeset extraction patchset. Thanks to help from a large number of people I think we are slowly getting to the point where it is committable.

Since the last submitted version (20121115002746.GA7692@awork2.anarazel.de) a large number of fixes and the results of a good amount of review have been added to the tree. All bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones
* arbitrarily large transaction support by spilling large transactions to disk
* spill snapshots to disk, so we can restart without waiting for a new snapshot to be built
* don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the "logical slot" interface into being "replication slots" that can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely duplicated switch
* do some more header/comment cleanup & clarification
* move pg_receivellog into its own directory in src/bin or contrib/
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One, which is primarily useful for pg_regress-style tests and playing around, is SQL based; the other uses a walsender replication connection.

A quick demonstration of the SQL interface (the server needs to be started with wal_level = logical and max_logical_slots > 0):

=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
    slotname     | xlog_position
-----------------+---------------
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);
=# INSERT INTO foo(data) VALUES(1);
=# UPDATE foo SET id = -id, data = ':'||data;
=# DELETE FROM foo;
=# DROP TABLE foo;
=# SELECT * FROM start_logical_replication('regression_slot', 'now', 'hide-xids', '0');
 location  | xid | data
-----------+-----+--------------------------------------------------------------------------------
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table "foo": INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table "foo": UPDATE: old-pkey: id[int4]:1 new-tuple: id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table "foo": DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding;
    slot_name    |    plugin     | database | active | xmin | restart_decoding_lsn
-----------------+---------------+----------+--------+------+----------------------
 regression_slot | test_decoding | 12042    | f      | 695  | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--------------------------
                        0

The walsender interface has the same calls:

INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes, and it can support synchronous replication.

The output seen in the 'data' column is produced by a so-called 'output plugin' which users of the facility can write to suit their needs.
They can be written by implementing 5 functions in the shared object that's passed to init_logical_replication() above:

* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function, pg_decode_change, gets passed a structure containing old/new versions of the row, the 'struct Relation' belonging to it, and metainformation about the transaction. The output plugin can rely on syscache lookups et al. to decode the changed tuple in whatever fashion it wants.

I'd like to invite reviewers to first look at:
* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be interesting:

+---------------+
|  XLogReader   |
+---------------+
        |
   XLOG Records
        |
        v
+---------------+
|   decode.c    |
+---------------+
     |       |
     |       v
     |  +---------------+
     |  |  snapbuild.c  |
     |  +---------------+
     |       |
HeapTupleData  catalog snapshots
     |       |
     v       v
+---------------+
|reorderbuffer.c|
+---------------+
        |
 HeapTuple & Metadata
        |
        v
+---------------+
| Output Plugin |
+---------------+
        |
 Whatever you want
        |
        v
+---------------+
| Output Handler|
|               |
| WalSnd or SRF |
+---------------+

Overview of the attached patches:
0001: indirect toast tuples; required but submitted independently
0002: functions for testing; not required
0003: (tablespace, filenode) syscache; required
0004: RelationMapFilenodeToOid: required, simple
0005: pg_relation_by_filenode() function; not required but useful
0006: Introduce InvalidCommandId: required, simple
0007: Adjust Satisfies* interface: required, mechanical
0008: Allow walsender to attach to a database: required, needs review
0009: New GetOldestXmin() parameter; required, pretty boring
0010: Log xl_running_xact regularly in the bgwriter: required
0011: make fsync_fname() public; required, needs to be in a different file
0012: Relcache support for a Relation's primary key: required
0013: Actual changeset extraction; required
0014: Output plugin demo; not required (except for testing) but useful
0015: Add pg_receivellog program: not required but useful
0016: Add test_logical_decoding extension; not required, but contains
      the tests for the feature. Uses 0014
0017: Snapshot building docs; not required

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
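To make the plugin interface above concrete, here is a bare-bones skeleton of such an output plugin. The callback signatures are assumptions reconstructed from the description in this mail (the real declarations live in the patchset's headers), so treat every parameter list, and the SketchedChange struct in particular, as illustrative rather than as the patch's actual API:

/* Illustrative output plugin skeleton.  The callback signatures below
 * are assumptions based on the description in the announcement mail
 * (old/new row versions, the Relation, transaction metainformation);
 * they are NOT the declarations from the actual patchset. */
#include "postgres.h"
#include "access/htup.h"
#include "fmgr.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

/* hypothetical stand-in for whatever struct the patch really passes */
typedef struct SketchedChange
{
	Relation	rel;		/* the table the change belongs to */
	HeapTuple	oldtuple;	/* NULL for INSERT */
	HeapTuple	newtuple;	/* NULL for DELETE */
	TransactionId xid;		/* transaction metainformation */
} SketchedChange;

void
pg_decode_init(void)		/* optional */
{
	/* allocate per-decoding-session state here */
}

void
pg_decode_begin_txn(TransactionId xid)
{
	/* emit "BEGIN" in whatever format the consumer wants */
}

void
pg_decode_change(SketchedChange *change)
{
	/* Syscache lookups et al. are safe here because decoding runs
	 * with a catalog snapshot consistent with this change. */
	elog(LOG, "change on table \"%s\"",
		 RelationGetRelationName(change->rel));
}

void
pg_decode_commit_txn(TransactionId xid)
{
	/* emit "COMMIT" */
}

void
pg_decode_cleanup(void)		/* optional */
{
	/* release state allocated in pg_decode_init() */
}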
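Similarly, a minimal sketch of driving the walsender variant from C with libpq. This too is illustrative only: the replication=database connection keyword and the result shape of INIT_LOGICAL_REPLICATION are assumptions based on patch 0008 and the SQL demo above, and consuming the actual change stream from START_LOGICAL_REPLICATION (COPY BOTH mode) is omitted; that part is what pg_receivellog implements:

/* Sketch: issuing the walsender-variant commands with libpq.
 * Connection keyword, result columns, and error handling are
 * assumptions, not verified against this patchset. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	/* assumed keyword: a walsender connection attached to a
	 * specific database (patch 0008) */
	PGconn	   *conn = PQconnectdb("dbname=postgres replication=database");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	/* assumed to return (slotname, xlog_position), mirroring the
	 * SQL-level init_logical_replication() above */
	res = PQexec(conn,
				 "INIT_LOGICAL_REPLICATION 'regression_slot' 'test_decoding'");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "INIT failed: %s", PQerrorMessage(conn));
	else
		printf("slot %s created at %s\n",
			   PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
	PQclear(res);

	/* START_LOGICAL_REPLICATION would switch the connection into
	 * COPY BOTH mode and stream changes; omitted here. */

	PQfinish(conn);
	return 0;
}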
The git tree is at:
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
> Overview of the attached patches:
> 0001: indirect toast tuples; required but submitted independently
> 0002: functions for testing; not required
> 0003: (tablespace, filenode) syscache; required
> 0004: RelationMapFilenodeToOid: required, simple
> 0005: pg_relation_by_filenode() function; not required but useful
> 0006: Introduce InvalidCommandId: required, simple
> 0007: Adjust Satisfies* interface: required, mechanical
> 0008: Allow walsender to attach to a database: required, needs review
> 0009: New GetOldestXmin() parameter; required, pretty boring
> 0010: Log xl_running_xact regularly in the bgwriter: required
> 0011: make fsync_fname() public; required, needs to be in a different file
> 0012: Relcache support for a Relation's primary key: required
> 0013: Actual changeset extraction; required
> 0014: Output plugin demo; not required (except for testing) but useful
> 0015: Add pg_receivellog program: not required but useful
> 0016: Add test_logical_decoding extension; not required, but contains
>       the tests for the feature. Uses 0014
> 0017: Snapshot building docs; not required

Version v5-01 attached

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Add-support-for-multiple-kinds-of-external-toast-dat.patch
- 0002-wal_decoding-Add-pg_xlog_wait_remote_-apply-receive-.patch
- 0003-wal_decoding-Add-a-new-RELFILENODE-syscache-to-fetch.patch
- 0004-wal_decoding-Add-RelationMapFilenodeToOid-function-t.patch
- 0005-wal_decoding-Add-pg_relation_by_filenode-to-lookup-u.patch
- 0006-wal_decoding-Introduce-InvalidCommandId-and-declare-.patch
- 0007-wal_decoding-Adjust-all-Satisfies-routines-to-take-a.patch
- 0008-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch
- 0009-wal_decoding-Add-alreadyLocked-parameter-to-GetOldes.patch
- 0010-wal_decoding-Log-xl_running_xact-s-at-a-higher-frequ.patch
- 0011-wal_decoding-copydir-make-fsync_fname-public.patch
- 0012-wal_decoding-Add-information-about-a-tables-primary-.patch
- 0013-wal_decoding-Introduce-wal-decoding-via-catalog-time.patch
- 0014-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch
- 0015-wal_decoding-pg_receivellog-Introduce-pg_receivexlog.patch
- 0016-wal_decoding-test_logical_decoding-Add-extension-for.patch
- 0017-wal_decoding-design-document-v2.4-and-snapshot-build.patch
Andres Freund <andres@2ndquadrant.com> wrote:

>> 0007: Adjust Satisfies* interface: required, mechanical

> Version v5-01 attached

I'm still working on a review and hope to post something more substantive by this weekend, but when applying patches in numeric order, this one did not compile cleanly.

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o allpaths.o allpaths.c -MMD -MP -MF .deps/allpaths.Po
vacuumlazy.c: In function ‘heap_page_is_all_visible’:
vacuumlazy.c:1725:3: warning: passing argument 1 of ‘HeapTupleSatisfiesVacuum’ from incompatible pointer type [enabled by default]
In file included from vacuumlazy.c:61:0:
../../../src/include/utils/tqual.h:84:20: note: expected ‘HeapTuple’ but argument is of type ‘HeapTupleHeader’

Could you post a new version of that?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Kevin!

On 2013-06-20 15:57:07 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>>> 0007: Adjust Satisfies* interface: required, mechanical
>> Version v5-01 attached
>
> I'm still working on a review and hope to post something more
> substantive by this weekend

Cool!

> , but when applying patches in numeric
> order, this one did not compile cleanly.
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o allpaths.o allpaths.c -MMD -MP -MF .deps/allpaths.Po
> vacuumlazy.c: In function ‘heap_page_is_all_visible’:
> vacuumlazy.c:1725:3: warning: passing argument 1 of ‘HeapTupleSatisfiesVacuum’ from incompatible pointer type [enabled by default]
> In file included from vacuumlazy.c:61:0:
> ../../../src/include/utils/tqual.h:84:20: note: expected ‘HeapTuple’ but argument is of type ‘HeapTupleHeader’
>
> Could you post a new version of that?

Hrmpf. There was one hunk in 0013 instead of 0007. I made sure that every commit again applies and compiles cleanly. git rebase -i --exec to the rescue.

Found two other issues:
* recptr not assigned in 0010
* unsafe use of a non-volatile variable across longjmp() in 0013

Pushed and attached.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Add-support-for-multiple-kinds-of-external-toast-dat.patch.gz
- 0002-wal_decoding-Add-pg_xlog_wait_remote_-apply-receive-.patch.gz
- 0003-wal_decoding-Add-a-new-RELFILENODE-syscache-to-fetch.patch.gz
- 0004-wal_decoding-Add-RelationMapFilenodeToOid-function-t.patch.gz
- 0005-wal_decoding-Add-pg_relation_by_filenode-to-lookup-u.patch.gz
- 0006-wal_decoding-Introduce-InvalidCommandId-and-declare-.patch.gz
- 0007-wal_decoding-Adjust-all-Satisfies-routines-to-take-a.patch.gz
- 0008-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch.gz
- 0009-wal_decoding-Add-alreadyLocked-parameter-to-GetOldes.patch.gz
- 0010-wal_decoding-Log-xl_running_xact-s-at-a-higher-frequ.patch.gz
- 0011-wal_decoding-copydir-make-fsync_fname-public.patch.gz
- 0012-wal_decoding-Add-information-about-a-tables-primary-.patch.gz
- 0013-wal_decoding-Introduce-wal-decoding-via-catalog-time.patch.gz
- 0014-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch.gz
- 0015-wal_decoding-pg_receivellog-Introduce-pg_receivexlog.patch.gz
- 0016-wal_decoding-test_logical_decoding-Add-extension-for.patch.gz
- 0017-wal_decoding-design-document-v2.4-and-snapshot-build.patch.gz
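The "unsafe use of a non-volatile variable across longjmp()" fix mentioned above is a classic trap in code using PG_TRY(), so a minimal illustration may be useful here. The helper names below are made up; the volatile qualifier is the whole point:

/* Illustration of the longjmp() hazard fixed in 0013: a local variable
 * modified inside PG_TRY() and read in PG_CATCH() must be qualified
 * volatile, otherwise the compiler may keep it in a register whose
 * value is indeterminate after the longjmp() back into PG_CATCH().
 * open_spill_file()/decode_records()/cleanup_spill_file() are made-up
 * helpers used only for the example. */
#include "postgres.h"

static void
decode_with_cleanup(void)
{
	volatile int spill_fd = -1;		/* volatile: survives the longjmp */

	PG_TRY();
	{
		spill_fd = open_spill_file();	/* hypothetical */
		decode_records(spill_fd);		/* hypothetical; may elog(ERROR) */
	}
	PG_CATCH();
	{
		if (spill_fd >= 0)
			cleanup_spill_file(spill_fd);	/* relies on spill_fd's value */
		PG_RE_THROW();
	}
	PG_END_TRY();
}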
Andres Freund <andres@2ndquadrant.com> wrote:

> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch
> xlog-decoding-rebasing-cf4
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
>
> On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
>> Overview of the attached patches:
>> 0001: indirect toast tuples; required but submitted independently
>> 0002: functions for testing; not required
>> 0003: (tablespace, filenode) syscache; required
>> 0004: RelationMapFilenodeToOid: required, simple
>> 0005: pg_relation_by_filenode() function; not required but useful
>> 0006: Introduce InvalidCommandId: required, simple
>> 0007: Adjust Satisfies* interface: required, mechanical
>> 0008: Allow walsender to attach to a database: required, needs review
>> 0009: New GetOldestXmin() parameter; required, pretty boring
>> 0010: Log xl_running_xact regularly in the bgwriter: required
>> 0011: make fsync_fname() public; required, needs to be in a different file
>> 0012: Relcache support for a Relation's primary key: required
>> 0013: Actual changeset extraction; required
>> 0014: Output plugin demo; not required (except for testing) but useful
>> 0015: Add pg_receivellog program: not required but useful
>> 0016: Add test_logical_decoding extension; not required, but contains
>>       the tests for the feature. Uses 0014
>> 0017: Snapshot building docs; not required
>
> Version v5-01 attached

Confirmed that all 17 patch files now apply cleanly, and that `make check-world` builds cleanly after each patch in turn. Reviewing and testing the final result now. If that all looks good, I will submit a separate review of each patch.

Simon, do you want to do the final review and commit after I do each piece?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:

> Confirmed that all 17 patch files now apply cleanly, and that `make
> check-world` builds cleanly after each patch in turn.

Just to be paranoid, I did one last build with all 17 patch files applied to 7dfd5cd21c0091e467b16b31a10e20bbedd0a836 using this line:

make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl --with-perl --with-python && make -j4 world

and it died with this:

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../src/interfaces/libpq -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_receivexlog.o pg_receivexlog.c -MMD -MP -MF .deps/pg_receivexlog.Po
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I. -I. -I../../../src/interfaces/libpq -I../../../src/bin/pg_dump -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mainloop.o mainloop.c -MMD -MP -MF .deps/mainloop.Po
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g pg_receivellog.o receivelog.o streamutil.o -L../../../src/port -lpgport -L../../../src/common -lpgcommon -L../../../src/interfaces/libpq -lpq -L../../../src/port -L../../../src/common -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/kgrittn/pg/master/Debug/lib',--enable-new-dtags -lpgport -lpgcommon -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lcrypt -ldl -lm -o pg_receivellog
gcc: error: pg_receivellog.o: No such file or directory
make[3]: *** [pg_receivellog] Error 1
make[3]: Leaving directory `/home/kgrittn/pg/master/src/bin/pg_basebackup'
make[2]: *** [all-pg_basebackup-recurse] Error 2
make[2]: *** Waiting for unfinished jobs....

It works with this patch-on-patch:

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index a41b73c..18d02f3 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -42,6 +42,7 @@ installdirs:
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
 	rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+	rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
 
 clean distclean maintainer-clean:
-	rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
+	rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o

It appears to be an omission from file 0015.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:

> uninstall:
> 	rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
> 	rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +	rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

Oops. That part is not needed.

> clean distclean maintainer-clean:
> -	rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
> +	rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o

Just that part.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@2ndquadrant.com> wrote:

> Pushed and attached.

The contrib/test_logical_decoding/sql/ddl.sql script is generating unexpected results. For both table_with_pkey and table_with_unique_not_null, updates of the primary key column are showing:

old-pkey: id[int4]:0

... instead of the expected value of 2 or -2. See attached.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>
>> Confirmed that all 17 patch files now apply cleanly, and that `make
>> check-world` builds cleanly after each patch in turn.
>
> Just to be paranoid, I did one last build with all 17 patch files
> applied to 7dfd5cd21c0091e467b16b31a10e20bbedd0a836 using this
> line:
>
> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl --with-perl --with-python && make -j4 world
>
> and it died with this:
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../src/interfaces/libpq -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_receivexlog.o pg_receivexlog.c -MMD -MP -MF .deps/pg_receivexlog.Po
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I. -I. -I../../../src/interfaces/libpq -I../../../src/bin/pg_dump -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mainloop.o mainloop.c -MMD -MP -MF .deps/mainloop.Po
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g pg_receivellog.o receivelog.o streamutil.o -L../../../src/port -lpgport -L../../../src/common -lpgcommon -L../../../src/interfaces/libpq -lpq -L../../../src/port -L../../../src/common -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/kgrittn/pg/master/Debug/lib',--enable-new-dtags -lpgport -lpgcommon -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lcrypt -ldl -lm -o pg_receivellog
> gcc: error: pg_receivellog.o: No such file or directory
> make[3]: *** [pg_receivellog] Error 1
> make[3]: Leaving directory `/home/kgrittn/pg/master/src/bin/pg_basebackup'
> make[2]: *** [all-pg_basebackup-recurse] Error 2
> make[2]: *** Waiting for unfinished jobs....

I have seen that once as well. It's really rather strange since pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't reproduce it reliably though, even after doing some dozen rebuilds or so.

> It works with this patch-on-patch:
> diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
> index a41b73c..18d02f3 100644
> --- a/src/bin/pg_basebackup/Makefile
> +++ b/src/bin/pg_basebackup/Makefile
> @@ -42,6 +42,7 @@ installdirs:
>  uninstall:
>  	rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
>  	rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +	rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>
>  clean distclean maintainer-clean:
> -	rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
> +	rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
>
> It appears to be an omission from file 0015.

Yes, both are missing.

>> + rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
> Oops. That part is not needed.

Hm. Why not?

I don't think either hunk has anything to do with that build failure though - can you reproduce the error without?

Thanks,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
>> Pushed and attached.
>
> The contrib/test_logical_decoding/sql/ddl.sql script is generating
> unexpected results. For both table_with_pkey and
> table_with_unique_not_null, updates of the primary key column are
> showing:
>
> old-pkey: id[int4]:0
>
> ... instead of the expected value of 2 or -2.
>
> See attached.

Hm. Any chance this was an incomplete rebuild? I seem to remember having seen that once because some header dependency wasn't recognized correctly after applying some patch.

Otherwise, could you give me:
* the version you applied the patch on
* os/compiler

Because I can't reproduce it, despite some playing around...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>> gcc: error: pg_receivellog.o: No such file or directory
>> make[3]: *** [pg_receivellog] Error 1

> I have seen that once as well. It's really rather strange since
> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> reproduce it reliably though, even after doing some dozen rebuilds or so.

What versions of gmake are you guys using? It wouldn't be the first time we've tripped over bugs in parallel make. See for instance
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447

			regards, tom lane
On 2013-06-23 16:48:41 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>>> gcc: error: pg_receivellog.o: No such file or directory
>>> make[3]: *** [pg_receivellog] Error 1
>
>> I have seen that once as well. It's really rather strange since
>> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
>> reproduce it reliably though, even after doing some dozen rebuilds or so.
>
> What versions of gmake are you guys using? It wouldn't be the first
> time we've tripped over bugs in parallel make. See for instance
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447

3.81 here. That was supposed to be the "safe" one, right? At least with respect to the bugs seen/fixed recently.

Kevin, any chance you still have more log than in the upthread mail available?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>
>> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug
>> --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl
>> --with-perl --with-python && make -j4 world
>
>> [ build failure referencing pg_receivellog.o ]
>
> I have seen that once as well. It's really rather strange since
> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
> reproduce it reliably though, even after doing some dozen rebuilds or so.
>
>> It works with this patch-on-patch:
>
>> clean distclean maintainer-clean:
>> - rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o
>> pg_receivexlog.o pg_receivellog.o
>> + rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS)
>> pg_basebackup.o pg_receivexlog.o pg_receivellog.o
>
>>> + rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>> Oops. That part is not needed.
>
> Hm. Why not?

Well, I could easily be wrong on just about anything to do with make files, but on a second look that appeared to be dealing with eliminating an installed pg_receivellog binary, which is not created.

> I don't think either hunk has anything to do with that build failure
> though - can you reproduce the error without?

I tried that scenario three times and it failed three times. Then I made the above changes and it worked. Then I eliminated the one on the uninstall target and tried a couple more times, and it worked on both attempts. The scenario is to have a `make world` build in the source tree, and run the above line starting with `make maintainer-clean` and going to `make -j4 world`.

I did notice that without that change to the maintainer-clean target I did not get a pg_receivellog.Po file in src/bin/pg_basebackup/.deps/ -- and with it I do.

I admit to being at about a 1.5 on a 10 point scale of make file competence -- I just look for patterns used for things similar to what I want to do and copy without much understanding of what it all means. :-( So when I got an error on pg_receivellog which didn't happen on pg_receivexlog, I looked for differences -- my suggestion has no more basis than that and the fact that empirical testing seemed to show that it worked.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 16:48:41 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>>>> gcc: error: pg_receivellog.o: No such file or directory
>>>> make[3]: *** [pg_receivellog] Error 1
>>
>>> I have seen that once as well. It's really rather strange since
>>> pg_receivellog.o is a clear prerequisite for pg_receivellog. I
>>> couldn't reproduce it reliably though, even after doing some
>>> dozen rebuilds or so.
>>
>> What versions of gmake are you guys using? It wouldn't be the
>> first time we've tripped over bugs in parallel make. See for
>> instance
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1fc698cf14d17a3a8ad018cf9ec100198a339447
>
> 3.81 here. That was supposed to be the "safe" one, right? At
> least with respect to the bugs seen/fixed recently.

There is no executable named gmake in my distro, but...

kgrittn@Kevin-Desktop:~/pg/master$ make --version
GNU Make 3.81

Which is what I'm using.

> Kevin, any chance you still have more log than in the upthread
> mail available?

Well, I just copied from the console, and that was gone; but reverting my change I get the same thing. All console output attached. Let me know if you need something else.

Note that the dependency file disappeared:

kgrittn@Kevin-Desktop:~/pg/master$ ll src/bin/pg_basebackup/.deps/
total 24
drwxrwxr-x 2 kgrittn kgrittn 4096 Jun 24 08:57 ./
drwxrwxr-x 4 kgrittn kgrittn 4096 Jun 24 08:57 ../
-rw-rw-r-- 1 kgrittn kgrittn 1298 Jun 24 08:57 pg_basebackup.Po
-rw-rw-r-- 1 kgrittn kgrittn 1729 Jun 24 08:57 pg_receivexlog.Po
-rw-rw-r-- 1 kgrittn kgrittn 1646 Jun 24 08:57 receivelog.Po
-rw-rw-r-- 1 kgrittn kgrittn  953 Jun 24 08:57 streamutil.Po

It was there from the build with the change I made to the maintainer-clean target, and went away when I built without it.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:
>
>> The contrib/test_logical_decoding/sql/ddl.sql script is generating
>> unexpected results. For both table_with_pkey and
>> table_with_unique_not_null, updates of the primary key column are
>> showing:
>>
>> old-pkey: id[int4]:0
>>
>> ... instead of the expected value of 2 or -2.
>>
>> See attached.
>
> Hm. Any chance this was an incomplete rebuild?

With my hack on the pg_basebackup Makefile, `make -j4 world` is finishing with no errors and:

PostgreSQL, contrib, and documentation successfully made. Ready to install.

> I seem to remember having seen that once because some header
> dependency wasn't recognized correctly after applying some patch.

I wonder whether this is related to the build problems we've been discussing on the other fork of this thread. I was surprised to see this error when I got past the maintainer-clean full build problems, because I thought I had seen clean `make check-world` regression tests after applying each incremental patch file. Until I read this I had been assuming that somehow I missed the error on the 16th and 17th iterations; but now I'm suspecting that I didn't miss anything after all -- it may just be another symptom of the build problems.

> Otherwise, could you give me:
> * the version you applied the patch on

7dfd5cd21c0091e467b16b31a10e20bbedd0a836

> * os/compiler

Linux Kevin-Desktop 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2

> Because I can't reproduce it, despite some playing around...

Maybe if you can reproduce the build problems I'm seeing....

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-06-24 06:44:53 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>
>>> make maintainer-clean ; ./configure --prefix=$PWD/Debug --enable-debug
>>> --enable-cassert --enable-depend --with-libxml --with-libxslt --with-openssl
>>> --with-perl --with-python && make -j4 world
>
>>> [ build failure referencing pg_receivellog.o ]
>
>> I have seen that once as well. It's really rather strange since
>> pg_receivellog.o is a clear prerequisite for pg_receivellog. I couldn't
>> reproduce it reliably though, even after doing some dozen rebuilds or so.
>>
>>> It works with this patch-on-patch:
>
>>> clean distclean maintainer-clean:
>>> - rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o
>>> pg_receivexlog.o pg_receivellog.o
>>> + rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS)
>>> pg_basebackup.o pg_receivexlog.o pg_receivellog.o
>
>>>> + rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>>> Oops. That part is not needed.
>>
>> Hm. Why not?
>
> Well, I could easily be wrong on just about anything to do with
> make files, but on a second look that appeared to be dealing with
> eliminating an installed pg_receivellog binary, which is not
> created.

I think it actually is?

install: all installdirs
	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
	$(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

>> I don't think either hunk has anything to do with that build failure
>> though - can you reproduce the error without?
>
> I tried that scenario three times and it failed three times. Then
> I made the above changes and it worked. Then I eliminated the one
> on the uninstall target and tried a couple more times, and it worked
> on both attempts. The scenario is to have a `make world` build in
> the source tree, and run the above line starting with `make
> maintainer-clean` and going to `make -j4 world`.

Hm. I think it might be something in make's intermediate target logic biting us. Anyway, if the patch fixes that: Great ;). Merged it locally since it's obviously missing.

> I did notice that without that change to the maintainer-clean
> target I did not get a pg_receivellog.Po file in
> src/bin/pg_basebackup/.deps/ -- and with it I do.

Yea, according to your log it's not even built before pg_receivellog is linked.

Thanks,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-24 06:44:53 -0700, Kevin Grittner wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-06-23 08:27:32 -0700, Kevin Grittner wrote:
>>>>> + rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
>>>> Oops. That part is not needed.
>>>
>>> Hm. Why not?
>>
>> Well, I could easily be wrong on just about anything to do with
>> make files, but on a second look that appeared to be dealing with
>> eliminating an installed pg_receivellog binary, which is not
>> created.
>
> I think it actually is?

Oh, yeah.... I see it now. I warned you I could be wrong. :-/

I just had a thought -- perhaps the dependency information is being calculated incorrectly. Attached is the dependency file from the successful build (with the adjusted Makefile), which still fails the test_logical_decoding regression test, with the same diff.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 2013-06-24 07:29:43 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-06-23 10:32:05 -0700, Kevin Grittner wrote:
>
>>> The contrib/test_logical_decoding/sql/ddl.sql script is generating
>>> unexpected results. For both table_with_pkey and
>>> table_with_unique_not_null, updates of the primary key column are
>>> showing:
>>>
>>> old-pkey: id[int4]:0
>>>
>>> ... instead of the expected value of 2 or -2.
>>>
>>> See attached.
>>
>> Hm. Any chance this was an incomplete rebuild?

Hm. There were some issues with the test_logical_decoding Makefile not cleaning up the regression installation properly. Which might have caused the issue.

Could you try after applying the patches and executing a clean and then rebuild?

Otherwise, could you try applying my git tree so we are sure we test the same thing?

$ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
$ git fetch af
$ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
$ ./configure ...
$ make

>> Because I can't reproduce it, despite some playing around...
>
> Maybe if you can reproduce the build problems I'm seeing....

Tried your recipe but still couldn't...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
I'm looking at the combined patches 0003-0005, which are essentially all about adding a function to obtain a relation OID from (tablespace, filenode). It takes care to look through the relation mapper, and uses a new syscache underneath for performance.

One question about this patch, originally, was about the usage of that relfilenode syscache. It is questionable because it would be the only syscache to apply on top of a non-unique index. It is said that this doesn't matter because the only non-unique values that can exist would reference entries that have relfilenode = 0; and in turn this doesn't matter because those values would be queried through the relation mapper anyway, not from the syscache. (This is implemented in the higher-level function.)

This means that there would be one syscache that is damn easy to misuse .. and we've set up things so that syscaches are very easy to use in the first place. From that perspective, this doesn't look good. However, it's an easy mistake to notice and fix, so perhaps this is not a serious problem. (I would much prefer for there to be a way to define partial indexes in BKI.)

I'm not sure about the placing of the new SQL-callable function in dbsize.c either. It is certainly not a function that has anything to do with object sizes. The insides of it would belong more in lsyscache.c, I think, except then that file does not otherwise concern itself with the relation mapper, so its scope would have to expand a bit. But this is no place for the SQL-callable portion, so that would have to find a different home as well.

The other option, of course, is to provide a separate caching layer for these objects altogether, but given how concise this implementation is, it doesn't sound too palatable.

Thoughts?

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
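For readers following along, the control flow Alvaro describes might be sketched like this. The names follow the patch titles (the RELFILENODE syscache from 0003, RelationMapFilenodeToOid() from 0004), but the body is reconstructed from the thread's description, not copied from the patches, and the RelationMapFilenodeToOid() signature is an assumption:

/* Sketch of the higher-level lookup: mapped relations (which store
 * relfilenode = 0 in pg_class) are resolved via the relation mapper;
 * everything else goes through the new syscache.  Reconstructed from
 * this thread's description, not lifted from the patches. */
#include "postgres.h"
#include "catalog/pg_tablespace.h"
#include "utils/relmapper.h"
#include "utils/syscache.h"

Oid
relation_oid_by_filenode(Oid reltablespace, Oid relfilenode)
{
	Oid			relid;

	/* shared catalogs live in pg_global; everything else is per-DB */
	relid = RelationMapFilenodeToOid(relfilenode,
									 reltablespace == GLOBALTABLESPACE_OID);
	if (OidIsValid(relid))
		return relid;

	/* Not mapped: a single syscache probe.  Note this is never called
	 * with relfilenode = 0, the only value that is non-unique
	 * underneath the cache's index, which is the misuse hazard Alvaro
	 * describes. */
	return GetSysCacheOid2(RELFILENODE,
						   ObjectIdGetDatum(reltablespace),
						   ObjectIdGetDatum(relfilenode));
}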
Hi,

On 2013-06-27 17:33:04 -0400, Alvaro Herrera wrote:
> One question about this patch, originally, was about the usage of
> that relfilenode syscache. It is questionable because it would be the
> only syscache to apply on top of a non-unique index. It is said that
> this doesn't matter because the only non-unique values that can exist
> would reference entries that have relfilenode = 0; and in turn this
> doesn't matter because those values would be queried through the
> relation mapper anyway, not from the syscache. (This is implemented in
> the higher-level function.)

Well, you can even query the syscache without hurt for mapped relations, you just won't get an answer. The only thing you may not do, because it would yield multiple results, is to query the syscache with (tablespace, InvalidOid/0). Which is still not nice, although it doesn't make much sense to query with InvalidOid.

> I'm not sure about the placing of the new SQL-callable function in
> dbsize.c either. It is certainly not a function that has anything to do
> with object sizes.

Not happy with that myself. I only placed the function there because pg_relation_filenode() already was in it. Happy to change if somebody has a good idea.

> (I would much prefer for there to be a way to define partial
> indexes in BKI.)

I don't think that's the hard part; it's that we don't use the full machinery for updating indexes but rather the relatively simplistic CatalogUpdateIndexes(). I am not sure we can guarantee that the required infrastructure is available in all the cases to support doing generic predicate evaluation. Should bki really be the problem, we probably could create the index after bki-based bootstrapping finished.

Thanks,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2013-06-27 17:33:04 -0400, Alvaro Herrera wrote:
>> One question about this patch, originally, was about the usage of
>> that relfilenode syscache. It is questionable because it would be the
>> only syscache to apply on top of a non-unique index. It is said that
>> this doesn't matter because the only non-unique values that can exist
>> would reference entries that have relfilenode = 0; and in turn this
>> doesn't matter because those values would be queried through the
>> relation mapper anyway, not from the syscache. (This is implemented in
>> the higher-level function.)
>
> Well, you can even query the syscache without hurt for mapped relations,
> you just won't get an answer. The only thing you may not do, because it
> would yield multiple results, is to query the syscache with
> (tablespace, InvalidOid/0). Which is still not nice, although it doesn't
> make much sense to query with InvalidOid.

Yeah, I agree that it doesn't make sense to query for that. The problem is that something could reasonably be developed that uses the syscache directly without checking whether the relfilenode is 0.

>> (I would much prefer for there to be a way to define partial
>> indexes in BKI.)
>
> I don't think that's the hard part; it's that we don't use the full
> machinery for updating indexes but rather the relatively simplistic
> CatalogUpdateIndexes(). I am not sure we can guarantee that the required
> infrastructure is available in all the cases to support doing generic
> predicate evaluation.

You're right, CatalogIndexInsert() doesn't allow for predicates, so fixing BKI would not help.

I still wonder about having a separate cache. Right now pg_class has two indexes; adding this new one would mean a rather large decrease in insert performance (50% more indexes to update than previously), which is not good considering that it's inserted into for each and every temp table creation -- that would become slower. This would be a net loss for every user, even those that don't want logical replication. On the other hand, table creation also has to add tuples to pg_attribute, pg_depend, pg_shdepend and maybe other catalogs, so perhaps the difference is negligible.

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I'm looking at the combined patches 0003-0005, which are essentially all
> about adding a function to obtain relation OID from (tablespace,
> filenode). It takes care to look through the relation mapper, and uses
> a new syscache underneath for performance.

> One question about this patch, originally, was about the usage of
> that relfilenode syscache. It is questionable because it would be the
> only syscache to apply on top of a non-unique index.

... which, I assume, is on top of a pg_class index that doesn't exist today. Exactly what is the argument that says performance of this function is sufficiently critical to justify adding both the maintenance overhead of a new pg_class index, *and* a broken-by-design syscache?

Lose the cache and this probably gets a lot easier to justify. As is, I think I'd vote to reject altogether.

			regards, tom lane
On Thu, Jun 27, 2013 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I'm looking at the combined patches 0003-0005, which are essentially all
>> about adding a function to obtain relation OID from (tablespace,
>> filenode). It takes care to look through the relation mapper, and uses
>> a new syscache underneath for performance.
>
>> One question about this patch, originally, was about the usage of
>> that relfilenode syscache. It is questionable because it would be the
>> only syscache to apply on top of a non-unique index.
>
> ... which, I assume, is on top of a pg_class index that doesn't exist
> today. Exactly what is the argument that says performance of this
> function is sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?
>
> Lose the cache and this probably gets a lot easier to justify. As is,
> I think I'd vote to reject altogether.

I already voted that way, and nothing's happened since to change my mind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@2ndquadrant.com> wrote:

> Hm. There were some issues with the test_logical_decoding
> Makefile not cleaning up the regression installation properly.
> Which might have caused the issue.
>
> Could you try after applying the patches and executing a clean
> and then rebuild?

Tried, and problem persists.

> Otherwise, could you try applying my git tree so we are sure we
> test the same thing?
>
> $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> $ git fetch af
> $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> $ ./configure ...
> $ make

Tried that, too, and problem persists. The log shows the last commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

Because you mention possible problems with the regression test cleanup for test_logical_decoding I also tried:

rm -fr contrib/test_logical_decoding/
git reset --hard HEAD
make world
make check-world

I get the same failure, with the primary key or unique index column showing as 0 in the results.

I am off on vacation tomorrow and next week. Will dig into this with gdb if not solved when I get back -- unless you have a better suggestion for how to figure it out.

Once this is solved, I will be working with testing the final result of all these layers, including creating a second output plugin. I want to confirm that multiple plugins play well together. I'm glad to see other eyes also on this patch set.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-06-27 18:18:50 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I'm looking at the combined patches 0003-0005, which are essentially all
>> about adding a function to obtain relation OID from (tablespace,
>> filenode). It takes care to look through the relation mapper, and uses
>> a new syscache underneath for performance.
>
>> One question about this patch, originally, was about the usage of
>> that relfilenode syscache. It is questionable because it would be the
>> only syscache to apply on top of a non-unique index.
>
> ... which, I assume, is on top of a pg_class index that doesn't exist
> today. Exactly what is the argument that says performance of this
> function is sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?

Ok, so this requires some context. When we do the changeset extraction we build an mvcc snapshot that, for every heap wal record, is consistent with one made at the time the record was inserted.

Then, when we've built that snapshot, we can use it to turn heap wal records into the representation the user wants: For that we first need to know which table a change comes from, since otherwise we obviously cannot interpret the HeapTuple that's essentially contained in the wal record. Since we have a correct mvcc snapshot we can query pg_class for (tablespace, relfilenode) to get back the relation.

When we know the relation, the user (i.e. the output plugin) can use normal backend code to transform the HeapTuple into the target representation, e.g. SQL, since we can build a TupleDesc. Since the syscaches are synchronized with the built snapshot, normal output functions can be used.

What that means is that for every heap record in the target database in the WAL we need to query pg_class to turn the relfilenode into a pg_class.oid. So, we can easily replace syscache.c with some custom caching code, but I don't think it's realistic to get rid of that index. Otherwise we need to cache the entire pg_class in memory, which doesn't sound enticing.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
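To illustrate what "normal backend code" buys once the Relation is in hand, this is roughly how a plugin can render a tuple as text. It is a sketch of the principle, not test_decoding's actual code:

/* Sketch: once (tablespace, relfilenode) has been resolved to a
 * Relation, rendering the tuple is ordinary backend code.  This
 * illustrates the principle; it is not test_decoding's actual code. */
#include "postgres.h"
#include "access/htup_details.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

static void
tuple_to_text(StringInfo out, Relation rel, HeapTuple tuple)
{
	TupleDesc	desc = RelationGetDescr(rel);
	int			i;

	for (i = 0; i < desc->natts; i++)
	{
		Form_pg_attribute attr = desc->attrs[i];
		Datum		val;
		bool		isnull;
		Oid			outfunc;
		bool		isvarlena;

		if (attr->attisdropped)
			continue;

		/* e.g. " id[int4]:" */
		appendStringInfo(out, " %s[%s]:",
						 NameStr(attr->attname),
						 format_type_be(attr->atttypid));

		val = heap_getattr(tuple, i + 1, desc, &isnull);
		if (isnull)
			appendStringInfoString(out, "(null)");
		else
		{
			/* works because decoding installed a catalog snapshot
			 * consistent with the WAL record being decoded */
			getTypeOutputInfo(attr->atttypid, &outfunc, &isvarlena);
			appendStringInfoString(out,
								   OidOutputFunctionCall(outfunc, val));
		}
	}
}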
On Fri, Jun 28, 2013 at 3:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> What that means is that for every heap record in the target database in
> the WAL we need to query pg_class to turn the relfilenode into a
> pg_class.oid. So, we can easily replace syscache.c with some custom
> caching code, but I don't think it's realistic to get rid of that
> index. Otherwise we need to cache the entire pg_class in memory which
> doesn't sound enticing.

The alternative I previously proposed was to make the WAL records carry the relation OID. There are a few problems with that: one is that it's a waste of space when logical replication is turned off, and it might not be easy to only do it when logical replication is on. Also, even when logical replication is turned on, things that make WAL bigger aren't wonderful. On the other hand, it does avoid the overhead of another index on pg_class.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
> On Fri, Jun 28, 2013 at 3:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> What that means is that for every heap record in the target database in
>> the WAL we need to query pg_class to turn the relfilenode into a
>> pg_class.oid. So, we can easily replace syscache.c with some custom
>> caching code, but I don't think it's realistic to get rid of that
>> index. Otherwise we need to cache the entire pg_class in memory which
>> doesn't sound enticing.
>
> The alternative I previously proposed was to make the WAL records
> carry the relation OID. There are a few problems with that: one is
> that it's a waste of space when logical replication is turned off, and
> it might not be easy to only do it when logical replication is on.
> Also, even when logical replication is turned on, things that make WAL
> bigger aren't wonderful. On the other hand, it does avoid the
> overhead of another index on pg_class.

I personally favor making catalog modifications a bit more expensive instead of increasing the WAL volume during routine operations. I don't think index maintenance itself comes close to the biggest cost for DDL we have atm. It also increases the modifications needed to important heap_* functions, which doesn't make me happy.

How do others see this tradeoff?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> Tried that, too, and problem persists. The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

> I get the same failure, with primary key or unique index column
> showing as 0 in results.

I have run enough iterations of the test suite locally now that I am confident it's not just happenstance that I don't see this :/. I am going to clone your environment as closely as I can to see where the issue might be, as well as going over those codepaths...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6/28/13 8:46 AM, Andres Freund wrote:
> I personally favor making catalog modifications a bit more
> expensive instead of increasing the WAL volume during routine
> operations. I don't think index maintenance itself comes close to the
> biggest cost for DDL we have atm.

That makes sense to me in principle.
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
>> The alternative I previously proposed was to make the WAL records
>> carry the relation OID. There are a few problems with that: one is
>> that it's a waste of space when logical replication is turned off, and
>> it might not be easy to only do it when logical replication is on.
>> Also, even when logical replication is turned on, things that make WAL
>> bigger aren't wonderful. On the other hand, it does avoid the
>> overhead of another index on pg_class.

> I personally favor making catalog modifications a bit more
> expensive instead of increasing the WAL volume during routine
> operations.

This argument is nonsense, since it conveniently ignores the added WAL entries created as a result of additional pg_class index manipulations.

Robert's idea sounds fairly reasonable to me; another 4 bytes per insert/update/delete WAL entry isn't that big a deal, and it would probably ease many debugging tasks as well as what you want to do. So I'd vote for including the rel OID all the time, not conditionally.

The real performance argument against the patch as you have it is that it saddles every PG installation with extra overhead for pg_class updates whether or not that installation ever has or ever will make use of changeset generation --- unlike including rel OIDs in WAL entries, which might be merely difficult to handle conditionally, it's flat-out impossible to turn such an index on or off. Moreover, even if one is using changeset generation, the overhead is being imposed at the wrong place, ie the master not the slave doing changeset extraction.

But that's not the only problem, nor even the worst one IMO. I said before that a syscache with a non-unique key is broken by design, and I stand by that estimate. Even assuming that this usage doesn't create bugs in the code as it stands, it might well foreclose future changes or optimizations that we'd like to make in the catcache code.

If you don't want to change WAL contents, what I think you should do is create a new cache mechanism (perhaps by extending the relmapper) that caches relfilenode to OID lookups and acts entirely inside the changeset-generating slave. Hacking up the catcache instead of doing that is an expedient kluge that will come back to bite us.

			regards, tom lane
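For concreteness, the kind of backend-private cache Tom suggests could be sketched on top of dynahash along these lines. All names here are hypothetical, and cache invalidation (flushing entries when pg_class changes) is waved away in a comment:

/* Sketch of a private relfilenode -> OID cache local to the decoding
 * backend, instead of a syscache.  Hypothetical names throughout;
 * invalidation handling is omitted. */
#include "postgres.h"
#include "utils/hsearch.h"

typedef struct RelfilenodeMapKey
{
	Oid			reltablespace;
	Oid			relfilenode;
} RelfilenodeMapKey;

typedef struct RelfilenodeMapEntry
{
	RelfilenodeMapKey key;		/* hash key; must be first */
	Oid			relid;
} RelfilenodeMapEntry;

static HTAB *RelfilenodeMapHash = NULL;

Oid
lookup_relid_by_filenode(Oid reltablespace, Oid relfilenode)
{
	RelfilenodeMapKey key;
	RelfilenodeMapEntry *entry;
	bool		found;

	if (RelfilenodeMapHash == NULL)
	{
		HASHCTL		ctl;

		MemSet(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(RelfilenodeMapKey);
		ctl.entrysize = sizeof(RelfilenodeMapEntry);
		ctl.hash = tag_hash;
		RelfilenodeMapHash =
			hash_create("RelfilenodeMap", 1024, &ctl,
						HASH_ELEM | HASH_FUNCTION);
	}

	MemSet(&key, 0, sizeof(key));
	key.reltablespace = reltablespace;
	key.relfilenode = relfilenode;

	entry = hash_search(RelfilenodeMapHash, &key, HASH_ENTER, &found);
	if (!found)
	{
		/* Cache miss: fall back to the relation mapper or a scan of
		 * pg_class (hypothetical helper).  A real implementation would
		 * also register invalidation callbacks to flush entries when
		 * pg_class changes. */
		entry->relid = scan_pg_class_for_filenode(reltablespace,
												  relfilenode);
	}
	return entry->relid;
}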
On 2013-06-28 10:49:26 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-06-28 08:41:46 -0400, Robert Haas wrote:
>>> The alternative I previously proposed was to make the WAL records
>>> carry the relation OID. There are a few problems with that: one is
>>> that it's a waste of space when logical replication is turned off, and
>>> it might not be easy to only do it when logical replication is on.
>>> Also, even when logical replication is turned on, things that make WAL
>>> bigger aren't wonderful. On the other hand, it does avoid the
>>> overhead of another index on pg_class.
>
>> I personally favor making catalog modifications a bit more
>> expensive instead of increasing the WAL volume during routine
>> operations.
>
> This argument is nonsense, since it conveniently ignores the added WAL
> entries created as a result of additional pg_class index manipulations.

Huh? Sure, pg_class manipulations get more expensive. But in most clusters pg_class modifications are by far the minority compared to the rest of the updates performed.

> Robert's idea sounds fairly reasonable to me; another 4 bytes per
> insert/update/delete WAL entry isn't that big a deal, and it would
> probably ease many debugging tasks as well as what you want to do.
> So I'd vote for including the rel OID all the time, not conditionally.

Ok, I can sure live with that. I don't think it's a problem to make it conditional if we want to. Making it unconditional would sure make WAL debugging in general more pleasant though.

> The real performance argument against the patch as you have it is that
> it saddles every PG installation with extra overhead for pg_class
> updates whether or not that installation ever has or ever will make use
> of changeset generation --- unlike including rel OIDs in WAL entries,
> which might be merely difficult to handle conditionally, it's flat-out
> impossible to turn such an index on or off. Moreover, even if one is
> using changeset generation, the overhead is being imposed at the wrong
> place, ie the master not the slave doing changeset extraction.

There are no required slaves for doing changeset extraction anymore. Various people opposed that pretty violently, so it's now all happening on the master. Which IMHO turned out to be the right decision. We can do it on Hot Standby nodes, but it's absolutely not required.

> But that's not the only problem, nor even the worst one IMO. I said
> before that a syscache with a non-unique key is broken by design, and
> I stand by that estimate. Even assuming that this usage doesn't create
> bugs in the code as it stands, it might well foreclose future changes or
> optimizations that we'd like to make in the catcache code.

Since the only duplicate key that can possibly occur in that cache is InvalidOid, I wondered whether we could define a 'filter' that prohibits those ending up in the cache? Then the cache would be unique.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 28, 2013 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert's idea sounds fairly reasonable to me; another 4 bytes per
> insert/update/delete WAL entry isn't that big a deal, ...

How big a deal is it? This is a serious question, because I don't know. Let's suppose that the average size of an XLOG_HEAP_INSERT record is 100 bytes. Then if we add 4 bytes, isn't that a 4% overhead? And doesn't that seem significant? I'm just talking out of my rear end here because I don't know what the real numbers are, but it's far from obvious to me that there's any free lunch here.

That having been said, just because indexing relfilenode or adding relfilenodes to WAL records is expensive doesn't mean we shouldn't do it. But I think we need to know the price tag before we can judge whether to make the purchase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas escribió:
> On Fri, Jun 28, 2013 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert's idea sounds fairly reasonable to me; another 4 bytes per
> > insert/update/delete WAL entry isn't that big a deal, ...
>
> How big a deal is it? This is a serious question, because I don't
> know. Let's suppose that the average size of an XLOG_HEAP_INSERT
> record is 100 bytes. Then if we add 4 bytes, isn't that a 4%
> overhead? And doesn't that seem significant?

An INSERT wal record is:

    typedef struct xl_heap_insert
    {
        xl_heaptid target;          /* inserted tuple id */
        bool all_visible_cleared;   /* PD_ALL_VISIBLE was cleared */
        /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
    } xl_heap_insert;

    typedef struct xl_heap_header
    {
        uint16 t_infomask2;
        uint16 t_infomask;
        uint8  t_hoff;
    } xl_heap_header;

So the fixed part is just 7 bytes + 5 bytes; tuple data follows that. So adding four more bytes could indeed be significant (but by how much depends on the size of the tuple data).

Adding a new pg_class index would be larger in the sense that there are more WAL records, and there's the extra vacuuming traffic; but on the other hand that would only happen when tables are created. It seems safe to assume that in normal use cases the ratio of tuple insertion vs. table creation is large.

The only idea that springs to mind is to have the new pg_class index be created conditionally, i.e. only when logical replication is going to be used.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera escribió:

> An INSERT wal record is:
>
> typedef struct xl_heap_insert
> {
>     xl_heaptid target;          /* inserted tuple id */
>     bool all_visible_cleared;   /* PD_ALL_VISIBLE was cleared */
>     /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
> } xl_heap_insert;

Oops. xl_heaptid is not 6 bytes, but instead:

    typedef struct xl_heaptid
    {
        RelFileNode node;
        ItemPointerData tid;
    } xl_heaptid;

    typedef struct RelFileNode
    {
        Oid spcNode;
        Oid dbNode;
        Oid relNode;
    } RelFileNode;              /* 12 bytes */

    typedef struct ItemPointerData
    {
        BlockIdData ip_blkid;
        OffsetNumber ip_posid;
    };                          /* 6 bytes */

    typedef struct BlockIdData
    {
        uint16 bi_hi;
        uint16 bi_lo;
    } BlockIdData;              /* 4 bytes */

    typedef uint16 OffsetNumber;

There's purposely no alignment padding anywhere, so xl_heaptid totals 18 bytes. Therefore,

> So the fixed part is just 19 bytes + 5 bytes; tuple data follows that.
> So adding four more bytes could indeed be significant (but by how much
> depends on the size of the tuple data).

4 extra bytes on top of 24 is 17% of added overhead (considering only the fixed part of the record, not the tuple data.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> I'm just talking out of my rear end here because I don't know what the
> real numbers are, but it's far from obvious to me that there's any
> free lunch here. That having been said, just because indexing
> relfilenode or adding relfilenodes to WAL records is expensive doesn't
> mean we shouldn't do it. But I think we need to know the price tag
> before we can judge whether to make the purchase.

Certainly, any of these solutions are going to cost us somewhere --- either up-front cost or more expensive (and less reliable?) changeset extraction, take your choice. I will note that somehow tablespaces got put in despite having to add 4 bytes to every WAL record for that feature, which was probably of less use than logical changeset extraction will be.

But to tell the truth, I'm mostly exercised about the non-unique syscache. I think that's simply a *bad* idea.

			regards, tom lane
On Fri, Jun 28, 2013 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm just talking out of my rear end here because I don't know what the
>> real numbers are, but it's far from obvious to me that there's any
>> free lunch here. That having been said, just because indexing
>> relfilenode or adding relfilenodes to WAL records is expensive doesn't
>> mean we shouldn't do it. But I think we need to know the price tag
>> before we can judge whether to make the purchase.
>
> Certainly, any of these solutions are going to cost us somewhere ---
> either up-front cost or more expensive (and less reliable?) changeset
> extraction, take your choice. I will note that somehow tablespaces got
> put in despite having to add 4 bytes to every WAL record for that
> feature, which was probably of less use than logical changeset
> extraction will be.

Right. I actually think we booted that one. The database ID is a constant for most people. The tablespace ID is not technically redundant, but in 99.99% of cases you could figure it out from the database ID + relation ID. The relation ID is where 99% of the entropy is, but it probably only has 8-16 bits of entropy in most real-world use cases. If we were doing this over we might want to think about storing a proxy for the relfilenode rather than the relfilenode itself, but there's not much good crying over it now.

> But to tell the truth, I'm mostly exercised about the non-unique
> syscache. I think that's simply a *bad* idea.

+1.

I don't think the extra index on pg_class is going to hurt that much, even if we create it always, as long as we use a purpose-built caching mechanism for it rather than forcing it through catcache. The people who are going to suffer are the ones who create and drop a lot of temporary tables, but even there I'm not sure how visible the overhead will be on real-world workloads, and maybe the solution is to work towards not having permanent catalog entries for temporary tables in the first place. In any case, hurting people who use temporary tables heavily seems better than adding overhead to every insert/update/delete operation, which will hit all users who are not read-only.

On the other hand, I can't entirely shake the feeling that adding the information into WAL would be more reliable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On the other hand, I can't entirely shake the feeling that adding the
> information into WAL would be more reliable.

That feeling has been nagging at me too. I can't demonstrate that there's a problem when an ALTER TABLE is in process of rewriting a table into a new relfilenode number, but I don't have a warm fuzzy feeling about the reliability of reverse lookups for this. At the very least it's going to require some hard-to-verify restriction about how we can't start doing changeset reconstruction in the middle of a transaction that's doing DDL.

			regards, tom lane
On 28 June 2013 17:10, Robert Haas <robertmhaas@gmail.com> wrote:
>> But to tell the truth, I'm mostly exercised about the non-unique
>> syscache. I think that's simply a *bad* idea.
>
> +1.
> I don't think the extra index on pg_class is going to hurt that much,
> even if we create it always, as long as we use a purpose-built caching
> mechanism for it rather than forcing it through catcache.
Hmm, does seem like that would be better.
> The people
> who are going to suffer are the ones who create and drop a lot of
> temporary tables, but even there I'm not sure how visible the overhead
> will be on real-world workloads, and maybe the solution is to work
> towards not having permanent catalog entries for temporary tables in
> the first place. In any case, hurting people who use temporary tables
> heavily seems better than adding overhead to every
> insert/update/delete operation, which will hit all users who are not
> read-only.
Thinks...
If we added a trigger that fired a NOTIFY for any new row in pg_class relating to a non-temporary relation, that would optimise away any overhead for temporary tables or for when no changeset extraction is in progress.
The changeset extraction could build a private hash table to perform the lookup and then LISTEN on a specific channel for changes.
That might work better than an index-plus-syscache.
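Sketched in SQL, the idea might look roughly like this (illustration only: PostgreSQL does not currently allow triggers on system catalogs, and all names here are invented):

    -- hypothetical; a trigger on pg_class is not actually permitted today
    CREATE FUNCTION notify_new_relation() RETURNS trigger AS $$
    BEGIN
        IF NEW.relpersistence <> 't' THEN   -- skip temporary relations
            PERFORM pg_notify('pg_class_changes', NEW.relname);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;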
> On the other hand, I can't entirely shake the feeling that adding the
> information into WAL would be more reliable.
I don't really like the idea of requiring the relid on the WAL record. WAL is big enough already and we want people to turn this on, not avoid it.
This is just an index lookup. We do them all the time without any fear of reliability issues.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-28 12:26:52 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On the other hand, I can't entirely shake the feeling that adding the
> > information into WAL would be more reliable.
>
> That feeling has been nagging at me too. I can't demonstrate that
> there's a problem when an ALTER TABLE is in process of rewriting a table
> into a new relfilenode number, but I don't have a warm fuzzy feeling
> about the reliability of reverse lookups for this.

I am pretty sure the mapping thing works, but it indeed requires some complexity. And it's harder to debug, because when you want to understand what's going on, the relfilenodes involved aren't in the catalog anymore.

> At the very least
> it's going to require some hard-to-verify restriction about how we
> can't start doing changeset reconstruction in the middle of a
> transaction that's doing DDL.

Currently changeset extraction needs to wait (and does so) until it has found a point where it has seen the start of all in-progress transactions. All transactions that *commit* after the last partially observed in-progress transaction finished can be decoded. To make that point visible for external tools to synchronize -- e.g. pg_dump -- it exports the snapshot of exactly the moment when that last in-progress transaction committed.

So, from what I gather there's a slight leaning towards *not* storing the relation's oid in the WAL. Which means the concerns about the uniqueness issues with the syscaches need to be addressed. So far I know of three solutions:

1) Develop a custom caching/mapping module.
2) Make sure InvalidOid's (the only possible duplicate) can't end up in the syscache, by adding a hook that prevents that on the catcache level.
3) Make sure that there can't be any duplicates, by storing the oid of the relation in a mapped relation's relfilenode.

Opinions?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
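For reference, an external tool could adopt the snapshot exported above using SET TRANSACTION SNAPSHOT; a minimal sketch, where the snapshot identifier is made up:

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '000003A1-1';  -- identifier as exported by the slot (invented here)
    -- copy out the initial table contents; everything committed after
    -- this snapshot arrives through the changestream instead
    COMMIT;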
Since this discussion seems to have stalled, let me do a quick summary.

The goal of this subset of patches is to allow retroactive lookup of relations starting from a WAL record. Currently, the WAL record only tracks the relfilenode that it affects, so there are two possibilities:

1. we add some way to find out the relation OID from the relfilenode, or
2. we augment the WAL record with the relation OID.

Each solution has its drawbacks. For the former,

* we need a new cache
* we need a new pg_class index
* looking up the relation OID still requires some CPU runtime, and memory to keep the caches in; running invalidations, etc.

For the latter,

* each WAL record becomes somewhat bigger. For WAL records with a payload of 25 bytes (say, an inserted tuple which is 25 bytes long) this means about 7% overhead.

There are some other issues, but these can be solved. For instance, Tom doesn't want a syscache on top of a non-unique index, and I agree on that. But if we agree on this way forward, we can simply go a different route by keeping a separate cache layer.

So the question is, do we take the overhead of the new index (which means overhead on DDL operations -- supposedly rare) or do we take the overhead of larger WAL records (which means overhead on all DML operations)? Note we can make either thing apply only to people running logical replication.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So the question is, do we take the overhead of the new index (which
> means overhead on DDL operations -- supposedly rare) or do we take the
> overhead of larger WAL records (which means overhead on all DML
> operations)?

> Note we can make either thing apply only to people running logical
> replication.

I don't believe you can have or not have an index on pg_class as easily as all that. The choice would have to be frozen at initdb time, so people would have to pay the overhead if they thought there was even a small possibility that they'd want logical replication later.

Flipping the content of WAL records might not be a terribly simple thing to do either, but at least in principle it could be done during a postmaster restart, without initdb.

			regards, tom lane
On 2013-07-01 14:16:55 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > So the question is, do we take the overhead of the new index (which
> > means overhead on DDL operations -- supposedly rare) or do we take the
> > overhead of larger WAL records (which means overhead on all DML
> > operations)?
>
> I don't believe you can have or not have an index on pg_class as easily
> as all that. The choice would have to be frozen at initdb time, so
> people would have to pay the overhead if they thought there was even a
> small possibility that they'd want logical replication later.

It should be possible to create the index in a single database when we start logical replication in that database? Running the index creation with a fixed oid shouldn't require too much code. The oid won't be reused by other pg_class entries since it would be a system one.

Alternatively, we could always create the index's pg_class/pg_index entries but mark it as !indislive when logical replication isn't active for that database. Then activating it would just require rebuilding that index. But then, I am not fully convinced that's worth the trouble, since I don't think pg_class index maintenance is the pain spot in DDL atm.

> Flipping the content of WAL records might not be a terribly simple thing
> to do either, but at least in principle it could be done during a
> postmaster restart, without initdb.

The main patch combines various booleans in the heap wal records into a flags variable, so there should be enough space to keep track of it without increasing size. It makes size calculations a bit more annoying, though, as we use the xlog record length to calculate the heap tuple's length, but that's not a large problem. So we could just set the XLOG_HEAP_CONTAINS_CLASSOID flag if wal_level >= WAL_LEVEL_LOGICAL. WAL decoding can then throw a tantrum if it finds a record without it, and we're done. We could even make that per database, but that seems to be something for the future.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
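In code, the conditional flag just described could look roughly like the following; a minimal sketch where the bit value and the surrounding record-assembly code are assumptions (only the flag name and the wal_level test come from the discussion above):

    #define XLOG_HEAP_CONTAINS_CLASSOID   0x80    /* bit value assumed */

        /* while assembling an insert/update/delete WAL record: */
        if (wal_level >= WAL_LEVEL_LOGICAL)
        {
            xlrec.flags |= XLOG_HEAP_CONTAINS_CLASSOID;
            /* ... and append the table's pg_class OID to the payload ... */
        }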
On 27 June 2013 23:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Exactly what is the argument that says performance of this function is
> sufficiently critical to justify adding both the maintenance
> overhead of a new pg_class index, *and* a broken-by-design syscache?
I think we all agree on changing the syscache.
I'm not clear why adding a new permanent index to pg_class is such a problem. It's going to be a very thin index. I'm trying to imagine a use case that has pg_class index maintenance as a major part of its workload, and I can't. An extra index on pg_attribute and I might agree with you. The pg_class index would only be a noticeable % of catalog rows for very thin temp tables, and even then it would still be small; that isn't even necessary work, since we all agree that temp table overheads could and should be optimised away somewhere. So blocking a new index because of that sounds strange.
What issues do you foresee? How can we test them?
Or perhaps we should just add the index and see if we later discover a measurable problem workload?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Hm. There were some issues with the test_logical_decoding
> > Makefile not cleaning up the regression installation properly.
> > Which might have caused the issue.
> >
> > Could you try after applying the patches and executing a clean
> > and then rebuild?
>
> Tried, and problem persists.
>
> > Otherwise, could you try applying my git tree so we are sure we
> > test the same thing?
> >
> > $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
> > $ git fetch af
> > $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
> > $ ./configure ...
> > $ make
>
> Tried that, too, and problem persists. The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

Ok. I think I have a slight idea what's going on. Could you check whether recompiling with -O0 "fixes" the issue?

There's something strange going on here; I am not sure whether it's just a bug that's hidden by either not doing optimizations or by adding more elog()s, or whether it's a compiler bug.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-07-05 14:03:56 +0200, Andres Freund wrote:
> On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
> > Tried that, too, and problem persists. The log shows the last
> > commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
>
> Ok. I think I have a slight idea what's going on. Could you check
> whether recompiling with -O0 "fixes" the issue?
>
> There's something strange going on here; I am not sure whether it's just a
> bug that's hidden by either not doing optimizations or by adding more
> elog()s, or whether it's a compiler bug.

Ok. It was supreme stupidity on my end. Sorry for the time you spent on it.

Some versions of gcc (and probably other compilers) were removing sections of code when optimizing, because the code was doing undefined things: parts of the rdata chain were allocated locally inside an if (needs_key) block. Which obviously is utterly bogus... A warning would have been nice, though.

Fix pushed and attached.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
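Distilled, the bug pattern was roughly the following; a sketch with invented names, and the surrounding declarations elided:

    XLogRecData rdata[1];

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = sizeof(xlrec);
    rdata[0].buffer = InvalidBuffer;
    rdata[0].next = NULL;

    if (needs_key)
    {
        XLogRecData keydata;        /* WRONG: lifetime ends with this block */

        keydata.data = (char *) &key;
        keydata.len = sizeof(key);
        keydata.buffer = InvalidBuffer;
        keydata.next = NULL;
        rdata[0].next = &keydata;   /* dangles once the block closes */
    }

    /* undefined behaviour: the optimizer may assume &keydata is dead here */
    XLogInsert(RM_HEAP2_ID, info, rdata);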
On 07/05/2013 08:03 AM, Andres Freund wrote:
> On 2013-06-27 21:52:03 -0700, Kevin Grittner wrote:
>> Tried that, too, and problem persists. The log shows the last commit
>> on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.
> Ok. I think I have a slight idea what's going on. Could you check
> whether recompiling with -O0 "fixes" the issue?
>
> There's something strange going on here; I am not sure whether it's just a
> bug that's hidden by either not doing optimizations or by adding more
> elog()s, or whether it's a compiler bug.

I am getting the same test failure Kevin is seeing. This is on an x64 Debian wheezy machine with gcc (Debian 4.7.2-5) 4.7.2.

Building with -O0 results in passing tests.
On 2013-07-05 09:28:45 -0400, Steve Singer wrote:
> I am getting the same test failure Kevin is seeing.
> This is on an x64 Debian wheezy machine with
> gcc (Debian 4.7.2-5) 4.7.2
>
> Building with -O0 results in passing tests.

Does the patch from http://archives.postgresql.org/message-id/20130705132513.GB11640%40awork2.anarazel.de or the git tree (which is rebased on top of the mvcc catalog commit from Robert, which needs some changes) fix it, even with optimizations?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/05/2013 09:34 AM, Andres Freund wrote:
> Does the patch from
> http://archives.postgresql.org/message-id/20130705132513.GB11640%40awork2.anarazel.de
> or the git tree (which is rebased on top of the mvcc catalog commit from
> Robert, which needs some changes) fix it, even with optimizations?

Yes, with your latest git tree the tests pass with -O2.
On 06/14/2013 06:51 PM, Andres Freund wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

We discussed issues related to passing options to the plugins a number of months ago (http://www.postgresql.org/message-id/20130129015732.GA24238@awork2.anarazel.de). I'm still having issues with the syntax you describe there.

START_LOGICAL_REPLICATION "1" 0/0 ("foo","bar")
unexpected termination of replication stream: ERROR: foo requires a parameter

START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")
"START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")": ERROR: syntax error

START_LOGICAL_REPLICATION "1" 0/0 ("foo")
works okay

Steve
On 2013-07-05 11:33:20 -0400, Steve Singer wrote:
> We discussed issues related to passing options to the plugins a number of
> months ago (http://www.postgresql.org/message-id/20130129015732.GA24238@awork2.anarazel.de).
> I'm still having issues with the syntax you describe there.
>
> START_LOGICAL_REPLICATION "1" 0/0 ("foo","bar")
> unexpected termination of replication stream: ERROR: foo requires a parameter

I'd guess that's coming from your output plugin? You're using defGetString() on a DefElem without a value?

> START_LOGICAL_REPLICATION "1" 0/0 ("foo" "bar")

Yes, the option *names* are identifiers, together with plugin & slot names. The passed values need to be SCONSTs atm (src/backend/replication/repl_gram.y):

    plugin_opt_elem:
            IDENT plugin_opt_arg
                { $$ = makeDefElem($1, $2); }
            ;

    plugin_opt_arg:
            SCONST        { $$ = (Node *) makeString($1); }
            | /* EMPTY */ { $$ = NULL; }
            ;

So, it would have to be:

START_LOGICAL_REPLICATION "1" 0/0 ("foo" 'bar blub frob', "sup" 'star', "noarg")

Now that's not completely obvious, I admit :/. Better suggestions?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-28 21:47:47 +0200, Andres Freund wrote:
> So, from what I gather there's a slight leaning towards *not* storing
> the relation's oid in the WAL. Which means the concerns about the
> uniqueness issues with the syscaches need to be addressed. So far I know
> of three solutions:
> 1) Develop a custom caching/mapping module.
> 2) Make sure InvalidOid's (the only possible duplicate) can't end up in
>    the syscache, by adding a hook that prevents that on the catcache level.
> 3) Make sure that there can't be any duplicates, by storing the oid of
>    the relation in a mapped relation's relfilenode.

So, here are 4 patches:

1) add RelationMapFilenodeToOid()
2) Add a pg_class index on (reltablespace, relfilenode)
3a) Add a custom cache that maps from filenode to oid
3b) Add a catcache 'filter' that ensures the cache stays unique, and use that for the mapping
4) Add pg_relation_by_filenode() and use it in a regression test

3b) adds an optional 'filter' attribute to struct cachedesc in syscache.c, which is then passed to catcache.c. If it exists, catcache.c uses it -- after checking for a match in the cache -- to check whether the queried-for value should possibly end up in the cache. If not, it stores a whiteout entry, as is currently already done for nonexistent entries.

It also reorders some catcache.h struct attributes to make sure we're not growing them. It might make sense to apply that independently; those are rather heavily used.

I slightly prefer 3b) because it's smaller. What are your opinions?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
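The filter idea in 3b) boils down to something like the following sketch; the function name and hookup are assumptions, not the patch's actual code. pg_class tuples whose relfilenode is InvalidOid -- i.e. mapped relations -- are rejected before they can enter the cache, so every key that can actually be cached is unique:

    /* hypothetical filter callback for the (reltablespace, relfilenode) cache */
    static bool
    RelfilenodeCacheFilter(HeapTuple tuple)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        /* mapped relations store InvalidOid here; keep them out of the cache */
        return OidIsValid(classForm->relfilenode);
    }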
Andres Freund <andres@2ndquadrant.com> writes:
> 3b) Add a catcache 'filter' that ensures the cache stays unique, and use
>    that for the mapping

> I slightly prefer 3b) because it's smaller. What are your opinions?

This is just another variation on the theme of kluging the catcache to do something it shouldn't. You're still building a catcache on a non-unique index, and that is going to lead to trouble. (I'm a bit surprised that there is no Assert in catcache.c checking that the index nominated to support a catcache is unique ...)

			regards, tom lane
On 2013-07-07 15:43:17 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > 3b) Add a catcache 'filter' that ensures the cache stays unique, and use
> >    that for the mapping
>
> > I slightly prefer 3b) because it's smaller. What are your opinions?
>
> This is just another variation on the theme of kluging the catcache to
> do something it shouldn't. You're still building a catcache on a
> non-unique index, and that is going to lead to trouble.

I don't think the lurking dangers really are present. The index essentially *is* unique, since we filter away anything non-unique. The catcache code can hardly be confused by tuples it never sees. That would even work if we started preloading catcaches by doing scans of the entire underlying relation, or by caching all of a page when reading one of its tuples.

I can definitely see that there are "aesthetical" reasons against doing 3b); that's why I've also done 3a). So I'll chalk you up as voting for that...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Sorry for the delay in reviewing this. I must make sure never to take another vacation during a commitfest -- the backlog upon return is a killer....

Kevin Grittner <kgrittn@ymail.com> wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>> Otherwise, could you try applying my git tree so we are sure we
>> test the same thing?
>>
>> $ git remote add af git://git.postgresql.org/git/users/andresfreund/postgres.git
>> $ git fetch af
>> $ git checkout -b xlog-decoding af/xlog-decoding-rebasing-cf4
>> $ ./configure ...
>> $ make
>
> Tried that, too, and problem persists. The log shows the last
> commit on your branch as 022c2da1873de2fbc93ae524819932719ca41bdb.

The good news: the regression tests now work for me, and I'm back on testing this at a high level.

The bad news:

(1) The code checked out from that branch does not merge with master. Not surprisingly, given the recent commits, xlog.c is a problem. Is there another branch I should now be using? If not, please let me know when I can test with something that applies on top of the master branch.

(2) An initial performance test didn't look very good. I will be running a more controlled test to confirm, but the logical replication of a benchmark with a lot of UPDATEs of compressed text values seemed to suffer with logical replication turned on. Any suggestions or comments on that front, before I run the more controlled benchmarks?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-07-10 12:21:23 -0700, Kevin Grittner wrote:
> Sorry for the delay in reviewing this. I must make sure never to
> take another vacation during a commitfest -- the backlog upon
> return is a killer....

Heh. Yes. Been through it before...

> The good news: the regression tests now work for me, and I'm back
> on testing this at a high level.
>
> (1) The code checked out from that branch does not merge with
> master. Not surprisingly, given the recent commits, xlog.c is a
> problem. Is there another branch I should now be using? If not,
> please let me know when I can test with something that applies on
> top of the master branch.

That one is actually relatively easy to resolve. The mvcc catalog scan patch is slightly harder. I've pushed an updated patch that fixes the latter in a slightly not-so-nice way. I am not sure yet what the final fix for that is going to look like; it depends on whether we will get rid of SnapshotNow altogether... I'll push my local tree with that fixed in a sec.

> (2) An initial performance test didn't look very good. I will be
> running a more controlled test to confirm, but the logical
> replication of a benchmark with a lot of UPDATEs of compressed text
> values seemed to suffer with logical replication turned on.
> Any suggestions or comments on that front, before I run the more
> controlled benchmarks?

Hm. There theoretically shouldn't actually be anything added in that path. Could you roughly sketch what that test is doing? Do you actually stream those changes out, or did you just turn on wal_level=logical?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> (2) An initial performance test didn't look very good. I will be
>> running a more controlled test to confirm, but the logical
>> replication of a benchmark with a lot of UPDATEs of compressed text
>> values seemed to suffer with logical replication turned on.
>> Any suggestions or comments on that front, before I run the more
>> controlled benchmarks?
>
> Hm. There theoretically shouldn't actually be anything added in that
> path. Could you roughly sketch what that test is doing? Do you actually
> stream those changes out, or did you just turn on wal_level=logical?

It was an update of every row in a table of 720000 rows, with each row updated by primary key using a separate UPDATE statement, modifying a large text column with a lot of repeating characters (so it compressed well). I got a timing on a master build, and I got a timing with the patch in the environment used by test_logical_decoding. It took several times as long in the latter run, but it was very much a preliminary test in preparation for getting real numbers. (I'm sure you know how much work it is to set up for a good run of tests.) I'm not sure that (for example) the synchronous_commit setting was the same, which could matter a lot. I wouldn't put a lot of stock in it until I can re-create it under a much more controlled test.

The one thing about the whole episode that gave me pause was that the compression and decompression routines were very high in the `perf top` output in the patched run and way down the list in the run based on master. I don't have a ready explanation for that, unless your branch was missing a recent commit for speeding up compression which was present on master. It might be worth checking that you're not detoasting more often than you need to.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
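A rough reconstruction of the benchmark described above; the table name, row width, and payload are assumptions:

    -- sketch only; names and sizes invented to match the description
    CREATE TABLE t (id serial PRIMARY KEY, payload text);
    INSERT INTO t (payload)
        SELECT repeat('abc', 2000)          -- repetitive, compresses well
        FROM generate_series(1, 720000);
    -- the test then issues one statement per row, e.g.:
    UPDATE t SET payload = repeat('xyz', 2000) WHERE id = 1;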
On 2013-07-10 15:14:58 -0700, Kevin Grittner wrote:
> It was an update of every row in a table of 720000 rows, with
> each row updated by primary key using a separate UPDATE statement,
> modifying a large text column with a lot of repeating characters
> (so it compressed well). [...] I wouldn't put a lot of stock in it
> until I can re-create it under a much more controlled test.

So you didn't explicitly start anything to consume those changes? I.e. using pg_receivellog or SELECT * FROM start/init_logical_replication(...)?

Any chance there still was an old replication slot around? SELECT * FROM pg_stat_logical_decoding; should show them. But theoretically the make check in test_logical_decoding should finish without one active...

> The one thing about the whole episode that gave me pause was that
> the compression and decompression routines were very high in the
> `perf top` output in the patched run and way down the list in the
> run based on master.

That's interesting. Unless there's something consuming the changestream, and the output plugin does something that actually requests decompression of the Datums, there shouldn't be *any* added/removed calls to toast (de-)compression... While consuming the changes there could be ReorderBufferToast* calls in the profile. I haven't yet seen them in profiles, but that's not saying all that much. So:

> I don't have a ready explanation for that, unless your branch was
> missing a recent commit for speeding up compression which was present
> on master.

It didn't have 031cc55bbea6b3a6b67c700498a78fb1d4399476 -- but I can't really imagine that making *such* a big difference. But maybe you hit some sweet spot with the data?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote: > Any chance there still was an old replication slot around? It is quite likely that there was. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jul 7, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-07-07 15:43:17 -0400, Tom Lane wrote:
>> This is just another variation on the theme of kluging the catcache to
>> do something it shouldn't. You're still building a catcache on a
>> non-unique index, and that is going to lead to trouble.
>
> I don't think the lurking dangers really are present. The index
> essentially *is* unique, since we filter away anything non-unique. The
> catcache code can hardly be confused by tuples it never sees.
>
> I can definitely see that there are "aesthetical" reasons against doing
> 3b); that's why I've also done 3a). So I'll chalk you up as voting for
> that...

I also vote for (3a). I did a quick once-over of 1, 2, and 3a and they look reasonable. Barring strenuous objections, I'd like to go ahead and commit these, or perhaps an updated version of them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 16, 2013 at 9:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I also vote for (3a). I did a quick once-over of 1, 2, and 3a and
> they look reasonable. Barring strenuous objections, I'd like to go
> ahead and commit these, or perhaps an updated version of them.

Hearing no objections, I have done this. Per off-list discussion with Andres, I also included patch 4, which gives us regression test coverage for this code, and I have fixed a few bugs and a bunch of stylistic things that bugged me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 14, 2013 at 6:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
>
> On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
>> Overview of the attached patches:
>> 0001: indirect toast tuples; required but submitted independently
>> 0002: functions for testing; not required
>> 0003: (tablespace, filenode) syscache; required
>> 0004: RelationMapFilenodeToOid: required, simple
>> 0005: pg_relation_by_filenode() function; not required but useful
>> 0006: Introduce InvalidCommandId: required, simple
>> 0007: Adjust Satisfies* interface: required, mechanical
>> 0008: Allow walsender to attach to a database: required, needs review
>> 0009: New GetOldestXmin() parameter; required, pretty boring
>> 0010: Log xl_running_xact regularly in the bgwriter: required
>> 0011: make fsync_fname() public; required, needs to be in a different file
>> 0012: Relcache support for a Relation's primary key: required
>> 0013: Actual changeset extraction; required
>> 0014: Output plugin demo; not required (except for testing) but useful
>> 0015: Add pg_receivellog program: not required but useful
>> 0016: Add test_logical_decoding extension; not required, but contains
>>       the tests for the feature. Uses 0014
>> 0017: Snapshot building docs; not required

I've now also committed patch #7 from this series. My earlier commit fulfilled the needs of patches #3, #4, and #5; and somewhat longer ago I committed #1. I am not entirely convinced of the necessity or desirability of patch #6, but as of now I haven't studied the issues closely. Patch #2 does not seem useful in isolation; it adds new regression-testing stuff but doesn't use it anywhere.

I doubt that any of the remaining patches (#8-#17) can be applied separately without understanding the shape of the whole patch set, so I think I, or someone else, will need to set aside more time for detailed review before proceeding further with this patch set. I suggest that we close out the CommitFest entry for this patch set one way or another, as there is no way we're going to get the whole thing done under the auspices of CF1.

I'll try to find some more time to spend on this relatively soon, but I think this is about as far as I can take this today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-07-22 13:50:08 -0400, Robert Haas wrote:
> I've now also committed patch #7 from this series. My earlier commit
> fulfilled the needs of patches #3, #4, and #5; and somewhat longer ago
> I committed #1.

Thanks!

> I am not entirely convinced of the necessity or
> desirability of patch #6, but as of now I haven't studied the issues
> closely.

Fair enough. It's certainly possible to work around not having it, but it seems cleaner to introduce the notion of an invalid CommandId like we have for transaction ids et al. Allowing 2^32-2 instead of 2^32-1 subtransactions doesn't seem like a problem to me ;)

> Patch #2 does not seem useful in isolation; it adds new
> regression-testing stuff but doesn't use it anywhere.

Yes. I found it useful for testing stuff around making replication synchronous and such, but while I think we should have a facility like it in core for both logical and physical replication, I don't think this patch is ready for prime time due to its busy looping. I've even marked it as such above ;). My first idea for properly implementing that is to reuse the syncrep infrastructure, but that doesn't look trivial.

> I doubt that any of the remaining patches (#8-#17) can be applied
> separately without understanding the shape of the whole patch set, so
> I think I, or someone else, will need to set aside more time for
> detailed review before proceeding further with this patch set. I
> suggest that we close out the CommitFest entry for this patch set one
> way or another, as there is no way we're going to get the whole thing
> done under the auspices of CF1.

Generally agreed. The biggest chunk of the code is in #13 anyway... Some may be applicable independently:

> 0010: Log xl_running_xact regularly in the bgwriter: required

Should be useful independently, since it can significantly speed up the startup of physical replicas. On many systems checkpoint_timeout will be set to an hour, which can make the time until a standby gets consistent quite high, since that will be the first time it sees an xl_running_xacts record again.

> 0011: make fsync_fname() public; required, needs to be in a different file

Isn't in shape for it atm, but could be applied as an independent infrastructure patch. And it should be easy enough to clean it up.

> 0012: Relcache support for a Relation's primary key: required

Might actually be a good idea independently as well. E.g. the materialized key patch could use the information that there's a candidate key around to avoid a good bit of useless work.

> I'll try to find some more time to spend on this relatively soon, but
> I think this is about as far as I can take this today.

Was pretty helpful already, so ... ;)

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> The git tree is at:
> git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

I gave this recently rebased branch a skim. In general, the separation between decode.c/reorderbuffer.c/snapbuild.c seems a lot nicer now than on previous iterations -- good job there.

Here are some quick notes I took while reading the patch itself. I haven't gone through it really carefully, yet.

- I wonder whether DecodeCommit and DecodeAbort should really be a single routine. Right now, the former might call the latter; and the latter is aware of this. Seems awkward.

- We skip insert/update/delete if not my database Id; however, we don't skip commit in the same case. If there are two walrecvrs on a cluster, on different databases, does this lead to us trying to remove files twice, if a xact commits which deleted some files? Is this a problem? Should we try to skip such database-specific actions in global WAL records?

- There's rmgr-specific knowledge in decode.c. I wonder if, similar to redo and desc routines, that shouldn't instead be pluggable functions for each rmgr.

- What's with ReorderBufferRestoreCleanup()? Shouldn't it be in logical.c?

- reorderbuffer.c does several different things. Can it be split? Perhaps in pieces such as
  * stuff to manage memory (slab cache thingies)
  * TXN iterator
  * other logically separate parts?
  * the rest

- Having to expose LocalExecuteInvalidationMessage() looks awkward. Is there another way?

- I think we need a better name for "treat_as_catalog_table" (and RelationIsTreatedAsCatalogTable). Maybe replication_catalog or something similar?

- Don't do this:
  + * RecentGlobal(Data)?Xmin is initialized to InvalidTransactionId, to ensure that no
  because later greps for RecentGlobalDataXmin and RecentGlobalXmin will fail to find it. It seems better to spell out both names, so "RecentGlobalDataXmin and RecentGlobalXmin are initialized to ..."

- The pg_receivellog command line is strange. Apparently I need one or more of --start, --init, --stop, but if stop, then the other two must not be present; and if startpos, then init and stop cannot be specified. (There's a typo there that says "cannot combine with --start" when it really means "cannot combine with --stop", BTW.) I think it would make more sense to have init, start, stop be commands, in pg_ctl's spirit; so there would be no double-dash. IOW: SOMEPATH/pg_receivellog --startpos=123 start, and so on. Also, we need SGML docs for this new utility.

Any particular reason for removing this line?
-/* Get a new XLogReader */
+
 extern XLogReaderState *XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data);

Typo here (2n*d*Quadrant):
+= Snapshot Building =
+:author: Andres Freund, 2nQuadrant Ltd

I don't see the point of XLogRecordBuffer.record_data; we already have a pointer to the XLogRecord, and the data can readily be obtained using XLogRecGetData. So why provide the same thing twice? It seems to me that if instead of passing the XLogRecordBuffer we just provide the XLogRecord, and separately the "origptr" where needed, we could avoid having to expose the XLogRecordBuffer stuff unnecessarily.

In this comment:
+ * FIXME: We need something resembling the real SnapshotNow to handle things
+ * like enum lookups from indices correctly.
what do we need to consider in light of the new comment proposed by Robert in CA+TgmobvTjRj_doXxQ0wgA1a1JLYPVYqtR3m+Cou_ousabnmXg@mail.gmail.com ?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-08-27 11:32:30 -0400, Alvaro Herrera wrote: > Andres Freund wrote: > > The git tree is at: > > git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4 > > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4 > > I gave this recently rebased branch a skim. In general, the separation > between decode.c/reorderbuffer.c/snapbuild.c seems a lot nicer now than > on previous iterations -- good job there. Thanks for having a look! > Here are some quick notes I took while reading the patch itself. I > haven't gone through it really carefully, yet. > > > - I wonder whether DecodeCommit and DecodeAbort should really be a single > routine. Right now, the former might call the later; and the latter is > aware of this. Seems awkward. Yes, I am not happy with that either. I'll play with combining them and check whether that looks beter. > - We skip insert/update/delete if not my database Id; however, we don't skip > commit in the same case. If there are two walrecvrs on a cluster, on > different databases, does this lead to us trying to remove files > twice, if a xact commits which deleted some files? Is this a problem? > Should we try to skip such database-specific actions in global > WAL records? Hm. We should be able to skip it for long commit records at least. I think I lost that code along the unification. There's no danger of removing anything global afaics since we're not replaying using the original replay routines and all the slot/sender specific stuff has unique names. > - There's rmgr-specific knowledge in decode.c. I wonder if, similar to > redo and desc routines, that shouldn't instead be pluggable functions > for each rmgr. I don't think that's a good idea. I've quickly played with it before and it doesn't seem to end happy. It would require opening up more semi-public interfaces and in the end, we're only interested of in-core stuff. Even if it were possible to add new indexes by plugging new rmgrs, we wouldn't care. > - What's with ReorderBufferRestoreCleanup()? Shouldn't it be in logical.c? No, that's just for removing ondisk data at the end of a transaction. I'll improve the comment. > - reorderbuffer.c does several different things. Can it be split? > Perhaps in pieces such as > * stuff to manage memory (slab cache thingies) > * TXN iterator > * other logically separate parts? > * the rest Hm. I don't really see much point in splitting it along those lines. None of those really makes sense without the other parts and the file isn't *that* huge. > - Having to expose LocalExecuteInvalidationMessage() looks awkward. Is there > another way? Hm. I don't immediately see any way. We need to execute invalidation messages just within one backend. There just is no exposed functionality for that yet since it wasn't needed so far. We could expose something like LocalExecuteInvalidationMessage*s*() instead of doing the loop in reorderbuffer.c, but that's about it. > - I think we need a better name for "treat_as_catalog_table" (and > RelationIsTreatedAsCatalogTable). Maybe replication_catalog or > something similar? I think we're going to end up needing that for more than just replication, so I'd like to keep replication out of the name. I don't like the current name either though, so any other ideas? 
> - Don't do this: > + * RecentGlobal(Data)?Xmin is initialized to InvalidTransactionId, to ensure that no > because later greps for RecentGlobalDataXmin and RecentGlobalXmin will > fail to find it. It seems better to spell both names, so > "RecentGlobalDataXmin and RecentGlobalXmin are initialized to ..." Ok. > - the pg_receivellog command line is strange. Apparently I need one or > more of --start,--init,--stop, but if stop, then the other two must > not be present; and if startpos, then init and stop cannot be > specified. (There's a typo there that says "cannot combine with > --start" when it really means "cannot combine with --stop", BTW). I > think this would make more sense to have init,start,stop be commands, > in pg_ctl's spirit; so there would be no double-dash. IOW > SOMEPATH/pg_receivellog --startpos=123 start > and so on. The reasoning here is somewhat complex and I am not happy with the status quo, so I like getting input here. The individual verbs mean: * init: create a replication slot * start: continue streaming in an existing replication slot * stop: remove replication slot The reason you cannot specify anything with --stop is that a) --start streams until you abort the utility. So there's no chance of running --stop after it. b) --init and --stop seems like a pointless combination since you cannot actually do anything with the slot. --init and --start combined, on the other hand are useful for testing, which is why I allow them so far, but I wouldn't have problems removing that capability. The reason you cannot combine --init or --init --start with --startpos is that --startpos has to refer to a location that could have actually streamed to the client. Before a replication slot is established the client doesn't know anything about such an address, so --init --start cannot know any useful --startpos, that's why it's forbidden to pass one. The idea behind startpos is that you can tell the server "I have replayed transactions up to this LSN" and the server will only give you only transactions that have commited after this. > Also, we need SGML docs for this new utility. And a lot more than only for this utility :( > Any particular reason for removing this line: > -/* Get a new XLogReader */ > + > extern XLogReaderState *XLogReaderAllocate(XLogPageReadCB pagereadfunc, > void *private_data); Hrmpf. Merge error. I've integrated too many different versions of too different xlogreaders ;) > I don't see the point of XLogRecordBuffer.record_data; we already have a > pointer to the XLogRecord, and the data can readily be obtained using > XLogRecGetData. So why provide the same thing twice? It seems to me > that if instead of passing the XLogRecordBuffer we just provide the > XLogRecord, and separately the "origptr" where needed, we could avoid > having to expose the XLogRecordBuffer stuff unnecessarily. By now we also need the end location of a wal record. So we have to pass three addresses around for everything which isn't very convenient. If you vastly prefer passing around three parameters I can do that, but I'd rather not. The original reason for doing so was, to be honest, that my own xlogreader's API was different... > In this comment: > + * FIXME: We need something resembling the real SnapshotNow to handle things > + * like enum lookups from indices correctly. 
> what do we need to consider in light of the new comment proposed by
> Robert in
> CA+TgmobvTjRj_doXxQ0wgA1a1JLYPVYqtR3m+Cou_ousabnmXg@mail.gmail.com

I did most of the code changes for this, but it made me realize that
there are quite a few more comments and even a function name that need
adapting. Will work on that.

Thanks!

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
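A minimal sketch of the plural variant floated above (the plural name
and the array-based signature are hypothetical, proposed in the reply;
LocalExecuteInvalidationMessage() and SharedInvalidationMessage are the
existing function and type):

    #include "postgres.h"
    #include "storage/sinval.h"
    #include "utils/inval.h"

    /*
     * Hypothetical wrapper: execute an array of invalidation messages
     * within the current backend only, instead of open-coding the loop
     * in reorderbuffer.c.
     */
    void
    LocalExecuteInvalidationMessages(SharedInvalidationMessage *msgs, int nmsgs)
    {
        int i;

        for (i = 0; i < nmsgs; i++)
            LocalExecuteInvalidationMessage(&msgs[i]);
    }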
Hi,

I've attached a couple of the preliminary patches to $subject which I've
recently cleaned up, in the hope that we can continue improving on those
in a piecemeal fashion. I am preparing submission of a newer version of
the major patch, but unfortunately progress on that is slower than I'd
like...

In order of their chance of being applied individually, they are:

0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
* benefits hot standby startup

0003 wal_decoding: Allow walsender's to connect to a specific database
* biggest problem is how to specify the connection we connect to.
  Currently with the patch walsender connects to a database if it's not
  named "replication" (via dbname). Perhaps it's better to invent a
  replication_dbname parameter?

0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
* Pretty trivial and boring.

0007 wal_decoding: Add information about a tables primary key to struct RelationData
* Could be used in the matview refresh code

0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0002-wal_decoding-Introduce-InvalidCommandId-and-declare-.patch
- 0003-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch
- 0005-wal_decoding-Log-xl_running_xact-s-at-a-higher-frequ.patch
- 0006-wal_decoding-copydir-move-fsync_fname-to-fd.-c.h-and.patch
- 0007-wal_decoding-Add-information-about-a-tables-primary-.patch
On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> * benefits hot standby startup

Review:

1. I think more comments are needed here to explain why we need this.
I don't know if the comments should go into the functions modified by
this patch or in some other location, but I don't find what's here now
adequate for understanding.

2. I think the variable naming could be better. If nothing else, I'd
spell out "snapshot" rather than abbreviating it to "snap". I'd also
add comments explaining what each of those variables does. And why
isn't log_snap_interval_ms a #define rather than a variable? (Don't
even talk to me about using gdb on a running instance. If you're even
thinking about that, this needs to be a GUC.)

3. Why does LogCurrentRunningXacts() need to call
XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get sync'd
in a reasonably timely fashion; I can't see off-hand why this one
should need special handling.

> 0003 wal_decoding: Allow walsender's to connect to a specific database
> * biggest problem is how to specify the connection we connect to.
>   Currently with the patch walsender connects to a database if it's
>   not named "replication" (via dbname). Perhaps it's better to invent a
>   replication_dbname parameter?

I understand why logical replication needs to connect to a database,
but I don't understand why any other walsender would need to connect to
a database. Absent a clear use case for such a thing, I don't think we
should allow it. Ignorant suggestion: perhaps the database name could
be stored in the logical replication slot.

> 0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
> * Pretty trivial and boring.

Seems fine.

> 0007 wal_decoding: Add information about a tables primary key to struct RelationData
> * Could be used in the matview refresh code

I think you and Kevin should discuss whether this is actually the right
way to do this. ISTM that if logical replication and materialized views
end up selecting different approaches to this problem, everybody loses.

> 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

I'm still unconvinced we want this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> > * benefits hot standby startup
>
> Review:
>
> 1. I think more comments are needed here to explain why we need this.
> I don't know if the comments should go into the functions modified by
> this patch or in some other location, but I don't find what's here now
> adequate for understanding.

Hm. What information are you actually missing? I guess the
XLogSetAsyncXactLSN() needs a bit more context based on your question;
what else? Not sure if it makes sense to explain in detail why it helps
us to get into a consistent state faster?

> 2. I think the variable naming could be better. If nothing else, I'd
> spell out "snapshot" rather than abbreviating it to "snap". I'd also
> add comments explaining what each of those variables does.

Ok.

> And why isn't log_snap_interval_ms a #define rather than a variable?
> (Don't even talk to me about using gdb on a running instance. If
> you're even thinking about that, this needs to be a GUC.)

Ugh. It certainly doesn't have anything to do with wanting to change it
on a running system using gdb. Brrr. I think I wanted it to be a
constant variable but forgot the const. I personally prefer 'static
const' to #defines if it's legal C, but I guess the project's style
differs, so I'll change that.

> 3. Why does LogCurrentRunningXacts() need to call
> XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get
> sync'd in a reasonably timely fashion; I can't see off-hand why this
> one should need special handling.

No, we don't force writing out WAL records in a timely fashion if
there's no pressure in wal_buffers; basically only commits and various
XLogFlush()es do. It doesn't make much of a difference if the entire
system is busy, but if it's not, the wal writer will sleep. The
alternative would be to XLogFlush() the record, but that would actually
block, which isn't really what we want/need.

> > 0003 wal_decoding: Allow walsender's to connect to a specific database
> > * biggest problem is how to specify the connection we connect to.
> >   Currently with the patch walsender connects to a database if it's
> >   not named "replication" (via dbname). Perhaps it's better to invent a
> >   replication_dbname parameter?

> I understand why logical replication needs to connect to a database,
> but I don't understand why any other walsender would need to connect
> to a database.

Well, logical replication actually streams out data using the
walsender, so that's the reason why I want to add it there. But there
have been cases in the past where we wanted to do stuff in the
walsender that needed database access, and we couldn't do so because
you cannot connect to one.

> Absent a clear use case for such a thing, I don't
> think we should allow it. Ignorant suggestion: perhaps the database
> name could be stored in the logical replication slot.

The problem is that you need to InitPostgres() with a database. You
cannot do that again after connecting with an empty database, which we
do in a plain walsender.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
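For concreteness, the two spellings at issue (a sketch; the spelled-out
name follows Robert's naming suggestion, and the 15000 ms value matches
the 15 s interval discussed below):

    /* Interval between xl_running_xacts records, in milliseconds. */
    #define LOG_SNAPSHOT_INTERVAL_MS 15000              /* usual project style */

    static const int log_snapshot_interval_ms = 15000;  /* 'static const' variant */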
On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> 1. I think more comments are needed here to explain why we need this.
>> I don't know if the comments should go into the functions modified by
>> this patch or in some other location, but I don't find what's here now
>> adequate for understanding.
>
> Hm. What information are you actually missing? I guess the
> XLogSetAsyncXactLSN() needs a bit more context based on your question;
> what else? Not sure if it makes sense to explain in detail why it
> helps us to get into a consistent state faster?

Well, we must have had some idea in mind when the original Hot Standby
patch went in that doing this once per checkpoint was good enough. Now
we think we need it every 15 seconds, but not more or less often. So,
why the change of heart?

To my way of thinking, it seems as though we ought to always begin
replay at a checkpoint, so the standby ought always to see one of these
records immediately. Obviously that's not good enough, but why not?
And why is every 15 seconds good enough?

>> 3. Why does LogCurrentRunningXacts() need to call
>> XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get
>> sync'd in a reasonably timely fashion; I can't see off-hand why this
>> one should need special handling.
>
> No, we don't force writing out WAL records in a timely fashion if
> there's no pressure in wal_buffers; basically only commits and various
> XLogFlush()es do. It doesn't make much of a difference if the entire
> system is busy, but if it's not, the wal writer will sleep. The
> alternative would be to XLogFlush() the record, but that would actually
> block, which isn't really what we want/need.

The WAL writer is supposed to call XLogBackgroundFlush() every time
WalWriterDelay expires. Yeah, it can hibernate, but if it's
hibernating, then we should respect that decision for this WAL record
type also.

>> > 0003 wal_decoding: Allow walsender's to connect to a specific database
>> > * biggest problem is how to specify the connection we connect to.
>> >   Currently with the patch walsender connects to a database if it's
>> >   not named "replication" (via dbname). Perhaps it's better to invent a
>> >   replication_dbname parameter?
>
>> I understand why logical replication needs to connect to a database,
>> but I don't understand why any other walsender would need to connect
>> to a database.
>
> Well, logical replication actually streams out data using the
> walsender, so that's the reason why I want to add it there. But there
> have been cases in the past where we wanted to do stuff in the
> walsender that needed database access, and we couldn't do so because
> you cannot connect to one.

Could you be more specific?

>> Absent a clear use case for such a thing, I don't
>> think we should allow it. Ignorant suggestion: perhaps the database
>> name could be stored in the logical replication slot.
>
> The problem is that you need to InitPostgres() with a database. You
> cannot do that again after connecting with an empty database, which we
> do in a plain walsender.

Are you saying that the logical replication slot can't be read before
calling InitPostgres()?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-03 12:22:22 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Well, we must have had some idea in mind when the original Hot Standby
> patch went in that doing this once per checkpoint was good enough.
> Now we think we need it every 15 seconds, but not more or less often.
> So, why the change of heart?

I think the primary reason for that was that it was a pretty complicated
patchset and we needed to start somewhere. By now we do have reports of
standbys taking their time to get consistent.

> To my way of thinking, it seems as though we ought to always begin
> replay at a checkpoint, so the standby ought always to see one of
> these records immediately. Obviously that's not good enough, but why
> not?

We always see one after the checkpoint (well, actually before the
checkpoint record, but ...), correct. The problem is just that reading a
single xl_running_xacts record doesn't automatically make you
consistent. If there's a single suboverflowed transaction running on the
primary when the xl_running_xacts is logged, we won't be able to switch
to consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some
fun and some optimizations.

Since the only place where we currently have the information to
potentially become consistent is ProcArrayApplyRecoveryInfo(), we will
have to wait up to checkpoint_timeout until we get consistent. Which
sucks, as there are good arguments to set that to 1h. That especially
sucks as you lose consistency every time you restart the standby...

> And why is every 15 seconds good enough?

Waiting 15s to become consistent instead of checkpoint_timeout seems to
be ok to me and to be a good tradeoff between overhead and waiting. We
can certainly discuss other values or making it configurable. The latter
seemed to be unnecessary to me, but I don't have a problem implementing
it. I just don't want to document it :P

> >> 3. Why does LogCurrentRunningXacts() need to call
> >> XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get
> >> sync'd in a reasonably timely fashion; I can't see off-hand why this
> >> one should need special handling.
> >
> > No, we don't force writing out WAL records in a timely fashion if
> > there's no pressure in wal_buffers; basically only commits and various
> > XLogFlush()es do. It doesn't make much of a difference if the entire
> > system is busy, but if it's not, the wal writer will sleep. The
> > alternative would be to XLogFlush() the record, but that would actually
> > block, which isn't really what we want/need.

> The WAL writer is supposed to call XLogBackgroundFlush() every time
> WalWriterDelay expires. Yeah, it can hibernate, but if it's
> hibernating, then we should respect that decision for this WAL record
> type also.

Why should we respect it? There is work to be done, and the wal writer
has no way of knowing that without us telling it. Normally we rely on
commit records and XLogFlush()es to trigger the wal writer.
Alternatively we can start a transaction and set synchronous_commit =
off, but that seems like a complication to me.

> >> I understand why logical replication needs to connect to a database,
> >> but I don't understand why any other walsender would need to connect
> >> to a database.
> >
> > Well, logical replication actually streams out data using the
> > walsender, so that's the reason why I want to add it there. But there
> > have been cases in the past where we wanted to do stuff in the
> > walsender that needed database access, and we couldn't do so because
> > you cannot connect to one.

> Could you be more specific?

I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
before.

> >> Absent a clear use case for such a thing, I don't
> >> think we should allow it. Ignorant suggestion: perhaps the database
> >> name could be stored in the logical replication slot.
> >
> > The problem is that you need to InitPostgres() with a database. You
> > cannot do that again after connecting with an empty database, which we
> > do in a plain walsender.
>
> Are you saying that the logical replication slot can't be read before
> calling InitPostgres()?

The slot can be read just fine, but we won't know that we should do
that. Walsender accepts commands via PostgresMain()'s mainloop, which
has done an InitPostgres(dbname) before; we need to do that because we
need the environment it sets up. The database is stored in the slots
btw (as oid, not as a name though) ;)

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
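A minimal sketch of the nudge being defended above (LogStandbySnapshot()
and XLogSetAsyncXactLSN() are the existing functions with these
signatures; the wrapper and its name are hypothetical):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/standby.h"

    static void
    LogRunningXactsSnapshot(void)
    {
        XLogRecPtr recptr;

        /* Emit an xl_running_xacts record describing current activity. */
        recptr = LogStandbySnapshot();

        /*
         * Don't XLogFlush() here; that can block for a measurable time.
         * Instead mark the record's LSN as an async commit target, so the
         * wal writer writes it out even though the containing page may
         * never fill up.
         */
        XLogSetAsyncXactLSN(recptr);
    }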
On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> To my way of thinking, it seems as though we ought to always begin
>> replay at a checkpoint, so the standby ought always to see one of
>> these records immediately. Obviously that's not good enough, but why
>> not?
>
> We always see one after the checkpoint (well, actually before the
> checkpoint record, but ...), correct. The problem is just that reading a
> single xl_running_xacts record doesn't automatically make you
> consistent. If there's a single suboverflowed transaction running on the
> primary when the xl_running_xacts is logged, we won't be able to switch
> to consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some
> fun and some optimizations.
> Since the only place where we currently have the information to
> potentially become consistent is ProcArrayApplyRecoveryInfo(), we will
> have to wait up to checkpoint_timeout until we get consistent. Which
> sucks, as there are good arguments to set that to 1h.
> That especially sucks as you lose consistency every time you restart the
> standby...

Right, OK.

>> And why is every 15 seconds good enough?
>
> Waiting 15s to become consistent instead of checkpoint_timeout seems to
> be ok to me and to be a good tradeoff between overhead and waiting. We
> can certainly discuss other values or making it configurable. The latter
> seemed to be unnecessary to me, but I don't have a problem implementing
> it. I just don't want to document it :P

I don't think it particularly needs to be configurable, but I wonder if
we can't be a bit smarter about when we do it. For example, suppose we
logged it every 15 s but only until we log a non-overflowed snapshot. I
realize that the overhead of a WAL record every 15 seconds is fairly
small, but the load on some systems is all but nonexistent. It would be
nice not to wake up the HD unnecessarily.

>> The WAL writer is supposed to call XLogBackgroundFlush() every time
>> WalWriterDelay expires. Yeah, it can hibernate, but if it's
>> hibernating, then we should respect that decision for this WAL record
>> type also.
>
> Why should we respect it?

Because I don't see any reason to believe that this WAL record is any
more important or urgent than any other WAL record that might get
logged.

>> >> I understand why logical replication needs to connect to a database,
>> >> but I don't understand why any other walsender would need to connect
>> >> to a database.
>> >
>> > Well, logical replication actually streams out data using the
>> > walsender, so that's the reason why I want to add it there. But there
>> > have been cases in the past where we wanted to do stuff in the
>> > walsender that needed database access, and we couldn't do so because
>> > you cannot connect to one.
>
>> Could you be more specific?
>
> I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
> before.

It seems we need some more design there. Perhaps entering replication
mode could be triggered by writing either dbname=replication or
replication=yes. But then, do the replication commands simply become
SQL commands? I've certainly seen hackers use them that way. And I can
imagine that being a sensible approach, but this patch seems like it's
only covering a fairly small fraction of what really ought to be a
single commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-03 15:56:15 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> And why is every 15 seconds good enough?
> >
> > Waiting 15s to become consistent instead of checkpoint_timeout seems to
> > be ok to me and to be a good tradeoff between overhead and waiting. We
> > can certainly discuss other values or making it configurable. The latter
> > seemed to be unnecessary to me, but I don't have a problem implementing
> > it. I just don't want to document it :P
>
> I don't think it particularly needs to be configurable, but I wonder
> if we can't be a bit smarter about when we do it. For example,
> suppose we logged it every 15 s but only until we log a non-overflowed
> snapshot.

There are actually more benefits than just overflowed snapshots (pruning
of the known xids machinery, exclusive lock cleanup).

> I realize that the overhead of a WAL record every 15
> seconds is fairly small, but the load on some systems is all but
> nonexistent. It would be nice not to wake up the HD unnecessarily.

The patch as-is only writes if there has been WAL written since the last
time it logged a running_xacts. I think it's not worth building more
smarts than that?

> >> The WAL writer is supposed to call XLogBackgroundFlush() every time
> >> WalWriterDelay expires. Yeah, it can hibernate, but if it's
> >> hibernating, then we should respect that decision for this WAL record
> >> type also.
> >
> > Why should we respect it?
>
> Because I don't see any reason to believe that this WAL record is any
> more important or urgent than any other WAL record that might get
> logged.

I can't follow the logic behind that statement. Just about all WAL
records are either pretty immediately flushed afterwards or are done in
the context of a transaction where we flush (or do an
XLogSetAsyncXactLSN) at transaction commit.

XLogBackgroundFlush() won't necessarily flush the running_xacts record.
Unless you've set the async xact lsn, it will only flush complete
blocks. So what can happen (I've seen it more than once in testing;
took me a while to debug) is that a checkpoint is started in a busy
period but nothing happens after it finishes. Since the
checkpoint-triggered running_xacts record is logged *before* we do the
smgr flush, it is still overflowed. Then, after activity has died down,
the bgwriter issues the running-xacts record, but it only partially
fills a block and thus never gets flushed.

To me the alternatives are to do an XLogSetAsyncXactLSN() or an
XLogFlush(). The latter is more aggressive and can block for a
measurable amount of time, which is why I don't want to do it in the
bgwriter.

> >> >> I understand why logical replication needs to connect to a database,
> >> >> but I don't understand why any other walsender would need to connect
> >> >> to a database.
> >> >
> >> > Well, logical replication actually streams out data using the
> >> > walsender, so that's the reason why I want to add it there. But there
> >> > have been cases in the past where we wanted to do stuff in the
> >> > walsender that needed database access, and we couldn't do so because
> >> > you cannot connect to one.
> >
> >> Could you be more specific?
> >
> > I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
> > before.
>
> It seems we need some more design there. Perhaps entering replication
> mode could be triggered by writing either dbname=replication or
> replication=yes. But then, do the replication commands simply become
> SQL commands?
> I've certainly seen hackers use them that way. And I
> can imagine that being a sensible approach, but this patch seems like
> it's only covering a fairly small fraction of what really ought to be
> a single commit.

Yes. I think you're right that we need more input/design here. I've
previously started threads about it, but nobody replied :(.

The problem with using dbname=replication as a trigger for anything is
that we actually allow databases to be created with that name. Perhaps
that was a design mistake.

I wondered about turning replication from a boolean into something like
off|0, on|1, database. dbname= gets only used in the latter variant.
That would be compatible with previous versions and would even support
using old tools (since all of them seem to do replication=1).

> But then, do the replication commands simply become
> SQL commands? I've certainly seen hackers use them that way.

I don't think that it's a good way at this point to make them plain
SQL. There is more infrastructure (signal handlers, permissions,
different timeouts) & memory required for walsenders, so using plain
SQL there seems beyond the scope of this.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
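The write-only-if-new-WAL test Andres describes above might look roughly
like this in the bgwriter loop (a sketch reusing the interval constant
shown earlier; the two static bookkeeping variables are assumed, while
the functions called are the existing ones):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/standby.h"
    #include "utils/timestamp.h"

    static TimestampTz last_snapshot_ts;
    static XLogRecPtr  last_snapshot_lsn = InvalidXLogRecPtr;

    /* inside the bgwriter main loop (sketch): */
    {
        TimestampTz now = GetCurrentTimestamp();

        /*
         * Only log a new xl_running_xacts if the interval has elapsed and
         * something has actually been inserted into WAL since the last
         * one; otherwise an idle system stays idle.
         */
        if (TimestampDifferenceExceeds(last_snapshot_ts, now,
                                       LOG_SNAPSHOT_INTERVAL_MS) &&
            last_snapshot_lsn != GetXLogInsertRecPtr())
        {
            last_snapshot_lsn = LogStandbySnapshot();
            last_snapshot_ts = now;
        }
    }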
On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think it particularly needs to be configurable, but I wonder
>> if we can't be a bit smarter about when we do it. For example,
>> suppose we logged it every 15 s but only until we log a non-overflowed
>> snapshot.
>
> There are actually more benefits than just overflowed snapshots (pruning
> of the known xids machinery, exclusive lock cleanup).

I know that, but I thought the master and slave could only lose sync on
those things after a master crash and that once per checkpoint cycle was
enough for those other benefits. Am I wrong?

> The patch as-is only writes if there has been WAL written since the last
> time it logged a running_xacts. I think it's not worth building more
> smarts than that?

Hmm, maybe.

>> Because I don't see any reason to believe that this WAL record is any
>> more important or urgent than any other WAL record that might get
>> logged.
>
> I can't follow the logic behind that statement. Just about all WAL
> records are either pretty immediately flushed afterwards or are done in
> the context of a transaction where we flush (or do an
> XLogSetAsyncXactLSN) at transaction commit.
>
> XLogBackgroundFlush() won't necessarily flush the running_xacts
> record.

OK, this was the key point I was missing.

>> It seems we need some more design there. Perhaps entering replication
>> mode could be triggered by writing either dbname=replication or
>> replication=yes. But then, do the replication commands simply become
>> SQL commands? I've certainly seen hackers use them that way. And I
>> can imagine that being a sensible approach, but this patch seems like
>> it's only covering a fairly small fraction of what really ought to be
>> a single commit.
>
> Yes. I think you're right that we need more input/design here. I've
> previously started threads about it, but nobody replied :(.
>
> The problem with using dbname=replication as a trigger for anything is
> that we actually allow databases to be created with that name. Perhaps
> that was a design mistake.

It seemed like a good idea at the time, but maybe it wasn't. I'm not
sure where to go with it at this point; a forcible backward
compatibility break would probably screw things up for a lot of people.

> I wondered about turning replication from a boolean into something like
> off|0, on|1, database. dbname= gets only used in the latter variant.
> That would be compatible with previous versions and would even support
> using old tools (since all of them seem to do replication=1).

I don't love that, but I don't hate it, either. But it still doesn't
answer the following question, which I think is important: if I (or
someone else) commits this patch, how will that make things better for
users? At the moment it's just adding a knob that doesn't do anything
for you when you twist it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-04 10:02:05 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think it particularly needs to be configurable, but I wonder
> >> if we can't be a bit smarter about when we do it. For example,
> >> suppose we logged it every 15 s but only until we log a non-overflowed
> >> snapshot.
> >
> > There are actually more benefits than just overflowed snapshots (pruning
> > of the known xids machinery, exclusive lock cleanup).

> I know that, but I thought the master and slave could only lose sync
> on those things after a master crash and that once per checkpoint
> cycle was enough for those other benefits. Am I wrong?

The xid tracking can keep track without the additional records, but it
sometimes needs a good bit more memory to do so if the primary burns
through xids quite fast. Every time we see a running-xacts record we can
do cleanup (that's the ExpireOldKnownAssignedTransactionIds() in
ProcArrayApplyRecoveryInfo()).

> > The problem with using dbname=replication as a trigger for anything is
> > that we actually allow databases to be created with that name. Perhaps
> > that was a design mistake.
>
> It seemed like a good idea at the time, but maybe it wasn't. I'm not
> sure where to go with it at this point; a forcible backward
> compatibility break would probably screw things up for a lot of
> people.

Yes, breaking things now doesn't seem like a good idea.

> > I wondered about turning replication from a boolean into something like
> > off|0, on|1, database. dbname= gets only used in the latter variant.
> > That would be compatible with previous versions and would even support
> > using old tools (since all of them seem to do replication=1).
>
> I don't love that, but I don't hate it, either.

Ok. Will update the patch that way. Seems better than its current state.

> But it still doesn't
> answer the following question, which I think is important: if I (or
> someone else) commits this patch, how will that make things better for
> users? At the moment it's just adding a knob that doesn't do anything
> for you when you twist it.

I am not sure it's a good idea to commit it before we're sure we're
going to commit the changeset extraction. It's an independently
reviewable and testable piece of code that's simple enough to understand
quickly, in contrast to the large changeset extraction patch. That's why
I kept it separate. On the other hand, as you know, it's not without
precedent to commit pieces of infrastructure that aren't really useful
for the end user (think FDW).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
>
> I'm still unconvinced we want this.

Ok, so the reason for the existence of this patch is that currently
there is no way to represent an "unset" CommandId. This is a problem for
the following patches because we need to log the cmin, cmax of catalog
rows, and obviously there can be rows where cmax is unset.

The reason I chose to change the definition of CommandIds is that the
other ondisk types we use, like TransactionIds, XLogRecPtrs and such,
have an "invalid" value; CommandIds don't. Changing their definition to
have 0 - analogous to the previous examples - as their invalid value is
not a problem because CommandIds from pg_upgraded clusters may never be
used for anything. Going from 2^32 to 2^32-1 possible CommandIds doesn't
seem like a problem to me. Imo the CommandIds should have been defined
that way from the start.

Makes some sense?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
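To illustrate where the invalid value is needed: the decoding patches
have to log per-catalog-row command ids roughly of this shape (the
struct name and layout here are hypothetical, for illustration only):

    /* Hypothetical shape of the logged cid information for a catalog row. */
    typedef struct LoggedTupleCids
    {
        CommandId   cmin;   /* command that inserted the row */
        CommandId   cmax;   /* command that deleted the row, or
                             * InvalidCommandId if it never was deleted */
    } LoggedTupleCids;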
On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
>> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
>>
>> I'm still unconvinced we want this.
>
> Ok, so the reason for the existence of this patch is that currently
> there is no way to represent an "unset" CommandId. This is a problem for
> the following patches because we need to log the cmin, cmax of catalog
> rows, and obviously there can be rows where cmax is unset.

For heap tuples, we solve this problem by using flag bits. Why not
adopt the same approach?

> The reason I chose to change the definition of CommandIds is that the
> other ondisk types we use, like TransactionIds, XLogRecPtrs and such,
> have an "invalid" value; CommandIds don't. Changing their definition to
> have 0 - analogous to the previous examples - as their invalid value is
> not a problem because CommandIds from pg_upgraded clusters may never be
> used for anything. Going from 2^32 to 2^32-1 possible CommandIds doesn't
> seem like a problem to me. Imo the CommandIds should have been defined
> that way from the start.
>
> Makes some sense?

I don't have a problem with this if other people think it's a good
idea. But I think it needs a few +1s and not too many -1s first, and so
far (AFAIK) no one else has weighed in with an opinion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done
> > * benefits hot standby startup

I tried to update the patch to address the comments you made.

> > 0003 wal_decoding: Allow walsender's to connect to a specific database
> > * biggest problem is how to specify the connection we connect to.
> >   Currently with the patch walsender connects to a database if it's
> >   not named "replication" (via dbname). Perhaps it's better to invent a
> >   replication_dbname parameter?

I've updated the patch so that the "replication" startup parameter can
specify not only a boolean but also "database". In the latter case the
walsender will connect to the database specified in "dbname".

As discussed downthread, this patch doesn't have an immediate advantage
for users until the changeset extraction patch itself is applied.
Whether or not it should be applied separately is unclear.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
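Under the updated patch, the connection-time choice would look roughly
like this (connection-string keywords as discussed above; the database
name is a placeholder):

    replication=1                      (physical walsender, as today)
    replication=database dbname=mydb   (walsender attached to database "mydb")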
Attachment
On 2013-09-05 12:44:18 -0400, Robert Haas wrote:
> On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> >> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
> >>
> >> I'm still unconvinced we want this.
> >
> > Ok, so the reason for the existence of this patch is that currently
> > there is no way to represent an "unset" CommandId. This is a problem for
> > the following patches because we need to log the cmin, cmax of catalog
> > rows, and obviously there can be rows where cmax is unset.
>
> For heap tuples, we solve this problem by using flag bits. Why not
> adopt the same approach?

We can, although it makes the amount of data stored/logged slightly
larger and seems to lead to less idiomatic code to me. If there's
another -1 I'll go that way.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Kevin,

On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > 0007 wal_decoding: Add information about a tables primary key to struct RelationData
> > * Could be used in the matview refresh code

> I think you and Kevin should discuss whether this is actually the
> right way to do this. ISTM that if logical replication and
> materialized views end up selecting different approaches to this
> problem, everybody loses.

The patch we're discussing here adds a new struct RelationData field
called 'rd_primary' (should possibly be renamed) which contains
information about the "best" candidate key available for a table.

From the header comments:

    /*
     * The 'best' primary or candidate key that has been found, only set
     * correctly if RelationGetIndexList has been called/rd_indexvalid > 0.
     *
     * Indexes are chosen in the following order:
     * * Primary Key
     * * oid index
     * * the first (OID order) unique, immediate, non-partial and
     *   non-expression index over one or more NOT NULL'ed columns
     */
    Oid rd_primary;

I thought we could use that in matview.c:refresh_by_match_merge() to
select a more efficient diff if rd_primary has a valid index. In that
case you'd only need to compare that index's fields, which should result
in a more efficient plan.

Maybe it's also useful in other cases for you? If it's relevant at all,
would you like to have a different priority list than the one above?

Regards,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
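A sketch of how the matview refresh code might consult the proposed
field (rd_primary is the patch's field; the helper and its name are
hypothetical, and RelationGetIndexList()/list_free() are the existing
functions):

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /*
     * Hypothetical helper for refresh_by_match_merge(): return the 'best'
     * candidate key's index oid, or InvalidOid to fall back to comparing
     * whole rows.
     */
    static Oid
    matview_diff_index(Relation matviewRel)
    {
        /* rd_primary is only set correctly after RelationGetIndexList(). */
        list_free(RelationGetIndexList(matviewRel));

        if (OidIsValid(matviewRel->rd_primary))
            return matviewRel->rd_primary;  /* diff on this index's columns */

        return InvalidOid;                  /* compare all columns instead */
    }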
On Thu, Sep 5, 2013 at 12:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-05 12:44:18 -0400, Robert Haas wrote:
>> On Wed, Sep 4, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
>> >> > 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
>> >>
>> >> I'm still unconvinced we want this.
>> >
>> > Ok, so the reason for the existence of this patch is that currently
>> > there is no way to represent an "unset" CommandId. This is a problem for
>> > the following patches because we need to log the cmin, cmax of catalog
>> > rows, and obviously there can be rows where cmax is unset.
>>
>> For heap tuples, we solve this problem by using flag bits. Why not
>> adopt the same approach?
>
> We can, although it makes the amount of data stored/logged slightly
> larger and seems to lead to less idiomatic code to me. If there's
> another -1 I'll go that way.

OK. Consider me more of a -0 than a -1. Like I say, I don't really want
to block it; I just don't feel comfortable committing it unless a few
other people say something like "I don't see a problem with that". Or
maybe point me to relevant changeset extraction code that's going to get
messier.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> OK. Consider me more of a -0 than a -1. Like I say, I don't really
> want to block it; I just don't feel comfortable committing it unless a
> few other people say something like "I don't see a problem with that".

FWIW, I've always thought it was a wart that there wasn't a recognized
InvalidCommandId value. It was never pressing to fix it before, but if
LCR needs it, let's do so. I definitely *don't* find it cleaner to eat
up another flag bit to avoid that. We don't have many to spare.

Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1, but
I suppose we can't have that without an on-disk compatibility break.

regards, tom lane
Hi,

Thanks for weighing in.

On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > OK. Consider me more of a -0 than a -1. Like I say, I don't really
> > want to block it; I just don't feel comfortable committing it unless a
> > few other people say something like "I don't see a problem with that".
>
> FWIW, I've always thought it was a wart that there wasn't a recognized
> InvalidCommandId value. It was never pressing to fix it before, but
> if LCR needs it, let's do so.

Yes, it's a bit anomalous compared to the other types.

> I definitely *don't* find it cleaner to
> eat up another flag bit to avoid that. We don't have many to spare.

It wouldn't need to be a flag bit in any existing struct, so that's not
a problem.

> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> but I suppose we can't have that without an on-disk compatibility break.

The patch actually does change it exactly that way. My argument for that
being valid is that CommandIds don't play any role outside of their own
transaction. Now, somebody could argue that SELECT cmin, cmax can be
done outside the transaction, but those values are already pretty much
meaningless today since cmin/cmax have been merged, and they also don't
check whether the field is initialized at all.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
>> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
>> but I suppose we can't have that without an on-disk compatibility break.

> The patch actually does change it exactly that way.

Oh. I hadn't looked at the patch, but I had (mis)read what Robert said
to think that you were proposing introducing InvalidCommandId =
0xFFFFFFFF while leaving FirstCommandId alone. That would make more
sense to me as (1) it doesn't change the interpretation of anything
that's (likely to be) on disk; (2) it allows the check for overflow in
CommandCounterIncrement to not involve recovering from an *actual*
overflow. With the horsing around we've been seeing from the gcc boys
lately, I don't have a warm feeling about whether they won't break that
test someday on the grounds that "overflow is undefined behavior".

regards, tom lane
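The check shape Tom is describing, roughly as it would look in
CommandCounterIncrement() (a sketch of the eventual code; the error text
matches the message quoted later in the thread):

    /* In CommandCounterIncrement(), sketched: */
    currentCommandId += 1;
    if (currentCommandId == InvalidCommandId)
    {
        currentCommandId -= 1;      /* undo the increment */
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot have more than 2^32-2 commands in a transaction")));
    }

With InvalidCommandId at the top of the range, the test fires on
reaching the invalid value itself, so no actual unsigned wraparound is
involved.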
On Thu, Sep 5, 2013 at 11:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
>> but I suppose we can't have that without an on-disk compatibility break.
>
> The patch actually does change it exactly that way. My argument for that
> being valid is that CommandIds don't play any role outside of their own
> transaction.

Right. It seems like this should probably be noted in the documentation
under "5.4. System Columns" -- I just realized that it isn't.

--
Peter Geoghegan
On 2013-09-05 14:37:01 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> >> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> >> but I suppose we can't have that without an on-disk compatibility break.
>
> > The patch actually does change it exactly that way.
>
> Oh. I hadn't looked at the patch, but I had (mis)read what Robert said
> to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
> while leaving FirstCommandId alone. That would make more sense to me as
> (1) it doesn't change the interpretation of anything that's (likely to be)
> on disk; (2) it allows the check for overflow in CommandCounterIncrement
> to not involve recovering from an *actual* overflow. With the horsing
> around we've been seeing from the gcc boys lately

Ok, I can do it that way. LCR obviously shouldn't care.

> I don't have a warm
> feeling about whether they won't break that test someday on the grounds
> that "overflow is undefined behavior".

Unsigned overflow is pretty strictly defined, so I don't see much danger
there. Also, we'd feel the pain pretty definitely with xids...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-05 21:02:44 +0200, Andres Freund wrote:
> On 2013-09-05 14:37:01 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2013-09-05 14:21:33 -0400, Tom Lane wrote:
> > >> Ideally I'd have made InvalidCommandId = 0 and FirstCommandId = 1,
> > >> but I suppose we can't have that without an on-disk compatibility break.
> >
> > > The patch actually does change it exactly that way.
> >
> > Oh. I hadn't looked at the patch, but I had (mis)read what Robert said
> > to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
> > while leaving FirstCommandId alone. That would make more sense to me as
> > (1) it doesn't change the interpretation of anything that's (likely to be)
> > on disk; (2) it allows the check for overflow in CommandCounterIncrement
> > to not involve recovering from an *actual* overflow. With the horsing
> > around we've been seeing from the gcc boys lately
>
> Ok, I can do it that way. LCR obviously shouldn't care.

It doesn't care to the point that the patch already does exactly what
you propose. It's just my memory that remembered things differently.

So, a very slightly updated patch attached.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
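For reference, the shape the thread converged on (FirstCommandId left
alone, the invalid value at the top of the range, as Tom proposed):

    typedef uint32 CommandId;

    #define FirstCommandId      ((CommandId) 0)
    #define InvalidCommandId    (~(CommandId) 0)    /* i.e. 0xFFFFFFFF */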
On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Oh. I hadn't looked at the patch, but I had (mis)read what Robert said
>> > to think that you were proposing introducing InvalidCommandId = 0xFFFFFFFF
>> > while leaving FirstCommandId alone. That would make more sense to me as
>> > (1) it doesn't change the interpretation of anything that's (likely to be)
>> > on disk; (2) it allows the check for overflow in CommandCounterIncrement
>> > to not involve recovering from an *actual* overflow. With the horsing
>> > around we've been seeing from the gcc boys lately
>>
>> Ok, I can do it that way. LCR obviously shouldn't care.
>
> It doesn't care to the point that the patch already does exactly what
> you propose. It's just my memory that remembered things differently.
>
> So, a very slightly updated patch attached.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> So, a very slightly updated patch attached.

> Committed.

Hmm ... shouldn't this patch adjust the error messages in
CommandCounterIncrement? We just took away one possible command. It's
pretty nitpicky, especially since many utility commands do more than one
CommandCounterIncrement, but still ...

regards, tom lane
On 2013-09-09 18:43:51 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Sep 5, 2013 at 3:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> So, a very slightly updated patch attached.
>
> > Committed.
>
> Hmm ... shouldn't this patch adjust the error messages in
> CommandCounterIncrement? We just took away one possible command.
> It's pretty nitpicky, especially since many utility commands do
> more than one CommandCounterIncrement, but still ...

Hm. You're talking about "cannot have more than 2^32-2 commands in a
transaction"? If so, the patch and the commit seem to have adjusted
that?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-09-09 18:43:51 -0400, Tom Lane wrote:
>> Hmm ... shouldn't this patch adjust the error messages in
>> CommandCounterIncrement?

> Hm. You're talking about "cannot have more than 2^32-2 commands in a
> transaction"? If so, the patch and the commit seem to have adjusted that?

Oh! That's what I get for going on memory instead of re-reading the
commit. Sorry, never mind the noise.

regards, tom lane
Andres Freund <andres@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>>> 0007 wal_decoding: Add information about a tables primary key to
>>> struct RelationData
>>> * Could be used in the matview refresh code
>
>> I think you and Kevin should discuss whether this is actually the
>> right way to do this. ISTM that if logical replication and
>> materialized views end up selecting different approaches to this
>> problem, everybody loses.
>
> The patch we're discussing here adds a new struct RelationData field
> called 'rd_primary' (should possibly be renamed) which contains
> information about the "best" candidate key available for a table.
>
> From the header comments:
>
>     /*
>      * The 'best' primary or candidate key that has been found, only set
>      * correctly if RelationGetIndexList has been called/rd_indexvalid > 0.
>      *
>      * Indexes are chosen in the following order:
>      * * Primary Key
>      * * oid index
>      * * the first (OID order) unique, immediate, non-partial and
>      *   non-expression index over one or more NOT NULL'ed columns
>      */
>     Oid rd_primary;
>
> I thought we could use that in matview.c:refresh_by_match_merge() to
> select a more efficient diff if rd_primary has a valid index. In that
> case you'd only need to compare that index's fields, which should
> result in a more efficient plan.
>
> Maybe it's also useful in other cases for you?
>
> If it's relevant at all, would you like to have a different priority
> list than the one above?

My first thought was that it was necessary to use all unique, immediate,
non-partial, non-expression indexes to avoid getting errors on the
UPDATE phase of the concurrent refresh for transient duplicates; but
then I remembered that I had to give up on that and do it all with
DELETE followed by INSERT, which eliminates that risk. As things now
stand, the *existence* of any unique, non-partial, non-expression index
(note that immediate is not needed) is sufficient for correctness. We
could now even drop that, I think, if we added a duplicate check at the
end in the absence of such an index.

The reason I left it comparing columns from *all* such indexes is that
it gives the optimizer the chance to pick the one that looks fastest.
The upcoming patch can add some extra "equality" comparisons in addition
to the "identical" comparisons the current patch uses, so the mechanism
you propose might be a worthwhile optimization for some cases, as long
as it does a good job of picking *the fastest* such index. The above
method of choosing an index doesn't seem to necessarily ensure that.

Also, if you need to include the "immediate" test, it could not be used
for RMVC without "fallback" code for when this mechanism doesn't find an
appropriate index. Of course, that would satisfy those who would like to
relax the requirement for a unique index on the MV to be able to use
RMVC.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company