Thread: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS
In connection with a nearby thread, I tried to run the subscription
test suite in a CLOBBER_CACHE_ALWAYS build.  I soon found that I had
to increase wal_receiver_timeout, but after doing this:

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..5fe6810 100644
*** a/src/test/perl/PostgresNode.pm
--- b/src/test/perl/PostgresNode.pm
*************** sub init
*** 447,452 ****
--- 447,453 ----
  	print $conf "log_statement = all\n";
  	print $conf "log_replication_commands = on\n";
  	print $conf "wal_retrieve_retry_interval = '500ms'\n";
+ 	print $conf "wal_receiver_timeout = '10min'\n";

  	# If a setting tends to affect whether tests pass or fail, print it after
  	# TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting

I let it run overnight, and came back to find that it was stuck at

[03:02:15] t/013_partition.pl ................. 19/51

and had been for circa eight hours, where extrapolation from other tests
said it shouldn't take much over half an hour.
Investigation found that the subscriber was repeatedly failing like this:

2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030506) exited with exit code 1
2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030509) exited with exit code 1

The publisher's log shows no sign of distress:

2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running transactions.
2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running transactions.

I do not have time today to chase this further, but somebody should.

More generally, this seems like good evidence that we really oughta have
a buildfarm member that's running *all* the tests under
CLOBBER_CACHE_ALWAYS, not just the core tests.  That seems impossibly
expensive, but I realized while watching these tests that a ridiculous
fraction of the runtime is being spent in repeated initdb calls.  On my
machine, initdb takes about 12 minutes under CCA, so doing it separately
for publisher and subscriber means 24 minutes, which compares not at all
favorably to the circa-half-an-hour total runtime of each of the
subscription test scripts.  We're surely not learning anything after the
first CCA run of initdb, so if we could find a way to skip that overhead
for later runs, it'd make a huge difference in the practicality of
running these tests under CCA.

I recall having worked on a patch to make the regression tests run
initdb just once, creating a template directory tree, and then "cp -a"
that into place for each test.  I did not finish it, because it wasn't
showing a lot of advantage in a normal test run, but maybe the idea
could be resurrected for CCA and other slow builds.
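In sketch form, that template idea could look like the following. This is not the unfinished patch; the template path, the --no-sync flag, and the fallback stanza (so the sketch still runs where initdb is absent) are all illustrative assumptions.

```shell
#!/bin/sh
# Pay the expensive initdb (circa 12 minutes under CCA) only once;
# afterwards every test node's data directory is just a copy.
TEMPLATE="${TMPDIR:-/tmp}/cca_initdb_template"
TARGET="./tmp_check/node_data"

if [ ! -f "$TEMPLATE/PG_VERSION" ]; then
    if command -v initdb >/dev/null 2>&1 && initdb --no-sync -D "$TEMPLATE"; then
        :   # real cluster initialized once
    else
        # Stand-in marker so the sketch is self-contained without initdb.
        mkdir -p "$TEMPLATE" && echo 13 > "$TEMPLATE/PG_VERSION"
    fi
fi

# "cp -a" preserves permissions and timestamps; it takes seconds, not minutes.
rm -rf "$TARGET"
mkdir -p "$(dirname "$TARGET")"
cp -a "$TEMPLATE" "$TARGET"
echo "cloned template into $TARGET"
```

Each per-test "initdb" then degenerates into the cheap copy at the bottom, which is where the claimed 24-minutes-per-script saving would come from.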
Another idea is to make CCA a little more dynamic, say allow it to be
suppressed through an environment variable setting, and then use that
to speed up per-test initdbs.

			regards, tom lane
Hi Tom,
I have tested the subscription test 013_partition.pl with CCA enabled on HEAD and PG13,
and I can reproduce the issue on both versions.
Logs:
[centos@clobber-cache subscription]$ git branch
* REL_13_STABLE
master
[centos@clobber-cache-db93 subscription]$ tail -f tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG: logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR: cache lookup failed for relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG: background worker "logical replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG: logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR: cache lookup failed for relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG: background worker "logical replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG: logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR: cache lookup failed for relation 0
Thanks.
--
Regards,
Neha Sharma
On Mon, Sep 14, 2020 at 8:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> [ $subject ]

I found some time to trace this down, and what it turns out to be is
that apply_handle_truncate() is making use of a LogicalRepRelMapEntry's
localreloid field without any consideration for the possibility that
that's been set to zero as a result of a cache flush.  The visible
symptom of "cache lookup failed for relation 0" comes from trying to
invoke find_all_inheritors with a zero OID.

Now, study of apply_handle_truncate doesn't immediately reveal where a
cache flush could have occurred, but I realized that it's actually
possible that the LogicalRepRelMapEntry is *already* marked invalid
when logicalrep_rel_open() returns!  That's because for some reason it
does GetSubscriptionRelState last, after it's already marked the entry
valid, and that function does plenty o' catalog accesses.

It's not really clear to me why setting localreloid to zero is a sane
way to represent "this entry needs to be revalidated".  I think a
separate flag would be more appropriate.  Once we have lock on the
target relation, it seems to me that no interesting changes should be
possible as long as we have lock; so there's no very good reason to
destroy useful state to remind ourselves that we should recheck it
next time.

			regards, tom lane
I wrote:
> It's not really clear to me why setting localreloid to zero is a sane
> way to represent "this entry needs to be revalidated".  I think a
> separate flag would be more appropriate.  Once we have lock on the
> target relation, it seems to me that no interesting changes should
> be possible as long as we have lock; so there's no very good reason
> to destroy useful state to remind ourselves that we should recheck
> it next time.

Here's a patch that changes that, and also cleans up some sloppy
thinking about how to re-acquire lock on the replication target
relation.  (Just because the OID was valid last you heard does not
mean that table_open is guaranteed to succeed.)

With this, we get through 013_partition.pl under CCA.  I plan to try
to run all of subscription/ and recovery/ before concluding there's
nothing else to fix, though.

			regards, tom lane

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 3d2d56295b..9ee70a2563 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -77,7 +77,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
 		{
 			if (entry->localreloid == reloid)
 			{
-				entry->localreloid = InvalidOid;
+				entry->localrelvalid = false;
 				hash_seq_term(&status);
 				break;
 			}
@@ -91,7 +91,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
 		hash_seq_init(&status, LogicalRepRelMap);

 		while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
-			entry->localreloid = InvalidOid;
+			entry->localrelvalid = false;
 	}
 }
@@ -230,15 +230,13 @@ logicalrep_rel_att_by_name(LogicalRepRelation *remoterel, const char *attname)
 /*
  * Open the local relation associated with the remote one.
  *
- * Optionally rebuilds the Relcache mapping if it was invalidated
- * by local DDL.
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
  */
 LogicalRepRelMapEntry *
 logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
-	Oid			relid = InvalidOid;
 	LogicalRepRelation *remoterel;
@@ -254,14 +252,45 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 	remoterel = &entry->remoterel;

+	/* Ensure we don't leak a relcache refcount. */
+	if (entry->localrel)
+		elog(ERROR, "remote relation ID %u is already open", remoteid);
+
 	/*
 	 * When opening and locking a relation, pending invalidation messages are
-	 * processed which can invalidate the relation.  We need to update the
-	 * local cache both when we are first time accessing the relation and when
-	 * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+	 * processed which can invalidate the relation.  Hence, if the entry is
+	 * currently considered valid, try to open the local relation by OID and
+	 * see if invalidation ensues.
+	 */
+	if (entry->localrelvalid)
+	{
+		entry->localrel = try_table_open(entry->localreloid, lockmode);
+		if (!entry->localrel)
+		{
+			/* Table was renamed or dropped. */
+			entry->localrelvalid = false;
+		}
+		else if (!entry->localrelvalid)
+		{
+			/* Note we release the no-longer-useful lock here. */
+			table_close(entry->localrel, lockmode);
+			entry->localrel = NULL;
+		}
+	}
+
+	/*
+	 * If the entry has been marked invalid since we last had lock on it,
+	 * re-open the local relation by name and rebuild all derived data.
 	 */
-	if (!OidIsValid(entry->localreloid))
+	if (!entry->localrelvalid)
 	{
+		Oid			relid;
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Try to find and lock the relation by name. */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
@@ -272,21 +301,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 					errmsg("logical replication target relation \"%s.%s\" does not exist",
 						   remoterel->nspname, remoterel->relname)));
 		entry->localrel = table_open(relid, NoLock);
-
-	}
-	else
-	{
-		relid = entry->localreloid;
-		entry->localrel = table_open(entry->localreloid, lockmode);
-	}
-
-	if (!OidIsValid(entry->localreloid))
-	{
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		MemoryContext oldctx;
-		int			i;
+		entry->localreloid = relid;

 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
@@ -380,7 +395,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 			}
 		}

-		entry->localreloid = relid;
+		entry->localrelvalid = true;
 	}

 	if (entry->state != SUBREL_STATE_READY)
@@ -523,7 +538,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid reloid)
 		{
 			if (entry->localreloid == reloid)
 			{
-				entry->localreloid = InvalidOid;
+				entry->localrelvalid = false;
 				hash_seq_term(&status);
 				break;
 			}
@@ -537,7 +552,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid reloid)
 		hash_seq_init(&status, LogicalRepPartMap);

 		while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
-			entry->localreloid = InvalidOid;
+			entry->localrelvalid = false;
 	}
 }
@@ -656,6 +671,8 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
 	entry->updatable = root->updatable;

+	entry->localrelvalid = true;
+
 	/* state and statelsn are left set to 0. */

 	MemoryContextSwitchTo(oldctx);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index a6b44b12bd..62ddd3c7a2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -19,9 +19,16 @@ typedef struct LogicalRepRelMapEntry
 {
 	LogicalRepRelation remoterel;	/* key is remoterel.remoteid */

-	/* Mapping to local relation, filled as needed. */
+	/*
+	 * Validity flag -- when false, revalidate all derived info at next
+	 * logicalrep_rel_open.  (While the localrel is open, we assume our lock
+	 * on that rel ensures the info remains good.)
+	 */
+	bool		localrelvalid;
+
+	/* Mapping to local relation. */
 	Oid			localreloid;	/* local relation id */
-	Relation	localrel;		/* relcache entry */
+	Relation	localrel;		/* relcache entry (NULL when closed) */
 	AttrMap    *attrmap;		/* map of local attributes to remote ones */
 	bool		updatable;		/* Can apply updates/deletes? */
I wrote:
> With this, we get through 013_partition.pl under CCA.  I plan to
> try to run all of subscription/ and recovery/ before concluding
> there's nothing else to fix, though.

Looks like the rest passes.

FTR, it was possible to get through subscription/ in about 2 hours on
my workstation, and recovery/ in about 30 minutes, after hacking things
so that initdb used a non-CCA backend.  So that's definitely a
promising thing to look at if anyone wants to try to do this on a
regular basis.

Aside from increasing wal_receiver_timeout as previously mentioned, I
found I had to set a higher pg_ctl start timeout using PGCTLTIMEOUT in
order to get through the recovery tests.

			regards, tom lane
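Put together, the environment for such a run amounts to something like the sketch below. The timeout values are the ones reported in this thread rather than hard minimums, and the conf-file path is arbitrary; TEMP_CONFIG and PGCTLTIMEOUT are the standard hooks the test harness already honors.

```shell
# Extra GUCs that the TAP tests pick up via TEMP_CONFIG.
cat > /tmp/cca_extra.conf <<'EOF'
wal_receiver_timeout = '10min'
EOF
export TEMP_CONFIG=/tmp/cca_extra.conf

# pg_ctl's default 60-second start timeout is far too short under CCA.
export PGCTLTIMEOUT=900

# Then, from a CLOBBER_CACHE_ALWAYS build tree, one would run e.g.:
#   make -C src/test/subscription check
#   make -C src/test/recovery check
echo "TEMP_CONFIG=$TEMP_CONFIG PGCTLTIMEOUT=$PGCTLTIMEOUT"
```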
On Wed, Sep 16, 2020 at 1:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > It's not really clear to me why setting localreloid to zero is a sane
> > way to represent "this entry needs to be revalidated".  I think a
> > separate flag would be more appropriate.  Once we have lock on the
> > target relation, it seems to me that no interesting changes should
> > be possible as long as we have lock; so there's no very good reason
> > to destroy useful state to remind ourselves that we should recheck
> > it next time.
>

So, can we assume that the current code can only cause the problem in
CCA builds but not in any practical scenario, because after having a
lock on the relation there probably shouldn't be any invalidation that
leads to this problem?

--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
> So, can we assume that the current code can only cause the problem in
> CCA builds but not in any practical scenario, because after having a
> lock on the relation there probably shouldn't be any invalidation that
> leads to this problem?

No.  The reason we expend so much time and effort on CCA testing is
that cache flushes are unpredictable, and they can happen even when you
have a lock on the object(s) in question.

In particular, the easiest real-world case that could cause the
described problem is an sinval queue overrun that we detect during the
GetSubscriptionRelState call at the bottom of logicalrep_rel_open.  In
that case we'll come back to the caller with the LogicalRepRelMapEntry
marked as already needing revalidation, because the cache inval
callback will have marked *all* of them that way.  That's actually a
harmless condition, because we have lock on the rel so nothing really
changed ... but if we blew away localreloid and the caller needs to use
that, kaboom.

We could imagine marking the entry valid at the very bottom of
logicalrep_rel_open, but that just moves the problem somewhere else.
Any caller that does *any* catalog access while holding open a
LogicalRepRelMapEntry would not be able to rely on its localreloid
staying valid.  That's a recipe for irreproducible bugs, and it's
unnecessary.  In practice the entry is good as long as we continue to
hold a lock on the local relation.  So we should mark
LogicalRepRelMapEntries as potentially-needing-revalidation in a way
that doesn't interfere with active users of the entry.

In short: the value of CCA testing is to model sinval overruns
happening at any point where they could happen.  The real-world odds
of one happening at any given instant are low, but they're never zero.

			regards, tom lane
On Wed, Sep 16, 2020 at 9:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > So, can we assume that the current code can only cause the problem in
> > CCA builds but not in any practical scenario, because after having a
> > lock on the relation there probably shouldn't be any invalidation that
> > leads to this problem?
>
> In short: the value of CCA testing is to model sinval overruns
> happening at any point where they could happen.  The real-world
> odds of one happening at any given instant are low, but they're
> never zero.
>

Thanks for the explanation.  I have read your patch and it looks good
to me.

--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
> Thanks for the explanation.  I have read your patch and it looks good
> to me.

Pushed, thanks for checking the patch.

			regards, tom lane