Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

From Neha Sharma
Subject Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS
Msg-id CANiYTQvsak+efQtnuiBNwKwmnbtCur84-jHrf1mErH7LrW66JQ@mail.gmail.com
In response to Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

Hi Tom,

I have tested the subscription test 013_partition.pl with CCA enabled on
HEAD and PG13, and I can reproduce the issue on both versions.
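
For reference, CCA is switched on for the whole build at compile time,
typically by passing CPPFLAGS=-DCLOBBER_CACHE_ALWAYS to configure.  Roughly,
a CCA build adds the following at every point where invalidation messages
are accepted (a from-memory sketch of the inval.c logic, not the verbatim
source):

#ifdef CLOBBER_CACHE_ALWAYS
    {
        static bool in_recursion = false;

        /* Discard all relcache/catcache entries at each opportunity,
         * guarding against recursing while the caches are rebuilt.
         * Code that holds a cache entry across such a point fails fast. */
        if (!in_recursion)
        {
            in_recursion = true;
            InvalidateSystemCaches();
            in_recursion = false;
        }
    }
#endif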

Logs:
[centos@clobber-cache subscription]$ git branch
* REL_13_STABLE
  master
[centos@clobber-cache-db93 subscription]$ tail -f tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG:  background worker "logical replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG:  background worker "logical replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR:  cache lookup failed for relation 0


Thanks.
--
Regards,
Neha Sharma


On Mon, Sep 14, 2020 at 8:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In connection with a nearby thread, I tried to run the subscription
test suite in a CLOBBER_CACHE_ALWAYS build.  I soon found that I had
to increase wal_receiver_timeout, but after doing this:

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..5fe6810 100644
*** a/src/test/perl/PostgresNode.pm
--- b/src/test/perl/PostgresNode.pm
*************** sub init
*** 447,452 ****
--- 447,453 ----
    print $conf "log_statement = all\n";
    print $conf "log_replication_commands = on\n";
    print $conf "wal_retrieve_retry_interval = '500ms'\n";
+   print $conf "wal_receiver_timeout = '10min'\n";

    # If a setting tends to affect whether tests pass or fail, print it after
    # TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting

I let it run overnight, and came back to find that it was stuck at

[03:02:15] t/013_partition.pl ................. 19/51

and had been for circa eight hours, where extrapolation from other tests
said it shouldn't take much over half an hour.  Investigation found that
the subscriber was repeatedly failing like this:

2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030506) exited with exit code 1
2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030509) exited with exit code 1

The publisher's log shows no sign of distress:

2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running transactions.
2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running transactions.
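
For what it's worth, the "relation 0" part of the error is itself a clue:
it means an InvalidOid reached a syscache probe, so the interesting question
is what left the relation OID unset rather than where the lookup failed.
The message almost always comes from the standard idiom sketched here
(relid stands in for whatever variable the failing site actually uses):

HeapTuple   tuple;

/* probe the relation syscache by OID; relid == 0 (InvalidOid) can never match */
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for relation %u", relid);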

I do not have time today to chase this further, but somebody should.

More generally, this seems like good evidence that we really oughta have a
buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS,
not just the core tests.  That seems impossibly expensive, but I realized
while watching these tests that a ridiculous fraction of the runtime is
being spent in repeated initdb calls.  On my machine, initdb takes about
12 minutes under CCA, so doing it separately for publisher and subscriber
means 24 minutes, which compares not at all favorably to the
circa-half-an-hour total runtime of each of the subscription test scripts.
We're surely not learning anything after the first CCA run of initdb, so
if we could find a way to skip that overhead for later runs, it'd make a
huge difference in the practicality of running these tests under CCA.

I recall having worked on a patch to make the regression tests run
initdb just once, creating a template directory tree, and then "cp -a"
that into place for each test.  I did not finish it, because it wasn't
showing a lot of advantage in a normal test run, but maybe the idea
could be resurrected for CCA and other slow builds.

Another idea is to make CCA a little more dynamic, say allow it to be
suppressed through an environment variable setting, and then use that
to speed up per-test initdbs.
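
A minimal sketch of that idea, with an invented PG_SUPPRESS_CLOBBER_CACHE
variable (none of this is existing PostgreSQL code): the CCA path would
consult a once-per-process check before discarding caches, and the test
harness would export the variable just around its initdb calls.

#include <stdlib.h>
#include <stdbool.h>

/*
 * Hypothetical: gate cache clobbering on an environment variable, checked
 * once per process.  A CCA build would call this before each forced
 * InvalidateSystemCaches(), so initdb could run at normal speed while the
 * tests proper still clobber.
 */
static bool
clobber_cache_enabled(void)
{
    static int  enabled = -1;       /* -1: not yet determined */

    if (enabled < 0)
        enabled = (getenv("PG_SUPPRESS_CLOBBER_CACHE") == NULL) ? 1 : 0;
    return (enabled == 1);
}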

                        regards, tom lane

