Re: 001_rep_changes.pl fails due to publisher stuck on shutdown - Mailing list pgsql-hackers

From Peter Smith
Subject Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Msg-id CAHut+PtZk8Q3k_gymTqkiBueB=BLAXBuhRfvvbc3wstXg7bzUA@mail.gmail.com
In response to 001_rep_changes.pl fails due to publisher stuck on shutdown  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
List pgsql-hackers
Hi, I have reproduced this multiple times now.

I confirmed the initial post/steps from Alexander, i.e., the test
script provided [1] gets itself into a state where the function
ReadPageInternal() (called by XLogDecodeNextRecord() at the call
commented "Wait for the next page to become available") constantly
returns XLREAD_FAIL. Ultimately the test times out because
WalSndLoop() loops forever, since it never calls WalSndDone() to exit
the walsender process.
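
To make the shape of the hang concrete, here is a toy model in Python.
It is NOT PostgreSQL code: the function names only loosely mirror the
ones mentioned above, the status values are illustrative, and the
iteration cap is an assumption added purely so the example terminates.
The real exit conditions in WalSndLoop() are more involved than this.

```python
# Toy model only; hypothetical stand-ins, not the PostgreSQL implementation.
XLREAD_SUCCESS = 0   # illustrative status values
XLREAD_FAIL = -1


def read_page_internal(page_available):
    """Stand-in for a page read: fails whenever the next page is not available."""
    return XLREAD_SUCCESS if page_available else XLREAD_FAIL


def walsender_loop(page_available, max_iterations=1000):
    """Return the iteration on which the loop exits, or None if it never does.

    The cap (max_iterations) is an assumption of this sketch; the stuck
    walsender described above has no such cap and simply spins.
    """
    for iteration in range(1, max_iterations + 1):
        if read_page_internal(page_available) == XLREAD_SUCCESS:
            # Toy equivalent of reaching the "done" call and exiting.
            return iteration
        # Read failed: nothing changes, so the next iteration fails the same way.
    return None
```

With page_available=False the read fails identically on every pass, so
the exit branch is never reached, which is the pattern the test run
ends up in.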

~~~

I've made a patch to inject lots of logging, and when the test script
fails a cycle of function failures can be seen. I don't know how to
fix it yet, so I'm attaching my log results, hoping the information
may be useful for anyone familiar with this area of the code.

~~~

Attachment #1 "v1-0001-DEBUG-LOGGING.patch" -- Patch to inject some
logging. Be careful if you apply this because the resulting log files
can be huge (e.g. 3G)

Attachment #2 "bad8_logs_last500lines.txt" -- This is the last 500
lines of a 3G logfile from a failing test run.

Attachment #3 "bad8_logs_last500lines-simple.txt" -- Same log file as
above, but it's a simplified extract in which I showed the CYCLES of
failure more clearly.

Attachment #4 "bad8_digram" -- Same execution path information as in
the log files, but in diagram form (just to help me visualise the
logic more easily).
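
For anyone wanting to cut a similar tail out of their own very large
log, a minimal sketch follows. It streams the file line by line and
keeps only the last N lines in memory, so a multi-gigabyte log is not
loaded whole. The function name and the default of 500 are just
choices for this example; this is not necessarily how the attached
extract was produced.

```python
from collections import deque


def last_lines(path, n=500):
    """Return the last n lines of the file at path as a list of strings.

    The file is read as a stream; the bounded deque discards older lines
    as newer ones arrive, so memory use is proportional to n.
    """
    with open(path, "r", errors="replace") as f:
        return list(deque(f, maxlen=n))
```

The result can then be written out to a smaller file for attaching or
closer reading.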

~~~

Just so you know, the test script does not always cause the problem.
Sometimes it happens after just 20 script iterations. Or, sometimes it
takes a very long time and multiple runs (e.g. 400-500 script
iterations). Either way, when the problem eventually occurs, the CYCLES
of the ReadPageInternal() failures always have the same pattern
shown in these attached logs.
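
Because the failure is intermittent, it helps to drive the test in a
loop and stop at the first bad iteration. The sketch below shows that
idea only; it is not the script from [1]. Both run_until_failure and
the run_once callable are hypothetical names for this example, and
run_once is assumed to return True when one iteration passes.

```python
def run_until_failure(run_once, max_iterations=500):
    """Call run_once(iteration) repeatedly until it reports a failure.

    Returns the 1-based iteration number of the first failure, or None
    if all max_iterations iterations passed.
    """
    for iteration in range(1, max_iterations + 1):
        if not run_once(iteration):
            return iteration
    return None
```

In practice run_once would wrap whatever command runs a single test
iteration and report whether it succeeded; a return value of None just
means the problem did not show up within the chosen limit, so another
run may be needed.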

======
[1] OP - https://www.postgresql.org/message-id/f15d665f-4cd1-4894-037c-afdbe369287e%40gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
