Hi,
On 2019-09-19 17:20:15 +0530, Kuntal Ghosh wrote:
> It seems there is a pattern how the error is occurring in different
> systems. Following are the relevant log snippets:
>
> nightjar:
> sub3 LOG: received replication command: CREATE_REPLICATION_SLOT
> "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> sub3 LOG: logical decoding found consistent point at 0/160B578
> sub1 PANIC: could not open file
> "pg_logical/snapshots/0-160B578.snap": No such file or directory
>
> dromedary scenario 1:
> sub3_16414_sync_16399 LOG: received replication command:
> CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
> pgoutput USE_SNAPSHOT
> sub3_16414_sync_16399 LOG: logical decoding found consistent point at 0/15EA694
> sub2 PANIC: could not open file
> "pg_logical/snapshots/0-15EA694.snap": No such file or directory
>
>
> dromedary scenario 2:
> sub3_16414_sync_16399 LOG: received replication command:
> CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
> pgoutput USE_SNAPSHOT
> sub3_16414_sync_16399 LOG: logical decoding found consistent point at 0/15EA694
> sub1 PANIC: could not open file
> "pg_logical/snapshots/0-15EA694.snap": No such file or directory
>
> While subscription 3 is created, it eventually reaches to a consistent
> snapshot point and prints the WAL location corresponding to it. It
> seems sub1/sub2 immediately fails to serialize the snapshot to the
> .snap file having the same WAL location.
Since a number of people (myself included) have now failed to reproduce
this locally, I propose that we increase the log level for this test on
master, and perhaps expand the set of debugging information. The hope is
that the additional information from the failures seen on the buildfarm
helps us build a reproducer or, even better, diagnose the issue
directly. If people agree, I'll come up with a patch.
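
For anyone matching up the log lines quoted above: the name of the
missing .snap file is just the reported consistent-point LSN with the
"/" replaced by "-". A minimal sketch of that mapping, inferred only
from the quoted log output (the helper names parse_lsn and
snapshot_path are made up for illustration, not taken from the tree):

```python
def parse_lsn(text: str) -> int:
    """Turn an LSN printed as 'hi/lo' (both hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def snapshot_path(lsn: int) -> str:
    """Build the snapshot file path as it appears in the PANIC messages.

    Assumption: upper-case hex, no zero padding, high and low 32 bits
    joined by '-', as seen in the quoted logs.
    """
    return "pg_logical/snapshots/%X-%X.snap" % (lsn >> 32, lsn & 0xFFFFFFFF)


if __name__ == "__main__":
    # LSNs taken from the nightjar and dromedary log snippets.
    for text in ("0/160B578", "0/15EA694"):
        print(text, "->", snapshot_path(parse_lsn(text)))
```

This makes it easy to grep a buildfarm log for the consistent point that
corresponds to a given missing file.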
Greetings,
Andres Freund