
From: Tom Lane
Subject: Re: subscriptionCheck failures on nightjar
Msg-id: 27965.1550077052@sss.pgh.pa.us
In response to: Re: subscriptionCheck failures on nightjar (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Mon, Feb 11, 2019 at 7:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC:  could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory

> <pokes at totally unfamiliar code>

> They get atomically renamed into place, which seems kosher even if
> snapshots for the same LSN are created concurrently by different
> backends (and tracing syscalls confirms that that does occasionally
> happen).  It's hard to believe that nightjar's rename() ceased to be
> atomic a couple of months ago.  It looks like the only way for files
> to get unlinked after that is by CheckPointSnapBuild() deciding they
> are too old.

> Hmm.  Could this be relevant, and cause a well timed checkpoint to
> unlink files too soon?
> 2019-02-12 21:52:58.304 EST [22922] WARNING:  out of logical replication worker slots

I've managed to reproduce this locally, and obtained this PANIC:

log/010_truncate_publisher.log:2019-02-13 11:29:04.759 EST [9973] sub3 PANIC:  could not open file "pg_logical/snapshots/0-16067B0.snap": No such file or directory

with this stack trace:

#0  0x0000000801ab610c in kill () from /lib/libc.so.7
#1  0x0000000801ab4d3b in abort () from /lib/libc.so.7
#2  0x000000000089202e in errfinish (dummy=Variable "dummy" is not available.
) at elog.c:552
#3  0x000000000075d339 in fsync_fname_ext (
    fname=0x7fffffffba20 "pg_logical/snapshots/0-16067B0.snap", isdir=Variable "isdir" is not available.
)
    at fd.c:3372
#4  0x0000000000730c75 in SnapBuildSerialize (builder=0x80243c118,
    lsn=23095216) at snapbuild.c:1669
#5  0x0000000000731297 in SnapBuildProcessRunningXacts (builder=0x80243c118,
    lsn=23095216, running=0x8024424f0) at snapbuild.c:1110
#6  0x0000000000722eac in LogicalDecodingProcessRecord (ctx=0x802414118,
    record=0x8024143d8) at decode.c:318
#7  0x0000000000736f30 in XLogSendLogical () at walsender.c:2826
#8  0x00000000007389c7 in WalSndLoop (send_data=0x736ed0 <XLogSendLogical>)
    at walsender.c:2184
#9  0x0000000000738c3d in StartLogicalReplication (cmd=Variable "cmd" is not available.
) at walsender.c:1118
#10 0x0000000000739895 in exec_replication_command (
    cmd_string=0x80240e118 "START_REPLICATION SLOT \"sub3\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub3\"')") at walsender.c:1536
#11 0x000000000078b272 in PostgresMain (argc=-14464, argv=Variable "argv" is not available.
) at postgres.c:4252
#12 0x00000000007113fa in PostmasterMain (argc=-14256, argv=0x7fffffffcc60)
    at postmaster.c:4382

So the problem seems to boil down to "somebody removed the snapshot
file between the time we renamed it into place and the time we tried
to fsync it".

I do find messages like the one you mention, but they are on the
subscriber side, not the publisher side, so I'm not sure whether this
matches the scenario you have in mind.

log/010_truncate_subscriber.log:2019-02-13 11:29:02.343 EST [9970] WARNING:  out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:02.344 EST [9970] WARNING:  out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:03.401 EST [9970] WARNING:  out of logical replication worker slots

Anyway, I think we might be able to fix this along the lines of

    CloseTransientFile(fd);

+   /* ensure snapshot file is down to stable storage */
+   fsync_fname(tmppath, false);
    fsync_fname("pg_logical/snapshots", true);

    /*
     * We may overwrite the work from some other backend, but that's ok, our
     * snapshot is valid as well, we'll just have done some superfluous work.
     */
    if (rename(tmppath, path) != 0)
    {
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",
                        tmppath, path)));
    }

-   /* make sure we persist */
-   fsync_fname(path, false);
+   /* ensure we persist the file rename */
    fsync_fname("pg_logical/snapshots", true);

The existing code here seems simply wacky/unsafe to me regardless
of this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes
after persisting the rename but before it's pushed the data out?

I also wonder why bother with the directory sync just before the
rename.
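
In plain POSIX terms, the ordering I'd expect is the following (just a
sketch, with a made-up helper name rather than fd.c's real API and
error reporting):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    durable_rename_sketch(const char *tmppath, const char *path,
                          const char *dir)
    {
        int         fd;

        /* 1. Sync the data while the file still has its temp name, so
         * a crash can never expose unsynced contents under the final
         * name. */
        fd = open(tmppath, O_WRONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);

        /* 2. Atomically publish the file under its final name. */
        if (rename(tmppath, path) != 0)
            return -1;

        /* 3. Make the rename itself durable by syncing the directory.
         * Note we never reopen "path", so a concurrent unlink by a
         * checkpoint can't produce the ENOENT panic seen above. */
        fd = open(dir, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }

Once the data is known durable under the temporary name, the
post-rename step only needs to touch the directory, never the file
itself.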

            regards, tom lane


pgsql-hackers by date:

Previous
From: Abhishek Agrawal
Date:
Subject: Regarding participating in GSOC 2019 with PostgreSQL
Next
From: Andres Freund
Date:
Subject: Re: reducing isolation tests runtime