Re: subscriptionCheck failures on nightjar - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: subscriptionCheck failures on nightjar
Date:
Msg-id: 27965.1550077052@sss.pgh.pa.us
In response to: Re: subscriptionCheck failures on nightjar (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: subscriptionCheck failures on nightjar
List: pgsql-hackers
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Mon, Feb 11, 2019 at 7:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC:  could not open file
>> "pg_logical/snapshots/0-160B578.snap": No such file or directory

> <pokes at totally unfamiliar code>

> They get atomically renamed into place, which seems kosher even if
> snapshots for the same LSN are created concurrently by different
> backends (and tracing syscalls confirms that that does occasionally
> happen).  It's hard to believe that nightjar's rename() ceased to be
> atomic a couple of months ago.  It looks like the only way for files
> to get unlinked after that is by CheckPointSnapBuild() deciding they
> are too old.

> Hmm.  Could this be relevant, and cause a well timed checkpoint to
> unlink files too soon?

> 2019-02-12 21:52:58.304 EST [22922] WARNING:  out of logical
> replication worker slots

I've managed to reproduce this locally, and obtained this PANIC:

log/010_truncate_publisher.log:2019-02-13 11:29:04.759 EST [9973] sub3 PANIC:  could not open file "pg_logical/snapshots/0-16067B0.snap": No such file or directory

with this stack trace:

#0  0x0000000801ab610c in kill () from /lib/libc.so.7
#1  0x0000000801ab4d3b in abort () from /lib/libc.so.7
#2  0x000000000089202e in errfinish (dummy=Variable "dummy" is not available.
) at elog.c:552
#3  0x000000000075d339 in fsync_fname_ext (
    fname=0x7fffffffba20 "pg_logical/snapshots/0-16067B0.snap",
    isdir=Variable "isdir" is not available.
) at fd.c:3372
#4  0x0000000000730c75 in SnapBuildSerialize (builder=0x80243c118,
    lsn=23095216) at snapbuild.c:1669
#5  0x0000000000731297 in SnapBuildProcessRunningXacts (builder=0x80243c118,
    lsn=23095216, running=0x8024424f0) at snapbuild.c:1110
#6  0x0000000000722eac in LogicalDecodingProcessRecord (ctx=0x802414118,
    record=0x8024143d8) at decode.c:318
#7  0x0000000000736f30 in XLogSendLogical () at walsender.c:2826
#8  0x00000000007389c7 in WalSndLoop (send_data=0x736ed0 <XLogSendLogical>)
    at walsender.c:2184
#9  0x0000000000738c3d in StartLogicalReplication (cmd=Variable "cmd" is not available.
) at walsender.c:1118
#10 0x0000000000739895 in exec_replication_command (
    cmd_string=0x80240e118 "START_REPLICATION SLOT \"sub3\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub3\"')")
    at walsender.c:1536
#11 0x000000000078b272 in PostgresMain (argc=-14464,
    argv=Variable "argv" is not available.
) at postgres.c:4252
#12 0x00000000007113fa in PostmasterMain (argc=-14256, argv=0x7fffffffcc60)
    at postmaster.c:4382

So the problem seems to boil down to "somebody removed the snapshot file
between the time we renamed it into place and the time we tried to fsync
it".

I do find messages like the one you mention, but they are on the
subscriber not the publisher side, so I'm not sure if this matches the
scenario you have in mind?
log/010_truncate_subscriber.log:2019-02-13 11:29:02.343 EST [9970] WARNING:  out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:02.344 EST [9970] WARNING:  out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:03.401 EST [9970] WARNING:  out of logical replication worker slots

Anyway, I think we might be able to fix this along the lines of

     CloseTransientFile(fd);

+    /* ensure snapshot file is down to stable storage */
+    fsync_fname(tmppath, false);
     fsync_fname("pg_logical/snapshots", true);

     /*
      * We may overwrite the work from some other backend, but that's ok, our
      * snapshot is valid as well, we'll just have done some superfluous work.
      */
     if (rename(tmppath, path) != 0)
     {
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not rename file \"%s\" to \"%s\": %m",
                         tmppath, path)));
     }

-    /* make sure we persist */
-    fsync_fname(path, false);
+    /* ensure we persist the file rename */
     fsync_fname("pg_logical/snapshots", true);

The existing code here seems simply wacky/unsafe to me regardless of
this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes after
persisting the rename but before it's pushed the data out?  I also
wonder why bother with the directory sync just before the rename.

            regards, tom lane