Thread: Using XLogFileNameP in critical section
Hi, I encountered that the assertion error is reported instead of a proper PANIC message when failed to fsync WAL. The cause is that there are multiple places where we call XLogFileNameP function that calls palloc during critical section, for example XLogWrite function. TRAP: FailedAssertion("CritSectionCount == 0 || (context)->allowInCritSection", File: "mcxt.c", Line: 956) As far as I can see there are five places we need to fix.I've attached a patch. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Nov 29, 2019 at 06:44:58PM +0100, Masahiko Sawada wrote: > I encountered that the assertion error is reported instead of a proper > PANIC message when failed to fsync WAL. The cause is that there are > multiple places where we call XLogFileNameP function that calls palloc > during critical section, for example XLogWrite function. > > TRAP: FailedAssertion("CritSectionCount == 0 || > (context)->allowInCritSection", File: "mcxt.c", Line: 956) > > As far as I can see there are five places we need to fix. I've > attached a patch. + msg = "could not fdatasync file \"%s\": %m"; Missing some translations, no? You are missing a couple of code paths in walreceiver.c, XLogWalRcvWrite(), where XLogFileNameP is used on a PANIC. This brings me the following points: 1) If you look closely, all the callers of XLogFileNameP() are used for the generation of error strings. 2) I am ready to bet that we'll have the same discussion in the future because somebody will make the same mistake for a new code path. I think that we had better just get rid of XLogFileNameP() (on HEAD) and just change those code paths so as they use a buffer of size MAXFNAMELEN, with XLogFileName() generating the file name. This leads actually to some simplifications, see for example XLogWalRcvWrite.. -- Michael
Attachment
On 2019-Dec-02, Michael Paquier wrote: > I think that we had better just get rid of XLogFileNameP() (on HEAD) > and just change those code paths so as they use a buffer of size > MAXFNAMELEN, with XLogFileName() generating the file name. This leads > actually to some simplifications, see for example XLogWalRcvWrite.. +1, that function is a footgun and there's no possible way to fix it so that it's never misused(*). Also, put a comment atop XLogFileName indicating never to create such a "helper" function again. (*) unless there was a way to have a StaticAssert() that the function is not called in a critical section -- have the compiler work for us. I don't think it's possible to write such a thing. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2 Dec 2019 at 16:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Dec-02, Michael Paquier wrote: > > > I think that we had better just get rid of XLogFileNameP() (on HEAD) > > and just change those code paths so as they use a buffer of size > > MAXFNAMELEN, with XLogFileName() generating the file name. This leads > > actually to some simplifications, see for example XLogWalRcvWrite.. > > +1, that function is a footgun and there's no possible way to fix it so > that it's never misused(*). Also, put a comment atop XLogFileName > indicating never to create such a "helper" function again. > > (*) unless there was a way to have a StaticAssert() that the function is > not called in a critical section -- have the compiler work for us. I > don't think it's possible to write such a thing. Agreed. I've attached the updated version patch. Please review it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote: > Agreed. I've attached the updated version patch. Please review it. Thanks, applied on HEAD after a few edits. gettext() does not set errno, so the new style of issue_xlog_fsync() is actually fine. Please note that there was one mistake in the patch: you forgot to assign back errno in assign_xlog_sync_method() after generating the file name. Per the low probability of the failures, I did not backpatch that stuff. I quickly looked at applying that further down, and attached is a version for v12 FWIW, and I suspect much more conflicts the more you go down (wal segment size added in 11, different code paths for replication, etc.). -- Michael
Attachment
On 2019-Dec-03, Michael Paquier wrote: > Per the low probability of the failures, I did not backpatch that > stuff. I quickly looked at applying that further down, and attached > is a version for v12 FWIW, and I suspect much more conflicts the more > you go down (wal segment size added in 11, different code paths for > replication, etc.). You didn't attach anything, but I concur about the low probability aspect: the assertion failure does not occur in production builds (obviously); and only an out-of-memory situation is a real problem when an fsync fails. Anyway this should be a very localized fix, right? I'm not sure that the internationalization stuff in issue_xlog_fsync is correct. I think the _() should be gettext_noop(), or alternatively the errmsg() should be errmsg_internal(); otherwise the translation is invoked twice. (I didn't verify this.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I'm not sure that the internationalization stuff in issue_xlog_fsync is > correct. I think the _() should be gettext_noop(), or alternatively the > errmsg() should be errmsg_internal(); otherwise the translation is > invoked twice. (I didn't verify this.) Also, buildfarm member drongo is not happy: postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] I'm guessing you missed a reference someplace. regards, tom lane
I wrote: > Also, buildfarm member drongo is not happy: > postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] > Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] > I'm guessing you missed a reference someplace. Hm ... grep swears up and down that there is no remaining instance of the string "XLogFileNameP" anywhere in the tree. So this doesn't seem to be the fault of 9989d37d1 per se. What my eye now falls on is this, a bit further up in the build log [1]: ... PreLinkEvent: Generate DEF file perl src\tools\msvc\gendef.pl Release\postgres x64 :VCEnd Not re-generating POSTGRES.DEF, file already exists. Link: ... So it seems that the problem might really be a faulty rule in our MSVC build script about when postgres.def needs to be regenerated? Or else it's some weird caching problem on drongo --- the lack of complaints from other Windows critters might point the finger that way. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01
On Tue, 3 Dec 2019 at 07:09, Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote: > > Agreed. I've attached the updated version patch. Please review it. > > Thanks, applied on HEAD after a few edits. gettext() does not set > errno, so the new style of issue_xlog_fsync() is actually fine. Thanks! > Please note that there was one mistake in the patch: you forgot to > assign back errno in assign_xlog_sync_method() after generating the > file name. My bad. Thank you for fixing it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 03, 2019 at 11:24:57AM -0500, Tom Lane wrote: > So it seems that the problem might really be a faulty rule in our > MSVC build script about when postgres.def needs to be regenerated? > Or else it's some weird caching problem on drongo --- the lack of > complaints from other Windows critters might point the finger > that way. Yes, I saw the failure from the buildfarm logs, but I got to the conclusion that the animal just got crazy with a portion of its caching because there are no more references to routine removed. So I did not bother much. FWIW, I have seen sometimes similar warnings regarding conflicting projects when doing a compilation on MSVC, applying a patch and then recompiling (because that's just faster than recompiling the whole). Perhaps we could do better here. I am not completely sure what though, it's not a problem I have spent much brain-time on. -- Michael
Attachment
On Tue, Dec 03, 2019 at 09:35:00AM -0300, Alvaro Herrera wrote: > You didn't attach anything, but I concur about the low probability > aspect: the assertion failure does not occur in production builds > (obviously); and only an out-of-memory situation is a real problem > when > an fsync fails. Anyway this should be a very localized fix, right? Sorry. You get something like the attached. The recent refactoring work you committed in this area causes already conflicts on REL_12_STABLE. > I'm not sure that the internationalization stuff in issue_xlog_fsync > is correct. I think the _() should be gettext_noop(), or alternatively > the errmsg() should be errmsg_internal(); otherwise the translation is > invoked twice. (I didn't verify this.) Hmm. We actually do both in tablecmds.c:ATWrongRelkindError(), and that's the code I was looking at yesterday when thinking about the problem.. However, parse_agg.c, parse_expr.c and parse_func.c among others like vacuumlazy.c use directly errmsg_internal() without translating the string first. So there is indeed duplicated work for both. Does the attached patch look correct to you? -- Michael
Attachment
On 12/3/19 11:24 AM, Tom Lane wrote: > I wrote: >> Also, buildfarm member drongo is not happy: >> postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] >> Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj] >> I'm guessing you missed a reference someplace. > Hm ... grep swears up and down that there is no remaining instance > of the string "XLogFileNameP" anywhere in the tree. So this doesn't > seem to be the fault of 9989d37d1 per se. What my eye now falls on > is this, a bit further up in the build log [1]: > > ... > PreLinkEvent: > Generate DEF file > perl src\tools\msvc\gendef.pl Release\postgres x64 > :VCEnd > Not re-generating POSTGRES.DEF, file already exists. > Link: > ... > > So it seems that the problem might really be a faulty rule in our > MSVC build script about when postgres.def needs to be regenerated? > Or else it's some weird caching problem on drongo --- the lack of > complaints from other Windows critters might point the finger > that way. > > regards, tom lane > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01 > this was pilot error on my part. Should be fixed now. cheers andrew