Thread: Using XLogFileNameP in critical section

Using XLogFileNameP in critical section

From
Masahiko Sawada
Date:
Hi,

I encountered that the assertion error is reported instead of a proper
PANIC message when failed to fsync WAL. The cause is that there are
multiple places where we call XLogFileNameP function that calls palloc
during critical section, for example XLogWrite function.

TRAP: FailedAssertion("CritSectionCount == 0 ||
(context)->allowInCritSection", File: "mcxt.c", Line: 956)

As far as I can see there are five places we need to fix.I've attached a patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using XLogFileNameP in critical section

From
Michael Paquier
Date:
On Fri, Nov 29, 2019 at 06:44:58PM +0100, Masahiko Sawada wrote:
> I encountered that the assertion error is reported instead of a proper
> PANIC message when failed to fsync WAL. The cause is that there are
> multiple places where we call XLogFileNameP function that calls palloc
> during critical section, for example XLogWrite function.
>
> TRAP: FailedAssertion("CritSectionCount == 0 ||
> (context)->allowInCritSection", File: "mcxt.c", Line: 956)
>
> As far as I can see there are five places we need to fix.  I've
> attached a patch.

+    msg = "could not fdatasync file \"%s\": %m";
Missing some translations, no?

You are missing a couple of code paths in walreceiver.c,
XLogWalRcvWrite(), where XLogFileNameP is used on a PANIC.  This
brings me the following points:
1) If you look closely, all the callers of XLogFileNameP() are used
for the generation of error strings.
2) I am ready to bet that we'll have the same discussion in the future
because somebody will make the same mistake for a new code path.

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and just change those code paths so as they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
actually to some simplifications, see for example XLogWalRcvWrite..
--
Michael

Attachment

Re: Using XLogFileNameP in critical section

From
Alvaro Herrera
Date:
On 2019-Dec-02, Michael Paquier wrote:

> I think that we had better just get rid of XLogFileNameP() (on HEAD)
> and just change those code paths so as they use a buffer of size
> MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
> actually to some simplifications, see for example XLogWalRcvWrite..

+1, that function is a footgun and there's no possible way to fix it so
that it's never misused(*).  Also, put a comment atop XLogFileName
indicating never to create such a "helper" function again.

(*) unless there was a way to have a StaticAssert() that the function is
not called in a critical section -- have the compiler work for us.  I
don't think it's possible to write such a thing.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using XLogFileNameP in critical section

From
Masahiko Sawada
Date:
On Mon, 2 Dec 2019 at 16:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Dec-02, Michael Paquier wrote:
>
> > I think that we had better just get rid of XLogFileNameP() (on HEAD)
> > and just change those code paths so as they use a buffer of size
> > MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
> > actually to some simplifications, see for example XLogWalRcvWrite..
>
> +1, that function is a footgun and there's no possible way to fix it so
> that it's never misused(*).  Also, put a comment atop XLogFileName
> indicating never to create such a "helper" function again.
>
> (*) unless there was a way to have a StaticAssert() that the function is
> not called in a critical section -- have the compiler work for us.  I
> don't think it's possible to write such a thing.

Agreed. I've attached the updated version patch. Please review it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using XLogFileNameP in critical section

From
Michael Paquier
Date:
On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:
> Agreed. I've attached the updated version patch. Please review it.

Thanks, applied on HEAD after a few edits.  gettext() does not set
errno, so the new style of issue_xlog_fsync() is actually fine.
Please note that there was one mistake in the patch: you forgot to
assign back errno in assign_xlog_sync_method() after generating the
file name.

Per the low probability of the failures, I did not backpatch that
stuff.  I quickly looked at applying that further down, and attached
is a version for v12 FWIW, and I suspect much more conflicts the more
you go down (wal segment size added in 11, different code paths for
replication, etc.).
--
Michael

Attachment

Re: Using XLogFileNameP in critical section

From
Alvaro Herrera
Date:
On 2019-Dec-03, Michael Paquier wrote:

> Per the low probability of the failures, I did not backpatch that
> stuff.  I quickly looked at applying that further down, and attached
> is a version for v12 FWIW, and I suspect much more conflicts the more
> you go down (wal segment size added in 11, different code paths for
> replication, etc.).

You didn't attach anything, but I concur about the low probability
aspect: the assertion failure does not occur in production builds
(obviously); and only an out-of-memory situation is a real problem when
an fsync fails.  Anyway this should be a very localized fix, right?

I'm not sure that the internationalization stuff in issue_xlog_fsync is
correct.  I think the _() should be gettext_noop(), or alternatively the
errmsg() should be errmsg_internal(); otherwise the translation is
invoked twice.  (I didn't verify this.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using XLogFileNameP in critical section

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I'm not sure that the internationalization stuff in issue_xlog_fsync is
> correct.  I think the _() should be gettext_noop(), or alternatively the
> errmsg() should be errmsg_internal(); otherwise the translation is
> invoked twice.  (I didn't verify this.)

Also, buildfarm member drongo is not happy:

postgres.def : error LNK2001: unresolved external symbol XLogFileNameP
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]

I'm guessing you missed a reference someplace.

            regards, tom lane



Re: Using XLogFileNameP in critical section

From
Tom Lane
Date:
I wrote:
> Also, buildfarm member drongo is not happy:
> postgres.def : error LNK2001: unresolved external symbol XLogFileNameP
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
> Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
> I'm guessing you missed a reference someplace.

Hm ... grep swears up and down that there is no remaining instance
of the string "XLogFileNameP" anywhere in the tree.  So this doesn't
seem to be the fault of 9989d37d1 per se.  What my eye now falls on
is this, a bit further up in the build log [1]:

...
PreLinkEvent:
  Generate DEF file
  perl src\tools\msvc\gendef.pl Release\postgres x64
  :VCEnd
  Not re-generating POSTGRES.DEF, file already exists.
Link:
...

So it seems that the problem might really be a faulty rule in our
MSVC build script about when postgres.def needs to be regenerated?
Or else it's some weird caching problem on drongo --- the lack of
complaints from other Windows critters might point the finger
that way.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01



Re: Using XLogFileNameP in critical section

From
Masahiko Sawada
Date:
On Tue, 3 Dec 2019 at 07:09, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:
> > Agreed. I've attached the updated version patch. Please review it.
>
> Thanks, applied on HEAD after a few edits.  gettext() does not set
> errno, so the new style of issue_xlog_fsync() is actually fine.

Thanks!

> Please note that there was one mistake in the patch: you forgot to
> assign back errno in assign_xlog_sync_method() after generating the
> file name.

My bad. Thank you for fixing it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using XLogFileNameP in critical section

From
Michael Paquier
Date:
On Tue, Dec 03, 2019 at 11:24:57AM -0500, Tom Lane wrote:
> So it seems that the problem might really be a faulty rule in our
> MSVC build script about when postgres.def needs to be regenerated?
> Or else it's some weird caching problem on drongo --- the lack of
> complaints from other Windows critters might point the finger
> that way.

Yes, I saw the failure from the buildfarm logs, but I got to the
conclusion that the animal just got crazy with a portion of its
caching because there are no more references to routine removed.  So I
did not bother much.

FWIW, I have seen sometimes similar warnings regarding conflicting
projects when doing a compilation on MSVC, applying a patch and then
recompiling (because that's just faster than recompiling the whole).
Perhaps we could do better here.  I am not completely sure what
though, it's not a problem I have spent much brain-time on.
--
Michael

Attachment

Re: Using XLogFileNameP in critical section

From
Michael Paquier
Date:
On Tue, Dec 03, 2019 at 09:35:00AM -0300, Alvaro Herrera wrote:
> You didn't attach anything, but I concur about the low probability
> aspect: the assertion failure does not occur in production builds
> (obviously); and only an out-of-memory situation is a real problem
> when
> an fsync fails.  Anyway this should be a very localized fix, right?

Sorry.  You get something like the attached.  The recent refactoring
work you committed in this area causes already conflicts on
REL_12_STABLE.

> I'm not sure that the internationalization stuff in issue_xlog_fsync
> is correct.  I think the _() should be gettext_noop(), or alternatively
> the errmsg() should be errmsg_internal(); otherwise the translation is
> invoked twice.  (I didn't verify this.)

Hmm.  We actually do both in tablecmds.c:ATWrongRelkindError(), and
that's the code I was looking at yesterday when thinking about the
problem..  However, parse_agg.c, parse_expr.c and parse_func.c among
others like vacuumlazy.c use directly errmsg_internal() without
translating the string first.  So there is indeed duplicated work for
both.  Does the attached patch look correct to you?
--
Michael

Attachment

Re: Using XLogFileNameP in critical section

From
Andrew Dunstan
Date:
On 12/3/19 11:24 AM, Tom Lane wrote:
> I wrote:
>> Also, buildfarm member drongo is not happy:
>> postgres.def : error LNK2001: unresolved external symbol XLogFileNameP
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
>> Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
>> I'm guessing you missed a reference someplace.
> Hm ... grep swears up and down that there is no remaining instance
> of the string "XLogFileNameP" anywhere in the tree.  So this doesn't
> seem to be the fault of 9989d37d1 per se.  What my eye now falls on
> is this, a bit further up in the build log [1]:
>
> ...
> PreLinkEvent:
>   Generate DEF file
>   perl src\tools\msvc\gendef.pl Release\postgres x64
>   :VCEnd
>   Not re-generating POSTGRES.DEF, file already exists.
> Link:
> ...
>
> So it seems that the problem might really be a faulty rule in our
> MSVC build script about when postgres.def needs to be regenerated?
> Or else it's some weird caching problem on drongo --- the lack of
> complaints from other Windows critters might point the finger
> that way.
>
>             regards, tom lane
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01
>


this was pilot error on my part. Should be fixed now.


cheers


andrew