Thread: Anti-critical-section assertion failure in mcxt.c reached by walsender

Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Thomas Munro
Hi,

While looking for something else, I noticed thorntail has failed twice
like this, on REL_12_STABLE:

TRAP: FailedAssertion("!(CritSectionCount == 0 ||
(context)->allowInCritSection)", File:
"/home/nm/farm/sparc64_deb10_gcc_64_ubsan/REL_12_STABLE/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c",
Line: 931)

In both cases, the last thing that process said was:

2021-05-05 20:25:51.650 MSK [3442631:6] 008_fsm_truncation.pl LOG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup'   FAST NOWAIT

Unfortunately there is no libbacktrace in that release, and for some
reason we don't see a core being analysed... (gdb not installed,
looking for wrong core file pattern, ...?)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-05-05%2017%3A08%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-04-22%2016%3A59%3A01



Thomas Munro <thomas.munro@gmail.com> writes:
> While looking for something else, I noticed thorntail has failed twice
> like this, on REL_12_STABLE:
> TRAP: FailedAssertion("!(CritSectionCount == 0 ||
> (context)->allowInCritSection)", File:
> "/home/nm/farm/sparc64_deb10_gcc_64_ubsan/REL_12_STABLE/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c",
> Line: 931)

After failing to reproduce this locally, I went so far as to sign up
for a gcc compile farm account so I could try to reproduce it on the
machine running thorntail.  I succeeded, after more than a few tries,
and here is the smoking gun:

#3  0x00000100007f792c in ExceptionalCondition (
    conditionName=0x10000a38b80 "!(CritSectionCount == 0 || (context)->allowInCritSection)",
    errorType=0x1000087fb20 "FailedAssertion",
    fileName=0x10000a38908 "mcxt.c", lineNumber=<optimized out>) at assert.c:54
#4  0x00000100008422f4 in palloc (size=64) at mcxt.c:931
#5  0x00000100001f5cec in XLogFileNameP (tli=<optimized out>, segno=1)
    at xlog.c:10209
#6  0x00000100001f6220 in issue_xlog_fsync (fd=<optimized out>, segno=1)
    at xlog.c:10186
#7  0x00000100001f6784 in XLogWrite (WriteRqst=..., flexible=<optimized out>)
    at xlog.c:2607
#8  0x00000100001f793c in XLogFlush (record=23717128) at xlog.c:2926
#9  XLogFlush (record=23717128) at xlog.c:2802
#10 0x00000100001fe71c in XLogReportParameters () at xlog.c:9525
#11 StartupXLOG () at xlog.c:7805
#12 0x0000010000552d30 in StartupProcessMain () at startup.c:226
#13 0x0000010000215c1c in AuxiliaryProcessMain (argc=2, argv=0x7feffdc2f80)
    at bootstrap.c:451

The interesting part of this is frame 6, which points here:

        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync file \"%s\": %m",
                                XLogFileNameP(ThisTimeLineID, segno))));

So fdatasync() failed, and the code attempting to report that is not
critical-section-safe because it includes a palloc.  Checking the state
of elog.c's error stack shows that the failure was errno = 5, or EIO.
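
To spell out where that palloc runs: from memory of the v12 elog.h
(a simplified sketch, not the macro expansion verbatim), the ereport()
above turns into roughly

    if (errstart(PANIC, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN))
        errfinish(errcode_for_file_access(),
                  errmsg("could not fdatasync file \"%s\": %m",
                         /* evaluated right here, in CurrentMemoryContext,
                          * before errmsg() can switch to ErrorContext */
                         XLogFileNameP(ThisTimeLineID, segno)));

so the palloc happens before elog.c gets any say in the matter.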

Conclusions:

1. No wonder we could not reproduce it anywhere else.  I've warned
the cfarm admins that their machine may be having hardware issues.

2. We evidently need to put a bit more effort into this error
reporting logic.  More generally, I wonder how we could audit
the code for similar hazards elsewhere, because I bet there are
some.  (Or ... could it be sane to run functions included in
the ereport's arguments in ErrorContext?)

3. One might wonder why we're getting an fdatasync failure at
all, when thorntail is configured to run with fsync = off.
The answer to that one is that 008_fsm_truncation.pl takes it
upon itself to force fsync = on, overriding the express wishes
of the buildfarm owner, not to mention general project policy.
AFAICT that was added with little if any thought in the initial
creation of 008_fsm_truncation.pl, and I think we should take
it out.  There's certainly no visible reason for this one
TAP script to be running with fsync on when no others do.

> Unfortunately there is no libbacktrace in that release, and for some
> reason we don't see a core being analysed... (gdb not installed,
> looking for wrong core file pattern, ...?)

That I'm not sure about.  gdb is certainly installed, and thorntail is
visibly running the current buildfarm client and is configured with the
correct core_file_glob, and I can report that the crash did leave a 'core'
file in the data directory (so it's not a case of systemd commandeering
the core dump).  Seems like core-file collection should've worked
... unless maybe it's not covering TAP tests at all?

            regards, tom lane



On Thu, May 06, 2021 at 09:43:32PM -0400, Tom Lane wrote:
> 2. We evidently need to put a bit more effort into this error
> reporting logic.  More generally, I wonder how we could audit
> the code for similar hazards elsewhere, because I bet there are
> some.  (Or ... could it be sane to run functions included in
> the ereport's arguments in ErrorContext?)

Seems reasonable.  I don't have good ideas for auditing; just making the
palloc work may be easier.
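
For what it's worth, mcxt.c already has a per-context escape hatch;
a from-memory sketch (treat the details as approximate):

    MemoryContext cxt = AllocSetContextCreate(TopMemoryContext,
                                              "crit-section-safe cxt",
                                              ALLOCSET_DEFAULT_SIZES);

    /* Suppress the anti-critical-section Assert for this context.
     * Allocations can still fail if they need a fresh malloc'd block. */
    MemoryContextAllowInCriticalSection(cxt, true);

ErrorContext itself is whitelisted that way at startup, which is why
elog.c's own allocations survive the assertion.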

> 3. One might wonder why we're getting an fdatasync failure at
> all, when thorntail is configured to run with fsync = off.
> The answer to that one is that 008_fsm_truncation.pl takes it
> upon itself to force fsync = on, overriding the express wishes
> of the buildfarm owner, not to mention general project policy.
> AFAICT that was added with little if any thought in the initial
> creation of 008_fsm_truncation.pl, and I think we should take
> it out.  There's certainly no visible reason for this one
> TAP script to be running with fsync on when no others do.

I've caught that one test taking ~10min due to its fsync use.  If fsync=on
isn't important to the test, +1 for removing it.



Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andres Freund
Hi,

On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
> 2. We evidently need to put a bit more effort into this error
> reporting logic.  More generally, I wonder how we could audit
> the code for similar hazards elsewhere, because I bet there are
> some.  (Or ... could it be sane to run functions included in
> the ereport's arguments in ErrorContext?)

I have wondered about that before myself. It's pretty awkward to solve
these kinds of things at the caller level, and we have a perfectly good
context to do this in, one that we know is going to be reset. However -
we don't reset ErrorContext for DEBUG messages, I believe. So there'd
be a noticeable increase in leaking into ErrorContext, unless we change
how we do that?

I guess I could see switching to another memory context only for >=
ERROR, but it does seem a bit odd. But for PANIC etc. it's quite
annoying to lose the actual error message on the buildfarm.


> > Unfortunately there is no libbacktrace in that release, and for some
> > reason we don't see a core being analysed... (gdb not installed,
> > looking for wrong core file pattern, ...?)
> 
> That I'm not sure about.  gdb is certainly installed, and thorntail is
> visibly running the current buildfarm client and is configured with the
> correct core_file_glob, and I can report that the crash did leave a 'core'
> file in the data directory (so it's not a case of systemd commandeering
> the core dump).  Seems like core-file collection should've worked
> ... unless maybe it's not covering TAP tests at all?

I suspect that is it - there's not really a good way for the buildfarm
client to even know where there could be data directories :(.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>> visibly running the current buildfarm client and is configured with the
>> correct core_file_glob, and I can report that the crash did leave a 'core'
>> file in the data directory (so it's not a case of systemd commandeering
>> the core dump).  Seems like core-file collection should've worked
>> ... unless maybe it's not covering TAP tests at all?

> I suspect that is it - there's not really a good way for the buildfarm
> client to even know where there could be data directories :(.

Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
or something along that line.

            regards, tom lane



Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andres Freund
Hi,

On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
> >> That I'm not sure about.  gdb is certainly installed, and thorntail is
> >> visibly running the current buildfarm client and is configured with the
> >> correct core_file_glob, and I can report that the crash did leave a 'core'
> >> file in the data directory (so it's not a case of systemd commandeering
> >> the core dump).  Seems like core-file collection should've worked
> >> ... unless maybe it's not covering TAP tests at all?
> 
> > I suspect that is it - there's not really a good way for the buildfarm
> > client to even know where there could be data directories :(.
> 
> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
> or something along that line.

Yea, it'd be doable that way. It'd be a bit harder to associate the core
files with specific tests, though. But I now checked, and it indeed
checks for core files in only a specific subset of tests, and that check
only globs inside the passed-in datadir.

sub get_stack_trace
{
        return get_stack_trace_cygwin(@_) if ($ENV{CYGWIN});

        my $bindir = shift;
        my $pgdata = shift;

        # no core = no result
        my @cores = glob("$pgdata/$core_file_glob");
        return () unless @cores;
...

andres@awork3:~/src/pgbuildfarm-client$ ack get_stack_trace
run_build.pl
1601:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1637:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1711:        my @trace = get_stack_trace("$installdir/bin", "$installdir/data");
1750:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1798:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");
2010:        my @trace = get_stack_trace("$binloc$installdir/bin", "$base/data");
2061:          get_stack_trace("$base/install$installdir/bin", "$base/data");

PGBuild/Utils.pm
30:  get_stack_trace cleanlogs writelog
148:sub get_stack_trace_cygwin
173:sub get_stack_trace
175:    return get_stack_trace_cygwin(@_) if ($ENV{CYGWIN});

PGBuild/Modules/RedisFDW.pm
214:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");

PGBuild/Modules/TestUpgrade.pm
148:        my @trace = get_stack_trace("$tmp_bin_dir", "$tmp_data_dir");

PGBuild/Modules/TestCollateLinuxUTF8.pm
122:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");

PGBuild/Modules/TestICU.pm
110:          get_stack_trace("$installdir/bin", "$installdir/data-$locale");


Greetings,

Andres Freund



On Fri, May 7, 2021 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The interesting part of this is frame 6, which points here:
>
>         case SYNC_METHOD_FDATASYNC:
>             if (pg_fdatasync(fd) != 0)
>                 ereport(PANIC,
>                         (errcode_for_file_access(),
>                          errmsg("could not fdatasync file \"%s\": %m",
>                                 XLogFileNameP(ThisTimeLineID, segno))));

Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
tree" to fix this exact problem.



Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, May 7, 2021 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The interesting part of this is frame 6, which points here:

> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
> tree" to fix this exact problem.

Hah, so that maybe explains why thorntail has only shown this in
the v12 branch.  Should we consider back-patching that?

The more general issue of how to detect, or else make safe,
pallocs in critical error reports remains.  But I bet any answer
we think of for that will not be back-patchable.  So back-patching
a localized fix for the specific bug we know of might be worth doing.

            regards, tom lane



I wrote:
> 1. No wonder we could not reproduce it anywhere else.  I've warned
> the cfarm admins that their machine may be having hardware issues.

I heard back from the machine's admin.  The time of the crash I observed
matches exactly to these events in the kernel log:

May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0

So it's not a mirage.  The admin seems to think it might be a kernel
bug though.

            regards, tom lane



Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andrew Dunstan
On 5/7/21 12:38 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>>>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>>>> visibly running the current buildfarm client and is configured with the
>>>> correct core_file_glob, and I can report that the crash did leave a 'core'
>>>> file in the data directory (so it's not a case of systemd commandeering
>>>> the core dump).  Seems like core-file collection should've worked
>>>> ... unless maybe it's not covering TAP tests at all?
>>> I suspect that is it - there's not really a good way for the buildfarm
>>> client to even know where there could be data directories :(.
>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>> or something along that line.
> Yea, it'd be doable that way. It'd be a bit harder to associate the core
> files with specific tests, though. But I now checked, and it indeed
> checks for core files in only a specific subset of tests, and that check
> only globs inside the passed-in datadir.
>

working on it ...


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
>> tree" to fix this exact problem.

> Hah, so that maybe explains why thorntail has only shown this in
> the v12 branch.  Should we consider back-patching that?

Realizing that 9989d37d prevents the assertion failure, I went
to see if thorntail had shown EIO failures without assertions.
Looking back 180 days, I found these:

  sysname  |    branch     |      snapshot       |       stage        |                                l
-----------+---------------+---------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------
 thorntail | HEAD          | 2021-03-19 21:28:15 | recoveryCheck      | 2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
 thorntail | HEAD          | 2021-04-06 16:08:10 | recoveryCheck      | 2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
 thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | pg_basebackup: could not fsync file "000000010000000000000013": Input/output error

So indeed the kernel-or-hardware problem is affecting other branches.
I suspect that the lack of reports in the pre-v12 branches is mostly
down to there having been many fewer runs on those branches within
the past couple months.

            regards, tom lane



Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andrew Dunstan
On 5/7/21 11:27 AM, Andrew Dunstan wrote:
> On 5/7/21 12:38 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>>>>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>>>>> visibly running the current buildfarm client and is configured with the
>>>>> correct core_file_glob, and I can report that the crash did leave a 'core'
>>>>> file in the data directory (so it's not a case of systemd commandeering
>>>>> the core dump).  Seems like core-file collection should've worked
>>>>> ... unless maybe it's not covering TAP tests at all?
>>>> I suspect that is it - there's not really a good way for the buildfarm
>>>> client to even know where there could be data directories :(.
>>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>>> or something along that line.
>> Yea, it'd be doable that way. It'd be a bit harder to associate the core
>> files with specific tests, though. But I now checked, and it indeed
>> checks for core files in only a specific subset of tests, and that check
>> only globs inside the passed-in datadir.
>>
> working on it ...
>
>
> cheers
>
>


see
<https://github.com/PGBuildFarm/client-code/commit/5361bb5ff01ee47e8998f5f8f44732ccc5a01183>


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andres Freund
Hi,

On 2021-05-07 10:29:58 -0400, Tom Lane wrote:
> I wrote:
> > 1. No wonder we could not reproduce it anywhere else.  I've warned
> > the cfarm admins that their machine may be having hardware issues.
> 
> I heard back from the machine's admin.  The time of the crash I observed
> matches exactly to these events in the kernel log:
> 
> May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
> May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0
> 
> So it's not a mirage.  The admin seems to think it might be a kernel
> bug though.

Isn't this a good reason to have at least some tests run with fsync=on?

It makes a ton of sense for buildfarm animals to disable fsync to
achieve acceptable performance. Having something in there that
nevertheless does some light exercise of the fsync code doesn't seem
bad?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> Isn't this a good reason to have at least some tests run with fsync=on?

Why?

I can certainly see an argument for running some buildfarm animals
with fsync on (for all tests).  I don't see a reason for forcing
them all to run some tests that way; and if I were going to do that,
I doubt that 008_fsm_truncation.pl would be the one I would pick.
I think it's nothing but sloppiness that that one is out of step with
all the rest.

IMO, if a buildfarm owner sets fsync = off, they mean off.
They don't mean "maybe".

            regards, tom lane



On Fri, May 07, 2021 at 01:18:19PM -0400, Tom Lane wrote:
> Realizing that 9989d37d prevents the assertion failure, I went
> to see if thorntail had shown EIO failures without assertions.
> Looking back 180 days, I found these:
> 
>   sysname  |    branch     |      snapshot       |       stage        |                                l
> -----------+---------------+---------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------
>  thorntail | HEAD          | 2021-03-19 21:28:15 | recoveryCheck      | 2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
>  thorntail | HEAD          | 2021-04-06 16:08:10 | recoveryCheck      | 2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
>  thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | pg_basebackup: could not fsync file "000000010000000000000013": Input/output error
> 
> So indeed the kernel-or-hardware problem is affecting other branches.

Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
prevent fsync from reaching the kernel.  Hopefully, that will make the
hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
wouldn't avoid this, because fsync=off doesn't affect syncs outside the
backend.)
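
A minimal sketch of that kind of shim, for anyone curious (illustrative
only; the actual attachment is not reproduced here):

    /*
     * nosync.c: report success without ever reaching the device.
     * Build and use with, e.g.:
     *   gcc -shared -fPIC -o nosync.so nosync.c
     *   LD_PRELOAD=/path/to/nosync.so <run the tests>
     */
    int
    fsync(int fd)
    {
        (void) fd;
        return 0;
    }

    int
    fdatasync(int fd)
    {
        (void) fd;
        return 0;
    }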


Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Michael Paquier
On Fri, May 07, 2021 at 04:30:00PM -0400, Tom Lane wrote:
> I can certainly see an argument for running some buildfarm animals
> with fsync on (for all tests).  I don't see a reason for forcing
> them all to run some tests that way; and if I were going to do that,
> I doubt that 008_fsm_truncation.pl would be the one I would pick.
> I think it's nothing but sloppiness that that one is out of step with
> all the rest.

My take on this point is that honoring whatever configuration is
enforced for each animal would be enough.  I manage a small animal,
and this stuff can take a while to flush some data.

Worth noting that using fsync=on has not been discussed on the
original thread, and I don't see why that's necessary:
https://www.postgresql.org/message-id/flat/CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn%3D_9A%40mail.gmail.com
So I would vote for removing it in this case.
--
Michael


Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Michael Paquier
On Fri, May 07, 2021 at 04:42:46PM +1200, Thomas Munro wrote:
> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
> tree" to fix this exact problem.

I don't see that we'd be able to get a redesign of this area safe
enough for a backpatch, but perhaps we (I?) had better put some extra
effort into back-patching this commit while keeping XLogFileNameP()
around for compatibility?  How do people feel about that?
--
Michael


Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Andres Freund
Hi,

On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
> prevent fsync from reaching the kernel.  Hopefully, that will make the
> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> backend.)

Not sure how reliable that is - there's other paths that could return an
error, I think. If the root cause is the disk responding weirdly to
write cache flushes, you could tell the kernel that the disk has no
write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
>> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
>> prevent fsync from reaching the kernel.  Hopefully, that will make the
>> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
>> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
>> backend.)

> Not sure how reliable that is - there's other paths that could return an
> error, I think. If the root cause is the disk responding weirdly to
> write cache flushes, you could tell the kernel that the disk has no
> write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

I seriously doubt Noah has root on that machine.

More to the point, the admin told me it's a VM (or LDOM, whatever that is)
under a Solaris host, so there's no direct hardware access going on
anyway.  He didn't say in so many words, but I suspect the reason he's
suspecting kernel bugs is that there's nothing going wrong so far as the
host OS is concerned.

            regards, tom lane



On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> >> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
> >> prevent fsync from reaching the kernel.  Hopefully, that will make the
> >> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> >> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> >> backend.)
> 
> > Not sure how reliable that is - there's other paths that could return an
> > error, I think.

Yep, one can imagine a failure at close() or something.  All the non-HEAD
buildfarm failures are at some *sync call, so I'm optimistic about getting
mileage from this.  (I didn't check the more-numerous HEAD failures.)  If it's
not enough, I may move the farm directory to tmpfs.

> > If the root cause is the disk responding weirdly to
> > write cache flushes, you could tell the kernel that the disk has no
> > write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).
> 
> I seriously doubt Noah has root on that machine.

If I can make the case for that setting being a good thing for the VM's users
generally, I probably can file a ticket and get it done.

> More to the point, the admin told me it's a VM (or LDOM, whatever that is)
> under a Solaris host, so there's no direct hardware access going on
> anyway.  He didn't say in so many words, but I suspect the reason he's
> suspecting kernel bugs is that there's nothing going wrong so far as the
> host OS is concerned.



On Sat, May 8, 2021 at 2:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11

That's -EAGAIN (assuming errnos match x86) and I guess it indicates
that VDC_MAX_RETRIES is exceeded here:

https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L451
https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L526

One theory is that the hypervisor/host is occasionally too swamped to
service the request queue fast enough over a ~10ms period, given that
vio_ldc_send() itself retries 1000 times with a 1us sleep, the outer
loop tries ten times, and ldc.c's write_nonraw() reports -EAGAIN when
there is no space for the message.  (Alternatively, it's trying to
send a message that's too big for the channel, the channel is
corrupted by bugs, or my fly-by of this code I'd never heard of before
now is just way off...)
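
To put numbers on that "~10ms": a back-of-envelope check, with the
constants as I read them (unverified against the kernel source):

    #include <stdio.h>

    int
    main(void)
    {
        long    inner_spins = 1000;     /* vio_ldc_send() retries */
        long    sleep_us = 1;           /* sleep per inner retry */
        long    outer_tries = 10;       /* vdc_tx_trigger() loop */

        /* how long the channel must stay full before -EAGAIN escapes */
        printf("~%ld ms\n", outer_tries * inner_spins * sleep_us / 1000);
        return 0;
    }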



On Sat, May 08, 2021 at 04:57:54PM +1200, Thomas Munro wrote:
> On Sat, May 8, 2021 at 2:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
> 
> That's -EAGAIN (assuming errnos match x86) and I guess it indicates
> that VDC_MAX_RETRIES is exceeded here:
> 
> https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L451
> https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L526
> 
> One theory is that the hypervisor/host is occasionally too swamped to
> service the request queue fast enough over a ~10ms period, given that
> vio_ldc_send() itself retries 1000 times with a 1us sleep, the outer
> loop tries ten times, and ldc.c's write_nonraw() reports -EAGAIN when
> there is no space for the message.  (Alternatively, it's trying to
> send a message that's too big for the channel, the channel is
> corrupted by bugs, or my fly-by of this code I'd never heard of before
> now is just way off...)

Nice discovery.  From
https://github.com/torvalds/linux/commit/a11f6ca9aef989b56cd31ff4ee2af4fb31a172ec
I see those details are 2.5 years old, somewhat young relative to the driver
as a whole.  I don't know which part should change, though.