Thread: Anti-critical-section assertion failure in mcxt.c reached by walsender
Hi,

While looking for something else, I noticed thorntail has failed twice like this, on REL_12_STABLE:

TRAP: FailedAssertion("!(CritSectionCount == 0 || (context)->allowInCritSection)", File: "/home/nm/farm/sparc64_deb10_gcc_64_ubsan/REL_12_STABLE/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c", Line: 931)

In both cases, the last thing that process said was:

2021-05-05 20:25:51.650 MSK [3442631:6] 008_fsm_truncation.pl LOG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup' FAST NOWAIT

Unfortunately there is no libbacktrace in that release, and for some reason we don't see a core being analysed... (gdb not installed, looking for wrong core file pattern, ...?)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-05-05%2017%3A08%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-04-22%2016%3A59%3A01
Thomas Munro <thomas.munro@gmail.com> writes:
> While looking for something else, I noticed thorntail has failed twice
> like this, on REL_12_STABLE:
> TRAP: FailedAssertion("!(CritSectionCount == 0 ||
> (context)->allowInCritSection)", File:
> "/home/nm/farm/sparc64_deb10_gcc_64_ubsan/REL_12_STABLE/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c",
> Line: 931)

After failing to reproduce this locally, I went so far as to sign up for a gcc compile farm account so I could try to reproduce it on the machine running thorntail.  I succeeded, after more than a few tries, and here is the smoking gun:

#3  0x00000100007f792c in ExceptionalCondition (
    conditionName=0x10000a38b80 "!(CritSectionCount == 0 || (context)->allowInCritSection)",
    errorType=0x1000087fb20 "FailedAssertion", fileName=0x10000a38908 "mcxt.c",
    lineNumber=<optimized out>) at assert.c:54
#4  0x00000100008422f4 in palloc (size=64) at mcxt.c:931
#5  0x00000100001f5cec in XLogFileNameP (tli=<optimized out>, segno=1) at xlog.c:10209
#6  0x00000100001f6220 in issue_xlog_fsync (fd=<optimized out>, segno=1) at xlog.c:10186
#7  0x00000100001f6784 in XLogWrite (WriteRqst=..., flexible=<optimized out>) at xlog.c:2607
#8  0x00000100001f793c in XLogFlush (record=23717128) at xlog.c:2926
#9  XLogFlush (record=23717128) at xlog.c:2802
#10 0x00000100001fe71c in XLogReportParameters () at xlog.c:9525
#11 StartupXLOG () at xlog.c:7805
#12 0x0000010000552d30 in StartupProcessMain () at startup.c:226
#13 0x0000010000215c1c in AuxiliaryProcessMain (argc=2, argv=0x7feffdc2f80) at bootstrap.c:451

The interesting part of this is frame 6, which points here:

        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync file \"%s\": %m",
                                XLogFileNameP(ThisTimeLineID, segno))));

So fdatasync() failed, and the code attempting to report that is not critical-section-safe, because it includes a palloc.
Checking the state of elog.c's error stack shows that the failure was errno = 5, or EIO.  Conclusions:

1. No wonder we could not reproduce it anywhere else.  I've warned the cfarm admins that their machine may be having hardware issues.

2. We evidently need to put a bit more effort into this error reporting logic.  More generally, I wonder how we could audit the code for similar hazards elsewhere, because I bet there are some.  (Or ... could it be sane to run functions included in the ereport's arguments in ErrorContext?)

3. One might wonder why we're getting an fdatasync failure at all, when thorntail is configured to run with fsync = off.  The answer to that one is that 008_fsm_truncation.pl takes it upon itself to force fsync = on, overriding the express wishes of the buildfarm owner, not to mention general project policy.  AFAICT that was added with little if any thought in the initial creation of 008_fsm_truncation.pl, and I think we should take it out.  There's certainly no visible reason for this one TAP script to be running with fsync on when no others do.

> Unfortunately there is no libbacktrace in that release, and for some
> reason we don't see a core being analysed... (gdb not installed,
> looking for wrong core file pattern, ...?)

That I'm not sure about.  gdb is certainly installed, and thorntail is visibly running the current buildfarm client and is configured with the correct core_file_glob, and I can report that the crash did leave a 'core' file in the data directory (so it's not a case of systemd commandeering the core dump).  Seems like core-file collection should've worked ... unless maybe it's not covering TAP tests at all?

			regards, tom lane
On Thu, May 06, 2021 at 09:43:32PM -0400, Tom Lane wrote:
> 2. We evidently need to put a bit more effort into this error
> reporting logic.  More generally, I wonder how we could audit
> the code for similar hazards elsewhere, because I bet there are
> some.  (Or ... could it be sane to run functions included in
> the ereport's arguments in ErrorContext?)

Seems reasonable.  I don't have good ideas for auditing; just making the palloc work may be easier.

> 3. One might wonder why we're getting an fdatasync failure at
> all, when thorntail is configured to run with fsync = off.
> The answer to that one is that 008_fsm_truncation.pl takes it
> upon itself to force fsync = on, overriding the express wishes
> of the buildfarm owner, not to mention general project policy.
> AFAICT that was added with little if any thought in the initial
> creation of 008_fsm_truncation.pl, and I think we should take
> it out.  There's certainly no visible reason for this one
> TAP script to be running with fsync on when no others do.

I've caught that one test taking ~10min due to its fsync use.  If fsync=on isn't important to the test, +1 for removing it.
Hi,

On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
> 2. We evidently need to put a bit more effort into this error
> reporting logic.  More generally, I wonder how we could audit
> the code for similar hazards elsewhere, because I bet there are
> some.  (Or ... could it be sane to run functions included in
> the ereport's arguments in ErrorContext?)

I have wondered about that before myself.  It's pretty awkward to solve these kinds of things at the caller level, and we have a perfectly good context to do this in, one that we know is going to be reset.

However - we don't reset ErrorContext for DEBUG messages, I believe.  So there'd be a noticeable increase in leaking into ErrorContext, unless we change how we do that?  I guess I could see only switching to another memory context for >= ERROR, but it does seem a bit odd.  But for PANIC etc it's quite annoying to lose the actual error message on the buildfarm.

> > Unfortunately there is no libbacktrace in that release, and for some
> > reason we don't see a core being analysed... (gdb not installed,
> > looking for wrong core file pattern, ...?)
>
> That I'm not sure about.  gdb is certainly installed, and thorntail is
> visibly running the current buildfarm client and is configured with the
> correct core_file_glob, and I can report that the crash did leave a 'core'
> file in the data directory (so it's not a case of systemd commandeering
> the core dump).  Seems like core-file collection should've worked
> ... unless maybe it's not covering TAP tests at all?

I suspect that is it - there's not really a good way for the buildfarm client to even know where there could be data directories :(.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>> visibly running the current buildfarm client and is configured with the
>> correct core_file_glob, and I can report that the crash did leave a 'core'
>> file in the data directory (so it's not a case of systemd commandeering
>> the core dump).  Seems like core-file collection should've worked
>> ... unless maybe it's not covering TAP tests at all?

> I suspect that is it - there's not really a good way for the buildfarm
> client to even know where there could be data directories :(.

Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'" or something along that line.

			regards, tom lane
Hi,

On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
> >> That I'm not sure about.  gdb is certainly installed, and thorntail is
> >> visibly running the current buildfarm client and is configured with the
> >> correct core_file_glob, and I can report that the crash did leave a 'core'
> >> file in the data directory (so it's not a case of systemd commandeering
> >> the core dump).  Seems like core-file collection should've worked
> >> ... unless maybe it's not covering TAP tests at all?
>
> > I suspect that is it - there's not really a good way for the buildfarm
> > client to even know where there could be data directories :(.
>
> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
> or something along that line.

Yea, it'd be doable that way.  It'd be a bit harder to associate the core files with specific tests though.  But I now checked, and it indeed checks for core files in a specific subset of tests, and that that check only globs inside the passed-in datadir:

sub get_stack_trace
{
	return get_stack_trace_cygwin(@_) if ($ENV{CYGWIN});
	my $bindir = shift;
	my $pgdata = shift;

	# no core = no result
	my @cores = glob("$pgdata/$core_file_glob");
	return () unless @cores;
	...
andres@awork3:~/src/pgbuildfarm-client$ ack get_stack_trace
run_build.pl
1601:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1637:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1711:        my @trace = get_stack_trace("$installdir/bin", "$installdir/data");
1750:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");
1798:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");
2010:        my @trace = get_stack_trace("$binloc$installdir/bin", "$base/data");
2061:            get_stack_trace("$base/install$installdir/bin", "$base/data");

PGBuild/Utils.pm
30:   get_stack_trace cleanlogs writelog
148:  sub get_stack_trace_cygwin
173:  sub get_stack_trace
175:      return get_stack_trace_cygwin(@_) if ($ENV{CYGWIN});

PGBuild/Modules/RedisFDW.pm
214:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");

PGBuild/Modules/TestUpgrade.pm
148:        my @trace = get_stack_trace("$tmp_bin_dir", "$tmp_data_dir");

PGBuild/Modules/TestCollateLinuxUTF8.pm
122:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");

PGBuild/Modules/TestICU.pm
110:            get_stack_trace("$installdir/bin", "$installdir/data-$locale");

Greetings,

Andres Freund
On Fri, May 7, 2021 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The interesting part of this is frame 6, which points here:
>
>         case SYNC_METHOD_FDATASYNC:
>             if (pg_fdatasync(fd) != 0)
>                 ereport(PANIC,
>                         (errcode_for_file_access(),
>                          errmsg("could not fdatasync file \"%s\": %m",
>                                 XLogFileNameP(ThisTimeLineID, segno))));

Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the tree" to fix this exact problem.
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, May 7, 2021 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The interesting part of this is frame 6, which points here:

> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
> tree" to fix this exact problem.

Hah, so that maybe explains why thorntail has only shown this in the v12 branch.  Should we consider back-patching that?

The more general issue of how to detect, or else make safe, pallocs in critical error reports remains.  But I bet any answer we think of for that will not be back-patchable.  So back-patching a localized fix for the specific bug we know of might be worth doing.

			regards, tom lane
I wrote:
> 1. No wonder we could not reproduce it anywhere else.  I've warned
> the cfarm admins that their machine may be having hardware issues.

I heard back from the machine's admin.  The time of the crash I observed matches exactly to these events in the kernel log:

May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0

So it's not a mirage.  The admin seems to think it might be a kernel bug though.

			regards, tom lane
From: Andrew Dunstan
On 5/7/21 12:38 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>>>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>>>> visibly running the current buildfarm client and is configured with the
>>>> correct core_file_glob, and I can report that the crash did leave a 'core'
>>>> file in the data directory (so it's not a case of systemd commandeering
>>>> the core dump).  Seems like core-file collection should've worked
>>>> ... unless maybe it's not covering TAP tests at all?
>>> I suspect that is it - there's not really a good way for the buildfarm
>>> client to even know where there could be data directories :(.
>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>> or something along that line.
> Yea, it'd be doable that way.  It'd be a bit harder to associate the core
> files with specific tests though.  But I now checked, and it indeed
> checks for core files in a specific subset of tests, and that that test
> only globs inside the passed-in datadir.
>

working on it ...

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
>> tree" to fix this exact problem.

> Hah, so that maybe explains why thorntail has only shown this in
> the v12 branch.  Should we consider back-patching that?

Realizing that 9989d37d prevents the assertion failure, I went to see if thorntail had shown EIO failures without assertions.  Looking back 180 days, I found these:

  sysname  |    branch     |      snapshot       |       stage        |                                 l
-----------+---------------+---------------------+--------------------+-------------------------------------------------------------------------------
 thorntail | HEAD          | 2021-03-19 21:28:15 | recoveryCheck      | 2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
 thorntail | HEAD          | 2021-04-06 16:08:10 | recoveryCheck      | 2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
 thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | pg_basebackup: could not fsync file "000000010000000000000013": Input/output error

So indeed the kernel-or-hardware problem is affecting other branches.  I suspect that the lack of reports in the pre-v12 branches is mostly down to there having been many fewer runs on those branches within the past couple months.

			regards, tom lane
From: Andrew Dunstan
On 5/7/21 11:27 AM, Andrew Dunstan wrote:
> On 5/7/21 12:38 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
>>>>> That I'm not sure about.  gdb is certainly installed, and thorntail is
>>>>> visibly running the current buildfarm client and is configured with the
>>>>> correct core_file_glob, and I can report that the crash did leave a 'core'
>>>>> file in the data directory (so it's not a case of systemd commandeering
>>>>> the core dump).  Seems like core-file collection should've worked
>>>>> ... unless maybe it's not covering TAP tests at all?
>>>> I suspect that is it - there's not really a good way for the buildfarm
>>>> client to even know where there could be data directories :(.
>>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>>> or something along that line.
>> Yea, it'd be doable that way.  It'd be a bit harder to associate the core
>> files with specific tests though.  But I now checked, and it indeed
>> checks for core files in a specific subset of tests, and that that test
>> only globs inside the passed-in datadir.
>>
> working on it ...
>
> cheers
>

see <https://github.com/PGBuildFarm/client-code/commit/5361bb5ff01ee47e8998f5f8f44732ccc5a01183>

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,

On 2021-05-07 10:29:58 -0400, Tom Lane wrote:
> I wrote:
> > 1. No wonder we could not reproduce it anywhere else.  I've warned
> > the cfarm admins that their machine may be having hardware issues.
>
> I heard back from the machine's admin.  The time of the crash I observed
> matches exactly to these events in the kernel log:
>
> May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
> May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0
>
> So it's not a mirage.  The admin seems to think it might be a kernel
> bug though.

Isn't this a good reason to have at least some tests run with fsync=on?  It makes a ton of sense for buildfarm animals to disable fsync to achieve acceptable performance.  Having something in there that nevertheless does some light exercise of the fsync code doesn't seem bad?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Isn't this a good reason to have at least some tests run with fsync=on?

Why?

I can certainly see an argument for running some buildfarm animals with fsync on (for all tests).  I don't see a reason for forcing them all to run some tests that way; and if I were going to do that, I doubt that 008_fsm_truncation.pl would be the one I would pick.  I think it's nothing but sloppiness that that one is out of step with all the rest.

IMO, if a buildfarm owner sets fsync = off, they mean off.  They don't mean "maybe".

			regards, tom lane
On Fri, May 07, 2021 at 01:18:19PM -0400, Tom Lane wrote:
> Realizing that 9989d37d prevents the assertion failure, I went
> to see if thorntail had shown EIO failures without assertions.
> Looking back 180 days, I found these:
>
>   sysname  |    branch     |      snapshot       |       stage        |                                 l
> -----------+---------------+---------------------+--------------------+-------------------------------------------------------------------------------
>  thorntail | HEAD          | 2021-03-19 21:28:15 | recoveryCheck      | 2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
>  thorntail | HEAD          | 2021-04-06 16:08:10 | recoveryCheck      | 2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "000000010000000000000002": Input/output error
>  thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | pg_basebackup: could not fsync file "000000010000000000000013": Input/output error
>
> So indeed the kernel-or-hardware problem is affecting other branches.

Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to prevent fsync from reaching the kernel.  Hopefully, that will make the hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl wouldn't avoid this, because fsync=off doesn't affect syncs outside the backend.)
Attachment
From: Michael Paquier
On Fri, May 07, 2021 at 04:30:00PM -0400, Tom Lane wrote:
> I can certainly see an argument for running some buildfarm animals
> with fsync on (for all tests).  I don't see a reason for forcing
> them all to run some tests that way; and if I were going to do that,
> I doubt that 008_fsm_truncation.pl would be the one I would pick.
> I think it's nothing but sloppiness that that one is out of step with
> all the rest.

My take on this point is that using the configuration that can be enforced for each animal would be enough.  I manage a small animal, and this stuff can take a while to flush some data.  Worth noting that using fsync=on has not been discussed on the original thread, and I don't see why that's necessary:
https://www.postgresql.org/message-id/flat/CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn%3D_9A%40mail.gmail.com

So I would vote for removing it in this case.
--
Michael
From: Michael Paquier
On Fri, May 07, 2021 at 04:42:46PM +1200, Thomas Munro wrote:
> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
> tree" to fix this exact problem.

I don't see that we'd be able to get a redesign of this area safe enough for a backpatch, but perhaps we (I?) had better put some extra effort into back-patching this commit while keeping XLogFileNameP() around for compatibility?  How do people feel about that?
--
Michael
Hi,

On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
> prevent fsync from reaching the kernel.  Hopefully, that will make the
> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> backend.)

Not sure how reliable that is - there's other paths that could return an error, I think.  If the root cause is the disk responding weirdly to write cache flushes, you could tell the kernel that the disk has no write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
>> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
>> prevent fsync from reaching the kernel.  Hopefully, that will make the
>> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
>> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
>> backend.)

> Not sure how reliable that is - there's other paths that could return an
> error, I think.  If the root cause is the disk responding weirdly to
> write cache flushes, you could tell the kernel that the disk has no
> write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

I seriously doubt Noah has root on that machine.

More to the point, the admin told me it's a VM (or LDOM, whatever that is) under a Solaris host, so there's no direct hardware access going on anyway.  He didn't say in so many words, but I suspect the reason he's suspecting kernel bugs is that there's nothing going wrong so far as the host OS is concerned.

			regards, tom lane
On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> >> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
> >> prevent fsync from reaching the kernel.  Hopefully, that will make the
> >> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> >> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> >> backend.)
>
> > Not sure how reliable that is - there's other paths that could return an
> > error, I think.

Yep, one can imagine a failure at close() or something.  All the non-HEAD buildfarm failures are at some *sync call, so I'm optimistic about getting mileage from this.  (I didn't check the more-numerous HEAD failures.)  If it's not enough, I may move the farm directory to tmpfs.

> > If the root cause is the disk responding weirdly to
> > write cache flushes, you could tell the kernel that the disk has no
> > write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).
>
> I seriously doubt Noah has root on that machine.

If I can make the case for that setting being a good thing for the VM's users generally, I probably can file a ticket and get it done.

> More to the point, the admin told me it's a VM (or LDOM, whatever that is)
> under a Solaris host, so there's no direct hardware access going on
> anyway.  He didn't say in so many words, but I suspect the reason he's
> suspecting kernel bugs is that there's nothing going wrong so far as the
> host OS is concerned.
On Sat, May 8, 2021 at 2:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11

That's -EAGAIN (assuming errnos match x86) and I guess it indicates that VDC_MAX_RETRIES is exceeded here:

https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L451
https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L526

One theory is that the hypervisor/host is occasionally too swamped to service the request queue fast enough over a ~10ms period, given that vio_ldc_send() itself retries 1000 times with a 1us sleep, the outer loop tries ten times, and ldc.c's write_nonraw() reports -EAGAIN when there is no space for the message.  (Alternatively, it's trying to send a message that's too big for the channel, the channel is corrupted by bugs, or my fly-by of this code I'd never heard of before now is just way off...)
On Sat, May 08, 2021 at 04:57:54PM +1200, Thomas Munro wrote:
> On Sat, May 8, 2021 at 2:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
>
> That's -EAGAIN (assuming errnos match x86) and I guess it indicates
> that VDC_MAX_RETRIES is exceeded here:
>
> https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L451
> https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L526
>
> One theory is that the hypervisor/host is occasionally too swamped to
> service the request queue fast enough over a ~10ms period, given that
> vio_ldc_send() itself retries 1000 times with a 1us sleep, the outer
> loop tries ten times, and ldc.c's write_nonraw() reports -EAGAIN when
> there is no space for the message.  (Alternatively, it's trying to
> send a message that's too big for the channel, the channel is
> corrupted by bugs, or my fly-by of this code I'd never heard of before
> now is just way off...)

Nice discovery.  From https://github.com/torvalds/linux/commit/a11f6ca9aef989b56cd31ff4ee2af4fb31a172ec I see those details are 2.5 years old, somewhat young relative to the driver as a whole.  I don't know which part should change, though.