Thread: REL_13_STABLE Windows 10 Regression Failures
All,

I have been investigating an odd Windows regression failure that I encountered. When building from source on a Windows 10 VM using MinGW (8.1.0), I get a random number of regression failures off the REL_13_STABLE branch. I debugged this a little bit and found out that the "random" number of failures is fully dependent on the machine, and that if I disable the "stats_ext.sql" regression test, all other tests pass without issue. When the "stats_ext.sql" regression test runs, it causes a database exception and PostgreSQL crashes. This is why the number of regression failures appears random: it depends upon the speed of the machine that is running them. However, this behavior is not seen when building off of master.

I did some digging and determined that on the REL_13_STABLE branch this instability was introduced with commit "b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I worked from there to determine when the regressions stopped failing and found that with commit "be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests no longer fail. I then checked out REL_13_STABLE, cherry-picked the "be0a6666656ec3f68eb7d8e7abab5139fcd47012" commit, and ran a clean full build of PostgreSQL. This build ran through with no issues and passed all 200 tests. I have also built using REL_12_STABLE and earlier with no regression failures.

This failure also does not seem to occur on a Server 2016-based system, as the buildfarm machines that run these builds do not see this regression failure. I have not fully investigated why it appears to be confined to Windows 10, or what the issue is that causes the PG database to die during regression. I was hoping that commit "be0a6666656ec3f68eb7d8e7abab5139fcd47012" could be back-patched to the REL_13_STABLE branch so these regression failures would be resolved.

Thanks,
Heath
Heath Lord <heath.lord@crunchydata.com> writes:
> When building from source on a Windows 10 VM using MinGW (8.1.0), I
> get a random number of regression failures off the REL_13_STABLE
> branch. I debugged this a little bit and found out that the "random"
> number of failures is fully dependent on the machine and if I disable
> the "stats_ext.sql" regression test, all other tests pass without
> issue. When the "stats_ext.sql" regression test runs, it causes a
> database exception and PostgreSQL crashes.

Hmph ... it's weird that we have not seen this in the buildfarm.
Have you tried to extract any info from the crash, like a stack trace?

> I did some digging and determined that on the REL_13_STABLE branch
> this instability was introduced with this commit
> "b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
> commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
> worked backwards from there to determine when the regressions stopped
> failing and determined that with commit
> "be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
> no longer failing.

I'm having a hard time believing that b380484a8 would have introduced
a portability problem, and an even harder time believing that be0a66666
would have resolved it if so. What seems more likely is that there's
some underlying issue such as a memory stomp, that the first commit
accidentally exposed and the second one accidentally hid again.
So, even if back-patching be0a66666 seemed feasible from a stability
standpoint (which I don't think it is), I fear it'd just mask a problem
that would eventually bite us again.

So I think we need to dig down and try to identify the root cause,
without any preconceptions about how to fix it. Again, a stack trace
would be pretty useful. Or at least some info about which step of
stats_ext.sql is crashing.

			regards, tom lane
Tom,

We are working to set up our environment to allow us to get a stack trace, as we do not have any of the Visual Studio tooling installed right now. However, I thought I would send you a little more information while we are trying to get that working.

Going through the stats_ext.sql file line by line against a freshly built and initialized REL_13_STABLE database, we have determined that running any combination of the following commands back to back will cause the database to crash:

CREATE STATISTICS tst ON relnatts + relpages FROM pg_class;
CREATE STATISTICS tst ON (relpages, reltuples) FROM pg_class;

If you run another command in between them, like:

SELECT version();

then it will not crash when you run either of those commands again. However, if you run any combination of those two commands back to back, it will crash the database. The output from the psql instance after stepping through the stats_ext.sql file is in the attached stats_ext_psql_output.txt file.

The information from the postgres logfile for the above is attached in the pg_logfile_output.txt file.

Hopefully, this will at least give you some information while we are working on getting the backtrace. Thanks.

-Heath

On Fri, Oct 30, 2020 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Heath Lord <heath.lord@crunchydata.com> writes:
> > When building from source on a Windows 10 VM using MinGW (8.1.0), I
> > get a random number of regression failures off the REL_13_STABLE
> > branch. I debugged this a little bit and found out that the "random"
> > number of failures is fully dependent on the machine and if I disable
> > the "stats_ext.sql" regression test, all other tests pass without
> > issue. When the "stats_ext.sql" regression test runs, it causes a
> > database exception and PostgreSQL crashes.
>
> Hmph ... it's weird that we have not seen this in the buildfarm.
> Have you tried to extract any info from the crash, like a stack trace?
>
> > I did some digging and determined that on the REL_13_STABLE branch
> > this instability was introduced with this commit
> > "b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
> > commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
> > worked backwards from there to determine when the regressions stopped
> > failing and determined that with commit
> > "be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
> > no longer failing.
>
> I'm having a hard time believing that b380484a8 would have introduced
> a portability problem, and an even harder time believing that be0a66666
> would have resolved it if so. What seems more likely is that there's
> some underlying issue such as a memory stomp, that the first commit
> accidentally exposed and the second one accidentally hid again.
> So, even if back-patching be0a66666 seemed feasible from a stability
> standpoint (which I don't think it is), I fear it'd just mask a problem
> that would eventually bite us again.
>
> So I think we need to dig down and try to identify the root cause,
> without any preconceptions about how to fix it. Again, a stack trace
> would be pretty useful. Or at least some info about which step of
> stats_ext.sql is crashing.
>
>			regards, tom lane
All,

I was finally able to get a stack trace.
I apologize for it taking so long, but for some reason when I did a configure with "--enable-cassert --enable-debug" all of the regression tests passed. However, I finally was able to get it to work using only "--enable-debug", and then used the MinGW version of gdb to get the backtrace I have attached. Please let me know if this contains any useful information. I also noticed that the way to trigger the crash was slightly different, and it definitely behaved differently with debug enabled. Thank you in advance for any help.

-Heath
On 2020-Nov-09, Heath Lord wrote:

> Thread 1 received signal ?, Unknown signal.
> [Switching to Thread 4440.0x39c]
> 0x00007ffbf387e598 in ntdll!RtlRaiseStatus () from C:\Windows\SYSTEM32\ntdll.dll
> #0  0x00007ffbf387e598 in ntdll!RtlRaiseStatus () from C:\Windows\SYSTEM32\ntdll.dll
> #1  0x00007ffbf37e6673 in ntdll!RtlUnwindEx () from C:\Windows\SYSTEM32\ntdll.dll
> #2  0x00007ffbf0a734bd in msvcrt!_setjmpex () from C:\Windows\System32\msvcrt.dll
> #3  0x000000000082535a in pg_re_throw () at elog.c:1717
> #4  0x000000000082517a in errfinish (filename=<optimized out>, filename@entry=0x9595a7 <__func__.114045+79> "statscmds.c", lineno=lineno@entry=204, funcname=funcname@entry=0x959990 <__func__.111351> "CreateStatistics") at elog.c:202
> #5  0x000000000056552c in CreateStatistics (stmt=0x4c05858, stmt@entry=0x4b6ee30) at statscmds.c:202

Hmm, line 202 is the ereport in this test:

		if (!IsA(expr, ColumnRef))
			ereport(ERROR,
					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
					 errmsg("only simple column references are allowed in CREATE STATISTICS")));

Not sure why that gives rise to the upper parts of the stack there.
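To make the frames above easier to follow: in the backend, ereport(ERROR)/errfinish() ends up in pg_re_throw(), which siglongjmp()s through the global PG_exception_stack pointer back to the sigsetjmp() established in the outer command loop. As far as I know, Windows builds lack a native sigsetjmp(), so that jump is really the C runtime's plain setjmp/longjmp, which is why msvcrt and ntdll unwinding frames sit directly above pg_re_throw() in the trace. The following is a minimal standalone sketch of that recovery pattern, with simplified stand-in names; it is illustrative only and not the actual backend code:

/* Minimal standalone sketch of the ereport/longjmp recovery pattern
 * discussed above; illustrative only, not PostgreSQL source. */
#include <setjmp.h>
#include <stdio.h>

/* stands in for elog.c's global "sigjmp_buf *PG_exception_stack" */
static jmp_buf *exception_stack = NULL;

/* stands in for pg_re_throw(): jump back to the innermost handler */
static void
re_throw(void)
{
	if (exception_stack != NULL)
		longjmp(*exception_stack, 1);	/* siglongjmp() in the real code */
	/* no handler established: the real code promotes the error to FATAL */
}

/* stands in for ereport(ERROR, ...)/errfinish() raising an error */
static void
report_error(const char *msg)
{
	fprintf(stderr, "ERROR: %s\n", msg);
	re_throw();
}

int
main(void)
{
	static int	raised = 0;		/* survives the longjmp */
	jmp_buf		local_buf;		/* like the command loop's local_sigjmp_buf */

	if (setjmp(local_buf) != 0)	/* sigsetjmp(..., 1) in the real code */
		printf("recovered, back in the command loop\n");

	exception_stack = &local_buf;

	/* simulate CREATE STATISTICS hitting the "only simple column
	 * references" check and raising an ERROR the first time through */
	if (!raised)
	{
		raised = 1;
		report_error("only simple column references are allowed");
	}

	printf("ready for the next command\n");
	return 0;
}

In the healthy case the longjmp lands back at the setjmp and the loop keeps going; the trace above shows the jump blowing up inside the CRT's unwinder instead.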
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hmm, line 202 is the ereport in this test:

> 		if (!IsA(expr, ColumnRef))
> 			ereport(ERROR,
> 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> 					 errmsg("only simple column references are allowed in CREATE STATISTICS")));

> Not sure why that gives rise to the upper parts of the stack there.

Yeah, it seems like the error-recovery longjmp has suddenly broken;
but why here? There's nothing unusual about this specific error case.

			regards, tom lane
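One Windows-specific detail worth keeping in mind here, quoted from memory (check src/include/c.h in the tree for the exact text): on platforms without sigsetjmp(), which includes these Windows builds, the error-recovery jump is mapped straight onto the C runtime's setjmp/longjmp, so any breakage in msvcrt's longjmp unwinding lands exactly in this path.

/* fallback in src/include/c.h (approximate, from memory) */
#ifndef HAVE_SIGSETJMP
#define sigjmp_buf jmp_buf
#define sigsetjmp(x,y) setjmp(x)
#define siglongjmp longjmp
#endif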
On Wed, Nov 11, 2020 at 2:32 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-11-09 12:18:34 -0500, Tom Lane wrote:
> > Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > > Hmm, line 202 is the ereport in this test:
> >
> > >		if (!IsA(expr, ColumnRef))
> > >			ereport(ERROR,
> > >					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > >					 errmsg("only simple column references are allowed in CREATE STATISTICS")));
> >
> > > Not sure why that gives rise to the upper parts of the stack there.
> >
> > Yeah, it seems like the error-recovery longjmp has suddenly broken;
> > but why here? There's nothing unusual about this specific error case.
>
> Perhaps PG_exception_stack got corrupted somehow? An oversized write to
> a neighboring var?
>
> Not sure if that works on mingw, but building with address sanitizer /
> asan might be informative.
This looks like a known issue in MinGW64 builds [1], which is derived from an also known issue in MSVC's handling of setjmp/longjmp [2].

[1] https://sourceforge.net/p/mingw-w64/bugs/406/
[2] https://docs.microsoft.com/en-us/cpp/cpp/using-setjmp-longjmp
Regards,
Juan José Santamaría Flecha
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> On Wed, Nov 11, 2020 at 2:32 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2020-11-09 12:18:34 -0500, Tom Lane wrote:
>>> Yeah, it seems like the error-recovery longjmp has suddenly broken;
>>> but why here? There's nothing unusual about this specific error case.

> This looks like a known issue in MinGW64 builds [1], which is derived from
> an also known issue in MSVC's handling of setjmp/longjmp [2].

> [1] https://sourceforge.net/p/mingw-w64/bugs/406/
> [2] https://docs.microsoft.com/en-us/cpp/cpp/using-setjmp-longjmp

Not sure I believe that. The trouble with diagnosing this as a generic
"setjmp is broken" situation is that then it's not very credible that
the regression tests got this far.

As for [2], that's talking specifically about longjmp not executing C++
destructors, which isn't an issue for us.

			regards, tom lane
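If it would help to separate a wholesale setjmp/longjmp failure from something specific to the backend, a trivial standalone check along these lines could be run on the affected Windows 10 VM (assumption: built with the same MinGW 8.1.0 gcc as the failing PostgreSQL build):

/* Repeatedly round-trip setjmp/longjmp through the CRT to see whether the
 * basic mechanism is broken in isolation. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf buf;

static void
thrower(int i)
{
	longjmp(buf, i + 1);		/* never returns */
}

int
main(void)
{
	volatile int i;

	for (i = 0; i < 100000; i++)
	{
		int		rc = setjmp(buf);

		if (rc == 0)
			thrower(i);
		else if (rc != i + 1)
		{
			printf("mismatch at iteration %d: got %d\n", i, rc);
			return 1;
		}
	}
	printf("completed %d setjmp/longjmp round trips\n", i);
	return 0;
}

If that loop completes cleanly, as one would expect given that the rest of the regression tests pass, it points back toward something about the backend's own state at this particular call site rather than a generally broken CRT longjmp.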