Thread: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Alvaro Herrera
Date:
Robert Haas wrote: > Modify the isolation tester so that multiple sessions can wait. Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and it's been failing in isolationtester. (I kindly accepted my suggestion to put it to run even though it is failing so that we could look at the logs). I think this commit is the problem; see the backtrace he sent me on private email. #0 0x00001e18c62af87a in thrkill () at <stdin>:2 2 <stdin>: No such file or directory. in <stdin> (gdb) bt #0 0x00001e18c62af87a in thrkill () at <stdin>:2 #1 0x00001e18c62aaf39 in *_libc_abort () at /usr/src/lib/libc/stdlib/abort.c:52 #2 0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65 #3 0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20, nsteps=11, steps=Variable "steps" is not available. ) at isolationtester.c:617 #4 0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at isolationtester.c:369 #5 0x00001e16a79053ce in main (argc=Variable "argc" is not available. ) at isolationtester.c:251 -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Robert Haas
Date:
On Wed, Apr 27, 2016 at 1:51 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> Modify the isolation tester so that multiple sessions can wait. > > Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and > it's been failing in isolationtester. (I kindly accepted my suggestion > to put it to run even though it is failing so that we could look at the > logs). I think this commit is the problem; see the backtrace he sent me > on private email. > > #0 0x00001e18c62af87a in thrkill () at <stdin>:2 > 2 <stdin>: No such file or directory. > in <stdin> > (gdb) bt > #0 0x00001e18c62af87a in thrkill () at <stdin>:2 > #1 0x00001e18c62aaf39 in *_libc_abort () at > /usr/src/lib/libc/stdlib/abort.c:52 > #2 0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at > /usr/src/lib/libc/string/memcpy.c:65 > #3 0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20, > nsteps=11, steps=Variable "steps" is not available. > ) at isolationtester.c:617 > #4 0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at > isolationtester.c:369 > #5 0x00001e16a79053ce in main (argc=Variable "argc" is not available. > ) at isolationtester.c:251 That looks like malloc() returned NULL. I noticed when writing that patch that isolationtester has never had any checks for malloc returning NULL, which is bad, and probably worth fixing, but I didn't choose to stop and fix it at that time. I don't know off-hand why you see that problem starting at this commit and not before, or why it doesn't show up on other machines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Alvaro Herrera
Date:
Robert Haas wrote: > That looks like malloc() returned NULL. I noticed when writing that > patch that isolationtester has never had any checks for malloc > returning NULL, which is bad, and probably worth fixing, but I didn't > choose to stop and fix it at that time. I didn't actually check closely but I wondered whether the pointer arithmetic is actually correct. Note that the memcpy length is zero. I doubt malloc returning null is the problem; how could it happen exactly at the same spot every time the suite has run? > I don't know off-hand why you see that problem starting at this commit > and not before, or why it doesn't show up on other machines. Perhaps it's only a problem for OpenBSD's libc and not for glibc which is the most common. The only other openbsd machine in buildfarm doesn't run the isolation tests. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Thomas Munro
Date:
On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: > >> That looks like malloc() returned NULL. I noticed when writing that >> patch that isolationtester has never had any checks for malloc >> returning NULL, which is bad, and probably worth fixing, but I didn't >> choose to stop and fix it at that time. > > I didn't actually check closely but I wondered whether the pointer > arithmetic is actually correct. Note that the memcpy length is zero. > I doubt malloc returning null is the problem; how could it happen > exactly at the same spot every time the suite has run? > >> I don't know off-hand why you see that problem starting at this commit >> and not before, or why it doesn't show up on other machines. > > Perhaps it's only a problem for OpenBSD's libc and not for glibc which > is the most common. The only other openbsd machine in buildfarm doesn't > run the isolation tests. Also happens on OpenBSD 5.8. Isn't this a classic case where memmove is called for? Replacing the memcpy at line 617 with memmove makes the tests run successfully, but at first glance the other two instances of memcpy in run_permutation should also be changed to memmove, no? -- Thomas Munro http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Thomas Munro
Date:
On Thu, Apr 28, 2016 at 8:46 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Robert Haas wrote: >> >>> That looks like malloc() returned NULL. I noticed when writing that >>> patch that isolationtester has never had any checks for malloc >>> returning NULL, which is bad, and probably worth fixing, but I didn't >>> choose to stop and fix it at that time. >> >> I didn't actually check closely but I wondered whether the pointer >> arithmetic is actually correct. Note that the memcpy length is zero. >> I doubt malloc returning null is the problem; how could it happen >> exactly at the same spot every time the suite has run? >> >>> I don't know off-hand why you see that problem starting at this commit >>> and not before, or why it doesn't show up on other machines. >> >> Perhaps it's only a problem for OpenBSD's libc and not for glibc which >> is the most common. The only other openbsd machine in buildfarm doesn't >> run the isolation tests. > > Also happens on OpenBSD 5.8. Isn't this a classic case where memmove > is called for? Replacing the memcpy at line 617 with memmove makes > the tests run successfully, but at first glance the other two > instances of memcpy in run_permutation should also be changed to > memmove, no? That abort at memcpy.c:65 is indeed a check for overlapping regions: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memcpy.c?rev=1.2&content-type=text/x-cvsweb-markup That leaves the question of why the backtrace shows arguments that wouldn't trigger it (for example memcpy exits early for length == 0). But it appears that those stack arguments are mangled by the time gdb dumps them, like this: #include <string.h> int main(int argc, char *argv[]) { char buffer[] = "hello world"; memcpy(&buffer[0], &buffer[1], 2); return 0; } #0 0x00000bce27eab90a in kill () at <stdin>:2 #1 0x00000bce27ee5b19 in abort () at /usr/src/lib/libc/stdlib/abort.c:53 #2 0x00000bce27ebcde8 in memcpy (dst0=0xf7ec5, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65 #3 0x00000bcb5fc00c7a in main (argc=1, argv=0x7f7fffff6818) at foo.c:6 Suggested patch attached. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes: >> Also happens on OpenBSD 5.8. Isn't this a classic case where memmove >> is called for? Replacing the memcpy at line 617 with memmove makes >> the tests run successfully, but at first glance the other two >> instances of memcpy in run_permutation should also be changed to >> memmove, no? Yeah, that's clearly busted. Surprising that it has not failed on any other platforms. > Suggested patch attached. Will push in a moment. regards, tom lane
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Mikael Kjellström
Date:
On 2016-04-28 00:15, Tom Lane wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >>> Also happens on OpenBSD 5.8. Isn't this a classic case where memmove >>> is called for? Replacing the memcpy at line 617 with memmove makes >>> the tests run successfully, but at first glance the other two >>> instances of memcpy in run_permutation should also be changed to >>> memmove, no? > > Yeah, that's clearly busted. Surprising that it has not failed on > any other platforms. > >> Suggested patch attached. > > Will push in a moment. And that seems to have done the trick. I started a manual run of HEAD on curculio and now it's green on the build farm. Thanks! /Mikael
Re: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.
From
Tom Lane
Date:
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes: > And that seems to have done the trick. I started a manual run of HEAD > on curculio and now it's green on the build farm. > Thanks! Thank *you*, for putting up a buildfarm member that caught this bug. regards, tom lane