Thread: Re: [COMMITTERS] pgsql: Modify the isolation tester so that multiple sessions can wait.

Robert Haas wrote:
> Modify the isolation tester so that multiple sessions can wait.

Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and
it's been failing in isolationtester.  (I kindly accepted my suggestion
to put it to run even though it is failing so that we could look at the
logs).  I think this commit is the problem; see the backtrace he sent me
on private email.

#0  0x00001e18c62af87a in thrkill () at <stdin>:2
2       <stdin>: No such file or directory.       in <stdin>
(gdb) bt
#0  0x00001e18c62af87a in thrkill () at <stdin>:2
#1  0x00001e18c62aaf39 in *_libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:52
#2  0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at
/usr/src/lib/libc/string/memcpy.c:65
#3  0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20,
nsteps=11, steps=Variable "steps" is not available.
) at isolationtester.c:617
#4  0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at
isolationtester.c:369
#5  0x00001e16a79053ce in main (argc=Variable "argc" is not available.
) at isolationtester.c:251

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Apr 27, 2016 at 1:51 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Modify the isolation tester so that multiple sessions can wait.
>
> Mikael was kind enough to set up an OpenBSD 5.9 buildfarm member, and
> it's been failing in isolationtester.  (I kindly accepted my suggestion
> to put it to run even though it is failing so that we could look at the
> logs).  I think this commit is the problem; see the backtrace he sent me
> on private email.
>
> #0  0x00001e18c62af87a in thrkill () at <stdin>:2
> 2       <stdin>: No such file or directory.
>         in <stdin>
> (gdb) bt
> #0  0x00001e18c62af87a in thrkill () at <stdin>:2
> #1  0x00001e18c62aaf39 in *_libc_abort () at
> /usr/src/lib/libc/stdlib/abort.c:52
> #2  0x00001e18c62a7838 in *_libc_memcpy (dst0=0x0, src0=0x6, length=0) at
> /usr/src/lib/libc/string/memcpy.c:65
> #3  0x00001e16a7904a7e in run_permutation (testspec=0x1e16a7d08b20,
> nsteps=11, steps=Variable "steps" is not available.
> ) at isolationtester.c:617
> #4  0x00001e16a7904d7c in run_testspec (testspec=0x1e16a7d08b20) at
> isolationtester.c:369
> #5  0x00001e16a79053ce in main (argc=Variable "argc" is not available.
> ) at isolationtester.c:251

That looks like malloc() returned NULL.  I noticed when writing that
patch that isolationtester has never had any checks for malloc
returning NULL, which is bad, and probably worth fixing, but I didn't
choose to stop and fix it at that time.

I don't know off-hand why you see that problem starting at this commit
and not before, or why it doesn't show up on other machines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas wrote:

> That looks like malloc() returned NULL.  I noticed when writing that
> patch that isolationtester has never had any checks for malloc
> returning NULL, which is bad, and probably worth fixing, but I didn't
> choose to stop and fix it at that time.

I didn't actually check closely but I wondered whether the pointer
arithmetic is actually correct.  Note that the memcpy length is zero.
I doubt malloc returning null is the problem; how could it happen
exactly at the same spot every time the suite has run?

> I don't know off-hand why you see that problem starting at this commit
> and not before, or why it doesn't show up on other machines.

Perhaps it's only a problem for OpenBSD's libc and not for glibc which
is the most common.  The only other openbsd machine in buildfarm doesn't
run the isolation tests.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>
>> That looks like malloc() returned NULL.  I noticed when writing that
>> patch that isolationtester has never had any checks for malloc
>> returning NULL, which is bad, and probably worth fixing, but I didn't
>> choose to stop and fix it at that time.
>
> I didn't actually check closely but I wondered whether the pointer
> arithmetic is actually correct.  Note that the memcpy length is zero.
> I doubt malloc returning null is the problem; how could it happen
> exactly at the same spot every time the suite has run?
>
>> I don't know off-hand why you see that problem starting at this commit
>> and not before, or why it doesn't show up on other machines.
>
> Perhaps it's only a problem for OpenBSD's libc and not for glibc which
> is the most common.  The only other openbsd machine in buildfarm doesn't
> run the isolation tests.

Also happens on OpenBSD 5.8.  Isn't this a classic case where memmove
is called for?  Replacing the memcpy at line 617 with memmove makes
the tests run successfully, but at first glance the other two
instances of memcpy in run_permutation should also be changed to
memmove, no?

-- 
Thomas Munro
http://www.enterprisedb.com



On Thu, Apr 28, 2016 at 8:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Apr 28, 2016 at 7:02 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>
>>> That looks like malloc() returned NULL.  I noticed when writing that
>>> patch that isolationtester has never had any checks for malloc
>>> returning NULL, which is bad, and probably worth fixing, but I didn't
>>> choose to stop and fix it at that time.
>>
>> I didn't actually check closely but I wondered whether the pointer
>> arithmetic is actually correct.  Note that the memcpy length is zero.
>> I doubt malloc returning null is the problem; how could it happen
>> exactly at the same spot every time the suite has run?
>>
>>> I don't know off-hand why you see that problem starting at this commit
>>> and not before, or why it doesn't show up on other machines.
>>
>> Perhaps it's only a problem for OpenBSD's libc and not for glibc which
>> is the most common.  The only other openbsd machine in buildfarm doesn't
>> run the isolation tests.
>
> Also happens on OpenBSD 5.8.  Isn't this a classic case where memmove
> is called for?  Replacing the memcpy at line 617 with memmove makes
> the tests run successfully, but at first glance the other two
> instances of memcpy in run_permutation should also be changed to
> memmove, no?

That abort at memcpy.c:65 is indeed a check for overlapping regions:

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memcpy.c?rev=1.2&content-type=text/x-cvsweb-markup

That leaves the question of why the backtrace shows arguments that
wouldn't trigger it (for example memcpy exits early for length == 0).
But it appears that those stack arguments are mangled by the time gdb
dumps them, like this:

#include <string.h>
int main(int argc, char *argv[])
{
  char buffer[] = "hello world";
  memcpy(&buffer[0], &buffer[1], 2);
  return 0;
}

#0  0x00000bce27eab90a in kill () at <stdin>:2
#1  0x00000bce27ee5b19 in abort () at /usr/src/lib/libc/stdlib/abort.c:53
#2  0x00000bce27ebcde8 in memcpy (dst0=0xf7ec5, src0=0x6, length=0) at
/usr/src/lib/libc/string/memcpy.c:65
#3  0x00000bcb5fc00c7a in main (argc=1, argv=0x7f7fffff6818) at foo.c:6

Suggested patch attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Also happens on OpenBSD 5.8.  Isn't this a classic case where memmove
>> is called for?  Replacing the memcpy at line 617 with memmove makes
>> the tests run successfully, but at first glance the other two
>> instances of memcpy in run_permutation should also be changed to
>> memmove, no?

Yeah, that's clearly busted.  Surprising that it has not failed on
any other platforms.

> Suggested patch attached.

Will push in a moment.
        regards, tom lane



On 2016-04-28 00:15, Tom Lane wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> Also happens on OpenBSD 5.8.  Isn't this a classic case where memmove
>>> is called for?  Replacing the memcpy at line 617 with memmove makes
>>> the tests run successfully, but at first glance the other two
>>> instances of memcpy in run_permutation should also be changed to
>>> memmove, no?
>
> Yeah, that's clearly busted.  Surprising that it has not failed on
> any other platforms.
>
>> Suggested patch attached.
>
> Will push in a moment.

And that seems to have done the trick.  I started a manual run of HEAD 
on curculio and now it's green on the build farm.

Thanks!

/Mikael



Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> And that seems to have done the trick.  I started a manual run of HEAD 
> on curculio and now it's green on the build farm.

> Thanks!

Thank *you*, for putting up a buildfarm member that caught this bug.
        regards, tom lane