Thread: Buildfarm failures for hash indexes: buffer leaks

Buildfarm failures for hash indexes: buffer leaks

From
Michael Paquier
Date:
Hi all,

Looking at the buildfarm, serinus and moonjelly have just complained
about the failure of $subject:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=moonjelly&dt=2018-10-20%2015%3A17%3A02

The first failure is unrelated to the involved commits, as they touched
completely different areas of the code:
  INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
+ WARNING:  buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)

And versions older than HEAD do not complain.

Any thoughts?
--
Michael


Re: Buildfarm failures for hash indexes: buffer leaks

From
Fabien COELHO
Date:
Hello Michaël,

> The first failure is unrelated to the involved commits, as they touched
> completely different areas of the code:
>  INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
> + WARNING:  buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
>
> And versions older than HEAD do not complain.
>
> Any thoughts?

Both animals use gcc experimental versions, which may rather underline a 
new bug in gcc head than an existing issue in pg. Or not.

-- 
Fabien.

Re: Buildfarm failures for hash indexes: buffer leaks

From
Amit Kapila
Date:
On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Michaël,
>
> > The first failure is unrelated to the involved commits, as they touched
> > completely different areas of the code:
> >  INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
> > + WARNING:  buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
> >
> > And versions older than HEAD do not complain.
> >
> > Any thoughts?
>
> Both animals use gcc experimental versions, which may rather underline a
> new bug in gcc head than an existing issue in pg. Or not.
>

It is possible, but what could be the possible theory?  The warning
indicates that somewhere we forgot to call ReleaseBuffer.  Today, I
reviewed the hash index code related to the failing test case but
didn't find any obvious problem.  What should our next step be?  Do we
want to change the gcc version and see?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Buildfarm failures for hash indexes: buffer leaks

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Both animals use gcc experimental versions, which may rather underline a
>> new bug in gcc head than an existing issue in pg. Or not.

> It is possible, but what could be the possible theory?

It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition.  Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.

            regards, tom lane


Re: Buildfarm failures for hash indexes: buffer leaks

From
Fabien COELHO
Date:
Hello Tom & Amit,

>>> Both animals use gcc experimental versions, which may rather underline a
>>> new bug in gcc head than an existing issue in pg.
>
>> It is possible, but what could be the possible theory?
>
> It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> leak that only occurs in very narrow circumstances, perhaps from a race
> condition.  Given that the hash index code hasn't changed meaningfully
> in several months, I thought (1) seemed more probable.

Yep, that is my thought as well.

The problem is that this kind of issue is not simple to wrap up as a gcc 
bug report, unlike other earlier instances that I forwarded to the clang & 
gcc dev teams.

I'm in favor of waiting before trying to report it, to check whether the 
probable underlying gcc problem is detected, reported by someone else, and 
fixed in gcc head. If it persists, then we'll see.

-- 
Fabien.


Re: Buildfarm failures for hash indexes: buffer leaks

From
Andres Freund
Date:
On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
> 
> Hello Tom & Amit,
> 
> > > > Both animals use gcc experimental versions, which may rather underline a
> > > > new bug in gcc head than an existing issue in pg. Or not.
> > 
> > > It is possible, but what could be the possible theory?
> > 
> > It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> > leak that only occurs in very narrow circumstances, perhaps from a race
> > condition.  Given that the hash index code hasn't changed meaningfully
> > in several months, I thought (1) seemed more probable.
> 
> Yep, that is my thought as well.


FWIW, my animal 'serinus', which runs debian's gcc-snapshot shows the same problem:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02

So it seems much more likely to be 1).


> The problem is that this kind of issue is not simple to wrap up as a gcc bug
> report, unlike other earlier instances that I forwarded to the clang & gcc dev
> teams.
> 
> I'm in favor of waiting before trying to report it, to check whether the
> probable underlying gcc problem is detected, reported by someone else, and
> fixed in gcc head. If it persists, then we'll see.

I suspect the easiest thing to narrow it down would be to bisect the
problem in gcc :(

Greetings,

Andres Freund


Re: Buildfarm failures for hash indexes: buffer leaks

From
Michael Paquier
Date:
On Tue, Oct 23, 2018 at 07:50:57AM -0700, Andres Freund wrote:
> FWIW, my animal 'serinus', which runs debian's gcc-snapshot shows the same problem:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
>
> So it seems much more likely to be 1).

Thanks all for looking at it.  Indeed it looks like a compiler issue
here.
--
Michael


Re: Buildfarm failures for hash indexes: buffer leaks

From
Jeff Janes
Date:


On Tue, Oct 23, 2018 at 10:51 AM Andres Freund <andres@anarazel.de> wrote:
> On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
> >
> > Hello Tom & Amit,
> >
> > > > > Both animals use gcc experimental versions, which may rather underline a
> > > > > new bug in gcc head than an existing issue in pg. Or not.
> > >
> > > > It is possible, but what could be the possible theory?
> > >
> > > It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> > > leak that only occurs in very narrow circumstances, perhaps from a race
> > > condition.  Given that the hash index code hasn't changed meaningfully
> > > in several months, I thought (1) seemed more probable.
> >
> > Yep, that is my thought as well.
>
> FWIW, my animal 'serinus', which runs debian's gcc-snapshot shows the same problem:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
>
> So it seems much more likely to be 1).
>
> > The problem is that this kind of issue is not simple to wrap up as a gcc bug
> > report, unlike other earlier instances that I forwarded to the clang & gcc dev
> > teams.
> >
> > I'm in favor of waiting before trying to report it, to check whether the
> > probable underlying gcc problem is detected, reported by someone else, and
> > fixed in gcc head. If it persists, then we'll see.
>
> I suspect the easiest thing to narrow it down would be to bisect the
> problem in gcc :(

Their commit r265241 is what broke the PostgreSQL build.  It also broke the compiler itself--at that commit it was no longer possible to build itself.  I had to --disable-bootstrap in order to get a r265241 compiler to test PostgreSQL on.

Their commit r265375 fixed the ability to compile itself, but built PostgreSQL binaries remain broken there and thereafter.

I ran this on an AWS c5d.4xlarge ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912 (ami-0f65671a86f061fcd).

Configuring PostgreSQL with ./configure --enable-cassert is necessary in order for the subsequent "make check" to experience the failure.

I'm using PostgreSQL v12dev commit 5953c99697621174f, but the problem is not sensitive to that and reproduces back to v10.0.

I haven't the slightest idea what to do with this information, and GCC's cryptic SVN commit messages don't offer much insight to me.

Cheers,

Jeff

Re: Buildfarm failures for hash indexes: buffer leaks

From
Fabien COELHO
Date:
Hello Jeff,

>> I suspect the easiest thing to narrow it down would be to bisect the
>> problem in gcc :(
>
> Their commit r265241 is what broke the PostgreSQL build.  It also broke the
> compiler itself--at that commit it was no longer possible to build itself.
> I had to --disable-bootstrap in order to get a r265241 compiler to test
> PostgreSQL on.

It seems they made an API change around some kind of "range" analysis, 
which must have been incomplete.

> Their commit r265375 fixed the ability to compile itself, but built
> PostgreSQL binaries remain broken there and thereafter.
>
> [...]

Thanks a lot for this investigation! I can file a gcc bug report. It would 
be an enormous amount of work to narrow it down to a small test case, and 
it is unclear how they could act on it, but at least they would know.

-- 
Fabien.


Re: Buildfarm failures for hash indexes: buffer leaks

From
Andres Freund
Date:
On 2018-10-27 08:18:12 +0200, Fabien COELHO wrote:
> 
> Hello Jeff,
> 
> > > I suspect the easiest thing to narrow it down would be to bisect the
> > > problem in gcc :(
> > 
> > Their commit r265241 is what broke the PostgreSQL build.  It also broke the
> > compiler itself--at that commit it was no longer possible to build itself.
> > I had to --disable-bootstrap in order to get a r265241 compiler to test
> > PostgreSQL on.
> 
> It seems they made an API change around some kind of "range" analysis,
> which must have been incomplete.
> 
> > Their commit r265375 fixed the ability to compile itself, but built
> > PostgreSQL binaries remain broken there and thereafter.
> > 
> > [...]
> 
> Thanks a lot for this investigation! I can file a gcc bug report. It would
> be an enormous amount of work to narrow it down to a small test case, and
> it is unclear how they could act on it, but at least they would know.

Have you done so? If so, what's the bug number?

Greetings,

Andres Freund


Re: Buildfarm failures for hash indexes: buffer leaks

From
Fabien COELHO
Date:
>>> Their commit r265375 fixed the ability to compile itself, but built
>>> PostgreSQL binaries remain broken there and thereafter.
>>>
>>> [...]
>>
>> Thanks a lot for this investigation! I can file a gcc bug report. It would
>> be an enormous amount of work to narrow it down to a small test case, and
>> it is unclear how they could act on it, but at least they would know.
>
> Have you done so? If so, what's the bug number?

Not yet. There was no answer to my suggestion, so it did not feel that 
urgent. I'll try to do it over the next week.

-- 
Fabien.


Re: Buildfarm failures for hash indexes: buffer leaks

From
Andres Freund
Date:
Hi,

On 2018-11-01 22:52:19 +0100, Fabien COELHO wrote:
> 
> > > > Their commit r265375 fixed the ability to compile itself, but built
> > > > PostgreSQL binaries remain broken there and thereafter.
> > > > 
> > > > [...]
> > > 
> > > Thanks a lot for this investigation! I can file a gcc bug report. It would
> > > be an enormous amount of work to narrow it down to a small test case, and
> > > it is unclear how they could act on it, but at least they would know.
> > 
> > Have you done so? If so, what's the bug number?
> 
> Not yet. There was no answer to my suggestion, so it did not feel that
> urgent. I'll try to do it over the next week.

FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD

Greetings,

Andres Freund


Re: Buildfarm failures for hash indexes: buffer leaks

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> just an accident and the optimizations introduced in the above revision
> aren't still broken.
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD

Yeah, I saw that moonjelly had gone green.  Are you going to update
serinus' compiler soon?

(And while I'm bugging you, would you fix your buildfarm menagerie
to build REL_11_STABLE?)

            regards, tom lane


Re: Buildfarm failures for hash indexes: buffer leaks

From
Andres Freund
Date:
Hi,


On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> > just an accident and the optimizations introduced in the above revision
> > aren't still broken.
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
> 
> Yeah, I saw that moonjelly had gone green.  Are you going to update
> serinus' compiler soon?

I'm using debian's gcc-snapshot package, which updates a bit less
frequently than Fabien's setup. serinus will auto-update as soon as that
package does.

What disconcerts me a bit is that afaict there wasn't an intentional fix
in the relevant area (disabling value range propagation fixes the
miscompilation), so I'm not quite sure.


> (And while I'm bugging you, would you fix your buildfarm menagerie
> to build REL_11_STABLE?)

Oh, hum. I thought I had. But I did it only for the LLVM-enabled
animals. Results should be coming up soon.

Greetings,

Andres Freund


Re: Buildfarm failures for hash indexes: buffer leaks

From
Andres Freund
Date:
On 2018-11-19 14:38:04 -0800, Andres Freund wrote:
> Hi,
> 
> 
> On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> > > just an accident and the optimizations introduced in the above revision
> > > aren't still broken.
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
> > 
> > Yeah, I saw that moonjelly had gone green.  Are you going to update
> > serinus' compiler soon?
> 
> I'm using debian's gcc-snapshot package, which updates a bit less
> frequently than Fabien's setup. serinus will auto-update as soon as that
> package does.
> 
> What disconcerts me a bit is that afaict there wasn't an intentional fix
> in the relevant area (disabling value range propagation fixes the
> miscompilation), so I'm not quite sure.

Oops, I somehow deleted too much when editing that sentence. I was trying
to say that I'm not sure whether this is actually fixed, or whether the
bug was just hidden.

The issue is "fixed" in:
https://github.com/gcc-mirror/gcc/commit/48625f587d786089ce1e245f192bcf706f8fc1fc
and had been "caused" (or surfaced) in:
https://github.com/gcc-mirror/gcc/commit/be44111ed7ea7611d324d4d570e2f2d380f784b4

Greetings,

Andres Freund