Thread: Buildfarm failures for hash indexes: buffer leaks
Hi all,

Looking at the buildfarm, serinus and moonjelly have just complained
about the failure of $subject:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=moonjelly&dt=2018-10-20%2015%3A17%3A02

The first failure is unrelated to the involved commits, as they touched
completely different areas of the code:

INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
+ WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)

And versions older than HEAD do not complain.

Any thoughts?
--
Michael
Hello Michaël,

> The first failure is unrelated to the involved commits, as they touched
> completely different areas of the code:
> INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
> + WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
>
> And versions older than HEAD do not complain.
>
> Any thoughts?

Both animals use gcc experimental versions, which may underline a new
bug in gcc head rather than an existing issue in pg. Or not.

--
Fabien.
On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Michaël,
>
> > The first failure is unrelated to the involved commits, as they touched
> > completely different areas of the code:
> > INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
> > + WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
> >
> > And versions older than HEAD do not complain.
> >
> > Any thoughts?
>
> Both animals use gcc experimental versions, which may underline a new
> bug in gcc head rather than an existing issue in pg. Or not.
>

It is possible, but what could be the possible theory? The warning
indicates that somewhere we forgot to call ReleaseBuffer. Today, I
reviewed the hash index code related to the failing test case, but
didn't find any obvious problem. What should our next step be? Do we
want to change the gcc version and see?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
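[For context, a minimal sketch of the pin/release discipline that the
warning is about, assuming the standard bufmgr API; this is illustrative
only, not the code path that actually leaked. Note that the
end-of-transaction leak check that emits the WARNING is only compiled
into assert-enabled (--enable-cassert) builds.]

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Illustrative sketch: every buffer pinned with ReadBuffer() must be
 * released with ReleaseBuffer() before end of transaction.  A pin
 * still held at that point is what gets reported as a "buffer
 * refcount leak".
 */
static void
read_one_block(Relation rel, BlockNumber blkno)
{
	Buffer		buf = ReadBuffer(rel, blkno);	/* acquires a pin */

	LockBuffer(buf, BUFFER_LOCK_SHARE);
	/* ... examine the page contents ... */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	ReleaseBuffer(buf);		/* forgetting this would produce the WARNING */
}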
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Both animals use gcc experimental versions, which may underline a new
>> bug in gcc head rather than an existing issue in pg. Or not.

> It is possible, but what could be the possible theory?

It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition.  Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.

			regards, tom lane
Hello Tom & Amit,

>>> Both animals use gcc experimental versions, which may underline a new
>>> bug in gcc head rather than an existing issue in pg. Or not.

>> It is possible, but what could be the possible theory?

> It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> leak that only occurs in very narrow circumstances, perhaps from a race
> condition. Given that the hash index code hasn't changed meaningfully
> in several months, I thought (1) seemed more probable.

Yep, that is my thought as well.

The problem is that this kind of issue is not simple to wrap up as a gcc
bug report, unlike other earlier instances that I forwarded to the clang
& gcc dev teams.

I'm in favor of waiting before trying to report it, to check whether the
probable underlying gcc problem is detected, reported by someone else,
and fixed in gcc head. If it persists, then we'll see.

--
Fabien.
On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
> Hello Tom & Amit,
>
> > > > Both animals use gcc experimental versions, which may underline a new
> > > > bug in gcc head rather than an existing issue in pg. Or not.
>
> > > It is possible, but what could be the possible theory?
>
> > It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> > leak that only occurs in very narrow circumstances, perhaps from a race
> > condition. Given that the hash index code hasn't changed meaningfully
> > in several months, I thought (1) seemed more probable.
>
> Yep, that is my thought as well.

FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the
same problem:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02

So it seems much more likely to be (1).

> The problem is that this kind of issue is not simple to wrap up as a gcc
> bug report, unlike other earlier instances that I forwarded to the clang
> & gcc dev teams.
>
> I'm in favor of waiting before trying to report it, to check whether the
> probable underlying gcc problem is detected, reported by someone else,
> and fixed in gcc head. If it persists, then we'll see.

I suspect the easiest way to narrow it down would be to bisect the
problem in gcc :(

Greetings,

Andres Freund
On Tue, Oct 23, 2018 at 07:50:57AM -0700, Andres Freund wrote:
> FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the
> same problem:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
>
> So it seems much more likely to be (1).

Thanks all for looking at it. Indeed it looks like a compiler issue
here.
--
Michael
On Tue, Oct 23, 2018 at 10:51 AM Andres Freund <andres@anarazel.de> wrote:
> On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
> >
> > Hello Tom & Amit,
> >
> > > > > Both animals use gcc experimental versions, which may underline a
> > > > > new bug in gcc head rather than an existing issue in pg. Or not.
> > >
> > > > It is possible, but what could be the possible theory?
> > >
> > > It seems like the two feasible theories are (1) gcc bug, or (2) buffer
> > > leak that only occurs in very narrow circumstances, perhaps from a race
> > > condition. Given that the hash index code hasn't changed meaningfully
> > > in several months, I thought (1) seemed more probable.
> >
> > Yep, that is my thought as well.
>
> FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the
> same problem:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
>
> So it seems much more likely to be (1).
>
> > The problem is that this kind of issue is not simple to wrap up as a gcc
> > bug report, unlike other earlier instances that I forwarded to the clang
> > & gcc dev teams.
> >
> > I'm in favor of waiting before trying to report it, to check whether the
> > probable underlying gcc problem is detected, reported by someone else,
> > and fixed in gcc head. If it persists, then we'll see.
>
> I suspect the easiest way to narrow it down would be to bisect the
> problem in gcc :(
Their commit r265241 is what broke the PostgreSQL build. It also broke
the compiler itself: at that commit, gcc could no longer build itself,
so I had to use --disable-bootstrap to get an r265241 compiler to test
PostgreSQL on.

Their commit r265375 restored the compiler's ability to build itself,
but PostgreSQL binaries built with it remain broken there and
thereafter.
I ran this on an AWS c5d.4xlarge instance
(ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912,
ami-0f65671a86f061fcd).

Configuring PostgreSQL with ./configure --enable-cassert is necessary
for the subsequent "make check" to hit the failure.
I'm using PostgreSQL v12dev commit 5953c99697621174f, but the problem is not sensitive to that and reproduces back to v10.0.
I haven't the slightest idea what to do with this information, and GCC's cryptic SVN commit messages don't offer much insight to me.
Cheers,
Jeff
Hello Jeff,

>> I suspect the easiest way to narrow it down would be to bisect the
>> problem in gcc :(
>
> Their commit r265241 is what broke the PostgreSQL build. It also broke
> the compiler itself: at that commit, gcc could no longer build itself,
> so I had to use --disable-bootstrap to get an r265241 compiler to test
> PostgreSQL on.

It seems they have made an API change around some kind of "range"
analysis, which must have been incomplete.

> Their commit r265375 restored the compiler's ability to build itself,
> but PostgreSQL binaries built with it remain broken there and
> thereafter.
>
> [...]

Thanks a lot for this investigation! I can file a gcc bug report. It
would be an enormous amount of work to narrow it down to a small test
case, and it is unclear how they could act on it, but at least they
would know.

--
Fabien.
On 2018-10-27 08:18:12 +0200, Fabien COELHO wrote:
> Hello Jeff,
>
> > > I suspect the easiest way to narrow it down would be to bisect the
> > > problem in gcc :(
> >
> > Their commit r265241 is what broke the PostgreSQL build. It also broke
> > the compiler itself: at that commit, gcc could no longer build itself,
> > so I had to use --disable-bootstrap to get an r265241 compiler to test
> > PostgreSQL on.
>
> It seems they have made an API change around some kind of "range"
> analysis, which must have been incomplete.
>
> > Their commit r265375 restored the compiler's ability to build itself,
> > but PostgreSQL binaries built with it remain broken there and
> > thereafter.
> >
> > [...]
>
> Thanks a lot for this investigation! I can file a gcc bug report. It
> would be an enormous amount of work to narrow it down to a small test
> case, and it is unclear how they could act on it, but at least they
> would know.

Have you done so? If so, what's the bug number?

Greetings,

Andres Freund
>>> Their commit r265375 restored the compiler's ability to build itself,
>>> but PostgreSQL binaries built with it remain broken there and
>>> thereafter.
>>>
>>> [...]
>>
>> Thanks a lot for this investigation! I can file a gcc bug report. It
>> would be an enormous amount of work to narrow it down to a small test
>> case, and it is unclear how they could act on it, but at least they
>> would know.
>
> Have you done so? If so, what's the bug number?

Not yet. There was no answer to my suggestion, so it did not feel that
urgent. I'll try to do it over the next week.

--
Fabien.
Hi,

On 2018-11-01 22:52:19 +0100, Fabien COELHO wrote:
> > > > Their commit r265375 restored the compiler's ability to build
> > > > itself, but PostgreSQL binaries built with it remain broken there
> > > > and thereafter.
> > > >
> > > > [...]
> > >
> > > Thanks a lot for this investigation! I can file a gcc bug report. It
> > > would be an enormous amount of work to narrow it down to a small
> > > test case, and it is unclear how they could act on it, but at least
> > > they would know.
> >
> > Have you done so? If so, what's the bug number?
>
> Not yet. There was no answer to my suggestion, so it did not feel that
> urgent. I'll try to do it over the next week.

FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> just an accident and the optimizations introduced in the above revision
> aren't still broken.
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD

Yeah, I saw that moonjelly had gone green.  Are you going to update
serinus' compiler soon?

(And while I'm bugging you, would you fix your buildfarm menagerie
to build REL_11_STABLE?)

			regards, tom lane
Hi,

On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> > just an accident and the optimizations introduced in the above revision
> > aren't still broken.
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
>
> Yeah, I saw that moonjelly had gone green.  Are you going to update
> serinus' compiler soon?

I'm using debian's gcc-snapshot package, which updates a bit less
frequently than what Fabien does. It'll auto-update as soon as that has
happened.

What disconcerts me a bit is that afaict there wasn't an intentional fix
in the relevant area (disabling value range propagation fixes the
miscompilation), so I'm not quite sure.

> (And while I'm bugging you, would you fix your buildfarm menagerie
> to build REL_11_STABLE?)

Oh, hum. I thought I had. But I did it only for the LLVM-enabled
animals. Results should be coming up soon.

Greetings,

Andres Freund
On 2018-11-19 14:38:04 -0800, Andres Freund wrote:
> Hi,
>
> On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
> > > just an accident and the optimizations introduced in the above revision
> > > aren't still broken.
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
> >
> > Yeah, I saw that moonjelly had gone green.  Are you going to update
> > serinus' compiler soon?
>
> I'm using debian's gcc-snapshot package, which updates a bit less
> frequently than what Fabien does. It'll auto-update as soon as that has
> happened.
>
> What disconcerts me a bit is that afaict there wasn't an intentional fix
> in the relevant area (disabling value range propagation fixes the
> miscompilation), so I'm not quite sure.

Ooops, somehow deleted too much when editing the sentence. I was trying
to say that I'm not sure whether this is actually fixed, or whether a
bug was just hidden.

The issue is "fixed" in:
https://github.com/gcc-mirror/gcc/commit/48625f587d786089ce1e245f192bcf706f8fc1fc

and had been "caused" (or surfaced) in:
https://github.com/gcc-mirror/gcc/commit/be44111ed7ea7611d324d4d570e2f2d380f784b4

Greetings,

Andres Freund
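[For readers unfamiliar with the pass in question: value range
propagation (VRP) is the gcc optimization that tracks the possible
range of each variable and simplifies or deletes code based on it.
Below is a minimal, hypothetical illustration of the technique; it is
not PostgreSQL code, and not the actual miscompiled pattern, which was
never reduced to a test case.]

/*
 * Hypothetical illustration of value range propagation (VRP).
 */
int
clamped(int refcount)
{
	if (refcount < 0)
		refcount = 0;

	/*
	 * Here VRP has proven refcount >= 0, so the compiler may legally
	 * delete the branch below as unreachable.  A buggy version of the
	 * pass that derives a range narrower than the true one can instead
	 * delete code that is reachable, which is how a miscompiled binary
	 * could, for example, skip a ReleaseBuffer() call.
	 */
	if (refcount < 0)
		return -1;

	return refcount;
}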