Thread: segfault in geqo on experimental gcc animal

segfault in geqo on experimental gcc animal

From
Andres Freund
Date:
Hi,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=moonjelly&dt=2019-11-09%2010%3A17%3A06

shows a failure, including a backtrace:

======-=-====== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ======-=-======
[New LWP 42902]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: fabien regression [local] SELECT                                    '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006d962b in gimme_tour (root=root@entry=0x1cfb4b0, edge_table=edge_table@entry=0x1d3afc0, new_gene=<optimized out>, num_gene=5) at geqo_erx.c:209
209            remove_gene(root, new_gene[i - 1], edge_table[(int) new_gene[i - 1]], edge_table);
#0  0x00000000006d962b in gimme_tour (root=root@entry=0x1cfb4b0, edge_table=edge_table@entry=0x1d3afc0, new_gene=<optimized out>, num_gene=5) at geqo_erx.c:209
#1  0x00000000006da0a8 in geqo (root=0x1cfb4b0, number_of_rels=<optimized out>, initial_rels=<optimized out>) at geqo_main.c:190
#2  0x00000000006de084 in make_one_rel (root=root@entry=0x1cfb4b0, joinlist=joinlist@entry=0x1d0a868) at allpaths.c:227
#3  0x0000000000701d19 in query_planner (root=root@entry=0x1cfb4b0, qp_callback=qp_callback@entry=0x702300 <standard_qp_callback>, qp_extra=qp_extra@entry=0x7ffd46b55a60) at planmain.c:269
#4  0x0000000000706844 in grouping_planner () at planner.c:2054
#5  0x00000000007093c7 in subquery_planner (glob=glob@entry=0x1cfb418, parse=parse@entry=0x1cd77b8, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at planner.c:1014
#6  0x000000000070a803 in standard_planner (parse=0x1cd77b8, cursorOptions=256, boundParams=<optimized out>) at planner.c:406
#7  0x00000000007cb1dc in pg_plan_query (querytree=0x1cd77b8, cursorOptions=256, boundParams=0x0) at postgres.c:873
#8  0x00000000007cb2be in pg_plan_queries (querytrees=0x1cfb3c0, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:963
#9  0x00000000007cb618 in exec_simple_query () at postgres.c:1154
#10 0x00000000007cd384 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1c23058, dbname=<optimized out>, username=<optimized out>) at postgres.c:4278
#11 0x000000000074b574 in BackendRun (port=0x1c1c650) at postmaster.c:4498
#12 BackendStartup (port=0x1c1c650) at postmaster.c:4189
#13 ServerLoop () at postmaster.c:1727
#14 0x000000000074c34d in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x1bf35b0) at postmaster.c:1400
#15 0x0000000000491f41 in main (argc=8, argv=0x1bf35b0) at main.c:210
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {30650304, -12, 0 <repeats 26 times>}, _kill =
{si_pid = 30650304, si_uid = 4294967284}, _timer = {si_tid = 30650304, si_overrun = -12, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _rt = {si_pid = 30650304, si_uid = 4294967284, si_sigval = {sival_int = 0, sival_ptr = 0x0}},
_sigchld = {si_pid = 30650304, si_uid = 4294967284, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr =
0xfffffff401d3afc0, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = -51508957248, si_fd
= 0}}}
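
The faulting address in the siginfo dump can be related to the edge_table pointer from frame #0 with a little arithmetic. The sketch below uses only two values copied from the trace; the helper names are made up for illustration. It shows that the low 32 bits of si_addr equal edge_table and that the high 32 bits, read as a signed value, are -12 (matching the first two _pad entries). That is only an observation about the numbers; on its own it does not say what went wrong in the generated code.

```python
from typing import Tuple

# Values copied from the backtrace and siginfo dump.
EDGE_TABLE = 0x1D3AFC0          # edge_table argument of gimme_tour (frame #0)
SI_ADDR = 0xFFFFFFF401D3AFC0    # _sigfault.si_addr


def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned integer as a two's-complement signed value."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def split_fault_address(si_addr: int) -> Tuple[int, int]:
    """Return (low 32 bits, unsigned; high 32 bits, signed) of a 64-bit address."""
    low = si_addr & 0xFFFFFFFF
    high = to_signed(si_addr >> 32, 32)
    return low, high


low, high = split_fault_address(SI_ADDR)
print(hex(low), high)        # 0x1d3afc0 -12
print(low == EDGE_TABLE)     # True
```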
 

I don't think there's been any relevant code changes since the last
success.

last success:
2019-11-09 09:20:28.346 CET [28785:1] LOG:  starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.0.0 20191102 (experimental), 64-bit

first failure:
2019-11-09 11:19:36.277 CET [42512:1] LOG:  starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.0.0 20191109 (experimental), 64-bit


so it sure looks like a gcc upgrade caused the failure. But it's not
clear whether it's a compiler bug, or some undefined behaviour that
triggers the bug.

Fabien, any chance to either bisect or get a bit more information on the
backtrace?


Greetings,

Andres Freund



Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
Hello Andres,

> I don't think there's been any relevant code changes since the last
> success.
>
> last success:
> 2019-11-09 09:20:28.346 CET [28785:1] LOG:  starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.0.0 20191102 (experimental), 64-bit
>
> first failure:
> 2019-11-09 11:19:36.277 CET [42512:1] LOG:  starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.0.0 20191109 (experimental), 64-bit
>
>
> so it sure looks like a gcc upgrade caused the failure. But it's not
> clear whether it's a compiler bug, or some undefined behaviour that
> triggers the bug.
>
> Fabien, any chance to either bisect or get a bit more information on the
> backtrace?

There is a promising "keep_error_builds" option in buildfarm settings, but 
it does not seem to be used anywhere in the scripts. Well, I can probably 
relaunch by hand.

However, given the experimental nature of the setup, I think that the most 
probable cause is a newly introduced gcc bug, so I'd suggest waiting to 
check whether the issue persists before spending time on that, and if it 
persists, investigating further to report a bug to either gcc or pg, 
depending.

Also, I'll recompile gcc before the next weekly builds.

-- 
Fabien.



Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
>> so it sure looks like a gcc upgrade caused the failure. But it's not
>> clear whether it's a compiler bug, or some undefined behaviour that
>> triggers the bug.
>> 
>> Fabien, any chance to either bisect or get a bit more information on 
>> the backtrace?
>
> There is a promising "keep_error_builds" option in buildfarm settings, 
> but it does not seem to be used anywhere in the scripts. Well, I can 
> probably relaunch by hand.
>
> However, given the experimental nature of the setup, I think that the 
> most probable cause is a newly introduced gcc bug, so I'd suggest 
> waiting to check whether the issue persists before spending time on that, 
> and if it persists, investigating further to report a bug to either gcc 
> or pg, depending.
>
> Also, I'll recompile gcc before the next weekly builds.

I did some manual testing.

All the versions I tested failed miserably (master, 12, 11, 10, 
9.6…). It is very probably a newly introduced gcc bug; however, pg 
is not a nice self-contained test case to submit to gcc for debugging :-(

I suggest ignoring it for the time being; if the problem persists, I'll 
try to investigate which gcc commit caused the regression.

-- 
Fabien.

Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
Hello,

I did a (slow) bisection on the gcc sources, which determined that gcc r277979 
was the culprit. I then started a bug report, which showed that the issue 
had already been reported this morning by Martin Liška, including a nice 
example isolated from the sources. See:

     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92506
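
A search over compiler revisions like the one described above can be sketched as a plain binary search. This is a minimal illustration, not the procedure that was actually run: the function name, the revision bounds in the example, and the stand-in predicate are all assumptions. In practice the predicate would build gcc at the given revision, build PostgreSQL with it, and run the regression tests, which is what makes each step slow.

```python
from typing import Callable


def first_bad_revision(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the first revision in (good, bad] for which is_bad() is true.

    Assumes `good` is known to work, `bad` is known to fail, and that every
    revision after the first failing one also fails within this range.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression is after mid
    return bad


# Hypothetical bounds; only r277979 comes from the thread.
KNOWN_GOOD = 277700
KNOWN_BAD = 278100

# Stand-in for "build gcc at this revision, build pg, run the tests".
def fake_is_bad(revision: int) -> bool:
    return revision >= 277979


print(first_bad_revision(KNOWN_GOOD, KNOWN_BAD, fake_is_bad))  # 277979
```

With a range of a few hundred revisions this needs fewer than ten builds, since each step halves the remaining interval.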

-- 
Fabien.

Re: segfault in geqo on experimental gcc animal

From
Martin Liška
Date:
Hi.

Yep, I periodically build the PostgreSQL package in openSUSE with the
latest GCC, which is how I identified the problem and isolated it to a
simple test case. I would expect a fix today or tomorrow.

See you,
Martin

On Thu, 14 Nov 2019 at 16:46, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello,
>
> I did a (slow) dichotomy on gcc sources which determined that gcc r277979
> was the culprit, then I started a bug report which showed that the issue
> was already reported this morning by Martin Liška, including a nice
> example isolated from sources. See:
>
>         https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92506
>
> --
> Fabien.



Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
> Yep, I periodically build the PostgreSQL package in openSUSE with the 
> latest GCC, which is how I identified the problem and isolated it to a 
> simple test case. I would expect a fix today or tomorrow.

Indeed, the gcc issue reported seems fixed by gcc r278259. I'm updating 
moonjelly gcc to check if this solves pg compilation woes.

-- 
Fabien.



Re: segfault in geqo on experimental gcc animal

From
Martin Liška
Date:
Yes, after the revision I see other failing tests like:
...
     select_having                ... ok           16 ms
     subselect                    ... FAILED       92 ms
     union                        ... FAILED       77 ms
     case                         ... ok           32 ms
     join                         ... FAILED      239 ms
     aggregates                   ... FAILED      136 ms
     transactions                 ... ok           59 ms
...
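
A listing like the one above can be summarised with a few lines of script. The sketch below handles only the line shape shown here (test name, `...`, `ok` or `FAILED`, a duration in ms); the function name is made up, and other forms of regression output are not covered.

```python
import re

# Matches a result line such as:
#      subselect                    ... FAILED       92 ms
RESULT_LINE = re.compile(r"^\s*(\S+)\s+\.\.\.\s+(ok|FAILED)\s+(\d+)\s+ms\s*$")


def failed_tests(output: str) -> list:
    """Return the names of tests reported as FAILED, in order of appearance."""
    failed = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line)
        if match and match.group(2) == "FAILED":
            failed.append(match.group(1))
    return failed


# The excerpt quoted in this message; the bare "..." lines are elisions
# and are skipped because they do not match the pattern.
SAMPLE = """\
...
     select_having                ... ok           16 ms
     subselect                    ... FAILED       92 ms
     union                        ... FAILED       77 ms
     case                         ... ok           32 ms
     join                         ... FAILED      239 ms
     aggregates                   ... FAILED      136 ms
     transactions                 ... ok           59 ms
...
"""

print(failed_tests(SAMPLE))  # ['subselect', 'union', 'join', 'aggregates']
```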

I'm going to investigate that and will inform you guys.

Martin

On Fri, 15 Nov 2019 at 11:56, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> > Yep, I build periodically PostgreSQL package in openSUSE with the latest
> > GCC and so that I identified that and isolated to a simple test-case. I
> > would expect a fix today or tomorrow.
>
> Indeed, the gcc issue reported seems fixed by gcc r278259. I'm updating
> moonjelly gcc to check if this solves pg compilation woes.
>
> --
> Fabien.



Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
> Yes, after the revision I see other failing tests like:

Indeed, I can confirm there are still 18/195 failures with the updated gcc.

> I'm going to investigate that and will inform you guys.

Great, thanks!

-- 
Fabien.



Re: segfault in geqo on experimental gcc animal

From
Martin Liška
Date:
Heh, it's me who is now breaking the PostgreSQL build:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92529

Martin

On Fri, 15 Nov 2019 at 13:01, Fabien COELHO
<fabien.coelho@mines-paristech.fr> wrote:
>
>
> > Yes, after the revision I see other failing tests like:
>
> Indeed, I can confirm there are still 18/195 fails with the updated gcc.
>
> > I'm going to investigate that and will inform you guys.
>
> Great, thanks!
>
> --
> Fabien.



Re: segfault in geqo on experimental gcc animal

From
Martin Liška
Date:
Hello.

The issue is resolved now and tests are fine for me.

Martin

On Fri, 15 Nov 2019 at 13:11, Martin Liška <marxin.liska@gmail.com> wrote:
>
> Heh, it's me who now breaks postgresql build:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92529
>
> Martin
>
> On Fri, 15 Nov 2019 at 13:01, Fabien COELHO
> <fabien.coelho@mines-paristech.fr> wrote:
> >
> >
> > > Yes, after the revision I see other failing tests like:
> >
> > Indeed, I can confirm there are still 18/195 fails with the updated gcc.
> >
> > > I'm going to investigate that and will inform you guys.
> >
> > Great, thanks!
> >
> > --
> > Fabien.



Re: segfault in geqo on experimental gcc animal

From
Fabien COELHO
Date:
Hello Martin,

> The issue is resolved now and tests are fine for me.

I recompiled gcc trunk and moonjelly is back to green.

Thanks!

-- 
Fabien.