Thread: Getting server crash after running sqlsmith

Getting server crash after running sqlsmith

From

tushar

Date:

28 March 2017, 19:12:17

Hi,
After runinng sqlsmith against latest sources of PG v10 , able to see a crash -

here is the standalone testcase -

Make sure 'logging_collector=on' in postgresql.conf file
Connect to psql terminal ,run this query
postgres=# select
          80 as c0,
          pg_catalog.pg_backend_pid() as c1,
          68 as c2,
          subq_1.c0 as c3,
          subq_1.c0 as c4
        from
          (select
                ref_0.specific_schema as c0
              from
                information_schema.role_routine_grants as ref_0,
                lateral (select
                      ref_0.grantor as c0,
                      50 as c1
                    from
                      information_schema.routines as ref_1
                    where (63 = 86)
                      or (pg_catalog.pg_advisory_lock(
                          cast(ref_1.result_cast_datetime_precision as integer),
                          cast(pg_catalog.bttidcmp(
                            cast(null as tid),
                            cast(null as tid)) as integer)) is NULL)
                    limit 143) as subq_0
              where pg_catalog.pg_rotate_logfile() is NULL) as subq_1
        where 50 <> 45;
c0 | c1 | c2 | c3 | c4
----+----+----+----+----
(0 rows)

postgres=# select 1;
FATAL: terminating connection due to unexpected postmaster exit
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

here is the stack trace -
[centos@tushar-centos bin]$ gdb -q -c mdata/core.4254 /home/centos/pg10_28march/postgresql/edbpsql/bin/postgres
Reading symbols from /home/centos/pg10_28march/postgresql/edbpsql/bin/postgres...done.
[New Thread 4254]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/5f/7d4ef6f6ba15505d3c42a7a09e2a7ca9ae5ba6
--
--
Loaded symbols for /lib/libkrb5support.so.0
Reading symbols from /lib/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libkeyutils.so.1
Reading symbols from /lib/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libselinux.so.1
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
Core was generated by `/home/centos/pg10_28march/postgresql/edbpsql/bin/postgres -D mdata'.
Program terminated with signal 11, Segmentation fault.
#0 0x00a75424 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.i686 keyutils-libs-1.4-5.el6.i686 krb5-libs-1.10.3-57.el6.i686 libcom_err-1.41.12-22.el6.i686 libselinux-2.0.94-7.el6.i686 openssl-1.0.1e-48.el6_8.4.i686 zlib-1.2.3-29.el6.i686
(gdb) bt
#0 0x00a75424 in __kernel_vsyscall ()
#1 0x00aa1d7b in sigprocmask () from /lib/libc.so.6
#2 0x083d2d79 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5081
#3 <signal handler called>
#4 0x00a75424 in __kernel_vsyscall ()
#5 0x00aa1d7b in sigprocmask () from /lib/libc.so.6
#6 0x083d2d79 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5081
#7 <signal handler called>
#8 0x00a75424 in __kernel_vsyscall ()
#9 0x00aa1d7b in sigprocmask () from /lib/libc.so.6
#10 0x083d2d79 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5081
#11 <signal handler called>
#12 0x00a75424 in __kernel_vsyscall ()
--
--
#52380 0x00a75424 in __kernel_vsyscall ()
#52381 0x00aa1d7b in sigprocmask () from /lib/libc.so.6
#52382 0x083d2d79 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5081
#52383 <signal handler called>
#52384 0x00a75424 in __kernel_vsyscall ()
#52385 0x00b5208d in ___newselect_nocancel () from /lib/libc.so.6
#52386 0x083ce40e in ServerLoop () at postmaster.c:1693
#52387 0x083cdbcb in PostmasterMain (argc=3, argv=0x957ca10) at postmaster.c:1337
#52388 0x083236fc in main (argc=3, argv=0x957ca10) at main.c:228

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Re: Getting server crash after running sqlsmith

From

Tom Lane

Date:

29 March 2017, 00:36:11

tushar <tushar.ahuja@enterprisedb.com> writes:
> After runinng sqlsmith against latest sources of PG v10  , able to see a 
> crash -

Hm ... I don't see a crash here, but I wonder whether you have parameters
set that would cause this query to be run as a parallel query?  Because
pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
probably insane.
        regards, tom lane

Re: Getting server crash after running sqlsmith

From

Robert Haas

Date:

29 March 2017, 06:57:51

On Tue, Mar 28, 2017 at 2:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> tushar <tushar.ahuja@enterprisedb.com> writes:
>> After runinng sqlsmith against latest sources of PG v10  , able to see a
>> crash -
>
> Hm ... I don't see a crash here, but I wonder whether you have parameters
> set that would cause this query to be run as a parallel query?  Because
> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
> probably insane.

/me blinks

Uh, what's insane about that?  All it does is test a GUC (which is
surely parallel-safe) and call SendPostmasterSignal (which seems safe,
too).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Getting server crash after running sqlsmith

From

Tom Lane

Date:

29 March 2017, 07:23:29

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 28, 2017 at 2:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm ... I don't see a crash here, but I wonder whether you have parameters
>> set that would cause this query to be run as a parallel query?  Because
>> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
>> probably insane.

> /me blinks

> Uh, what's insane about that?  All it does is test a GUC (which is
> surely parallel-safe) and call SendPostmasterSignal (which seems safe,
> too).

Well, if you don't like that theory, what's yours?
        regards, tom lane

Re: Getting server crash after running sqlsmith

From

Thomas Munro

Date:

29 March 2017, 07:49:10

On Wed, Mar 29, 2017 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Mar 28, 2017 at 2:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hm ... I don't see a crash here, but I wonder whether you have parameters
>>> set that would cause this query to be run as a parallel query?  Because
>>> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
>>> probably insane.
>
>> /me blinks
>
>> Uh, what's insane about that?  All it does is test a GUC (which is
>> surely parallel-safe) and call SendPostmasterSignal (which seems safe,
>> too).
>
> Well, if you don't like that theory, what's yours?

I can't reproduce this either.  But here's a theory: this query
signals the postmaster repeatedly fast, and with just the right kind
of difficulty scheduling/waking to the postmaster to deliver the
signal on an overloaded machine, maybe there is always a new SIGUSR1
and PMSIGNAL_ROTATE_LOGFILE waiting once the signal handler reaches
PG_SETMASK(&UnBlockSig), at which point it immediately recurses into
the signal handler until it blows the stack.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: Getting server crash after running sqlsmith

From

Robert Haas

Date:

29 March 2017, 07:49:42

On Tue, Mar 28, 2017 at 9:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Mar 28, 2017 at 2:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hm ... I don't see a crash here, but I wonder whether you have parameters
>>> set that would cause this query to be run as a parallel query?  Because
>>> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
>>> probably insane.
>
>> /me blinks
>
>> Uh, what's insane about that?  All it does is test a GUC (which is
>> surely parallel-safe) and call SendPostmasterSignal (which seems safe,
>> too).
>
> Well, if you don't like that theory, what's yours?

Gremlins?

The stack trace seems to show that the process is receiving SIGUSR1 at
a very high rate.  Every time sigusr1_handler() reaches
PG_SETMASK(&UnBlockSig), it immediately gets a SIGUSR1 and jumps back
into sigusr1_handler().  Now, this seems like a design flaw in
sigusr1_handler().  Likely the operating system blocks SIGUSR1 on
entry to the signal handler so that it's not possible for a high rate
of signal delivery to blow out the stack, but we forcibly unblock it
before returning, thus exposing ourselves to blowing out the stack.
And we have, apparently, no stack depth check here nor any other way
of preventing the infinite recursion.

I imagine here the behavior is platform-dependent, but I'd guess that
select pg_current_logfile() from generate_series(1,1000000) g might
reproduce this on affected platforms with or without parallel query in
the mix.  It looks like we've conveniently provided both a function
that can be used to SIGUSR1 the heck out of the postmaster and a
postmaster that is, at least on such platforms, vulnerable to crashing
if you do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Getting server crash after running sqlsmith

From

tushar

Date:

30 March 2017, 14:33:19

On 03/29/2017 12:06 AM, Tom Lane wrote:
> Hm ... I don't see a crash here,
I am getting this issue only on Linux3.
>   but I wonder whether you have parameters
> set that would cause this query to be run as a parallel query?  Because
> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
> probably insane.
No, i have not changed any parameters except logging_collector=on in 
postgresql.conf file.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company