Thread: Crash and core on 10.1 and 10.2

Crash and core on 10.1 and 10.2

From
Kelly Burkhart
Date:
Hello, I've had two core dumps in the last couple of weeks.  The most recent, yesterday was on version 10.2:

(gdb) bt
#0  0x00007f317a043886 in get_next_seq () from /lib64/libc.so.6
#1  0x00007f317a044acc in strcoll_l () from /lib64/libc.so.6
#2  0x00000000007ced5f in varstrfastcmp_locale ()
#3  0x000000000081b6fb in qsort_ssup ()
#4  0x000000000081d8e1 in tuplesort_performsort ()
#5  0x00000000005eaf00 in finalize_aggregates ()
#6  0x00000000005ebe42 in ExecAgg ()
#7  0x00000000005e30e8 in ExecProcNodeInstr ()
#8  0x00000000005ded12 in standard_ExecutorRun ()
#9  0x0000000000589380 in ExplainOnePlan ()
#10 0x0000000000589667 in ExplainOneQuery ()
#11 0x0000000000589b74 in ExplainQuery ()
#12 0x000000000070036b in standard_ProcessUtility ()
#13 0x00000000006fd9f7 in PortalRunUtility ()
#14 0x00000000006fe702 in FillPortalStore ()
#15 0x00000000006ff110 in PortalRun ()
#16 0x00000000006fb163 in exec_simple_query ()
#17 0x00000000006fc41c in PostgresMain ()
#18 0x0000000000475c7d in ServerLoop ()
#19 0x0000000000697449 in PostmasterMain ()
#20 0x0000000000476691 in main ()

The earlier was last week on 10.1:

(gdb) bt
#0  0x00007f6e1f09d8ea in get_next_seq () from /lib64/libc.so.6
#1  0x00007f6e1f09eacc in strcoll_l () from /lib64/libc.so.6
#2  0x00000000007cf70b in varstr_cmp ()
#3  0x000000000075f25b in compareJsonbContainers ()
#4  0x000000000075d8f2 in jsonb_eq ()
#5  0x00000000005db2bc in ExecInterpExpr ()
#6  0x00000000005e3b09 in ExecScan ()
#7  0x00000000005de352 in standard_ExecutorRun ()
#8  0x00000000005e262a in ParallelQueryMain ()
#9  0x00000000004db981 in ParallelWorkerMain ()
#10 0x00000000006895ff in StartBackgroundWorker ()
#11 0x000000000069470d in maybe_start_bgworkers ()
#12 0x0000000000695245 in sigusr1_handler ()
#13 <signal handler called>
#14 0x00007f6e1f0f9783 in __select_nocancel () from /lib64/libc.so.6
#15 0x0000000000475327 in ServerLoop ()
#16 0x0000000000696409 in PostmasterMain ()
#17 0x0000000000476671 in main ()


These are running on a centos 7 host, dell r640.  My modifications to the default postgresql.conf are:

max_connections = 1000
shared_buffers = 12GB
work_mem = 1GB
maintenance_work_mem = 1GB
constraint_exclusion = partition
track_activities = on
track_counts = on
track_functions = all
track_io_timing = on
autovacuum = on
datestyle = 'iso, ymd'
log_destination = 'csvlog'
logging_collector = on
log_directory = '/opt/pgdata/dblog'
log_filename = 'postgresql-%w.log'
log_file_mode = 0644
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_duration = on
log_hostname = off
log_line_prefix = '#%r:%p:%m:%i# '
log_lock_Waits = on
log_statement = 'none'
log_temp_files = 0
update_process_title = on
temp_tablespaces = 'tb_temp'
idle_in_transaction_session_timeout = 300000     # in milliseconds, 0 is disabled


The database is quite busy at the time of the crash with 99% of statements being very simple.  Each time the crash happened during an ad-hoc query of a table with a jsonb field and a gin index on that field.  The most recent query:

server process (PID 318386) was terminated by signal 11: Segmentation fault  Failed process was running: explain analyze select count(distinct other_keys->>'shard_dt') from krb_dataset_shard where dataset_id=22 and other_keys->>'symbol' = 'AAPL';

The table looks like this:

                           Table "tb_us.krb_dataset_shard"
   Column   |  Type  |                           Modifiers                            
------------+--------+----------------------------------------------------------------
 id         | bigint | not null default nextval('krb_dataset_shard_id_seq'::regclass)
 dataset_id | bigint | not null
 other_keys | jsonb  | not null default '{}'::jsonb
Indexes:
    "krb_dataset_shard_pkey" PRIMARY KEY, btree (id), tablespace "tb_idx_tablespace"
    "krb_dataset_shard_ak1" btree (dataset_id), tablespace "tb_idx_tablespace"
    "krb_dataset_shard_ak2" gin (other_keys), tablespace "tb_idx_tablespace"
Foreign-key constraints:
    "dataset_id_fk" FOREIGN KEY (dataset_id) REFERENCES dataset(id)


I do not have the query from the previous crash.

Please let me know if there is any other information that I can provide.

Thanks,

-Kelly



Re: Crash and core on 10.1 and 10.2

From
Tom Lane
Date:
Kelly Burkhart <kelly.burkhart@gmail.com> writes:
> Hello, I've had two core dumps in the last couple of weeks.  The most
> recent, yesterday was on version 10.2:

> (gdb) bt
> #0  0x00007f317a043886 in get_next_seq () from /lib64/libc.so.6
> #1  0x00007f317a044acc in strcoll_l () from /lib64/libc.so.6
> #2  0x00000000007ced5f in varstrfastcmp_locale ()
> #3  0x000000000081b6fb in qsort_ssup ()
> #4  0x000000000081d8e1 in tuplesort_performsort ()

Hm.  If you'd just showed this one, my thoughts might bend towards a bug
in our sort abbreviation logic, which is relatively new ...

> (gdb) bt
> #0  0x00007f6e1f09d8ea in get_next_seq () from /lib64/libc.so.6
> #1  0x00007f6e1f09eacc in strcoll_l () from /lib64/libc.so.6
> #2  0x00000000007cf70b in varstr_cmp ()
> #3  0x000000000075f25b in compareJsonbContainers ()
> #4  0x000000000075d8f2 in jsonb_eq ()

... but this stack trace is not going anywhere near that code.  The
common factor is just strcoll_l(), raising the possibility that you're
dealing with a glibc bug, or perhaps corrupted locale data on your
machine.  Are you up-to-date on glibc patches?

            regards, tom lane


Re: Crash and core on 10.1 and 10.2

From
Kelly Burkhart
Date:
We're on centos 7.0, glibc-2.17-55.  Current centos is 7.4, glibc-2.17-196.  We have some hosts on the newer centos, I'll ask our sysadmin about upgrading.

Do you know of glibc issues that would be of relevance?

Our main production database has been running the same centos and pg 9.4.4 without any issue for a very long time.  (Although we use no json or gin stuff there).

-K

On Thu, Mar 8, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kelly Burkhart <kelly.burkhart@gmail.com> writes:
> Hello, I've had two core dumps in the last couple of weeks.  The most
> recent, yesterday was on version 10.2:

> (gdb) bt
> #0  0x00007f317a043886 in get_next_seq () from /lib64/libc.so.6
> #1  0x00007f317a044acc in strcoll_l () from /lib64/libc.so.6
> #2  0x00000000007ced5f in varstrfastcmp_locale ()
> #3  0x000000000081b6fb in qsort_ssup ()
> #4  0x000000000081d8e1 in tuplesort_performsort ()

Hm.  If you'd just showed this one, my thoughts might bend towards a bug
in our sort abbreviation logic, which is relatively new ...

> (gdb) bt
> #0  0x00007f6e1f09d8ea in get_next_seq () from /lib64/libc.so.6
> #1  0x00007f6e1f09eacc in strcoll_l () from /lib64/libc.so.6
> #2  0x00000000007cf70b in varstr_cmp ()
> #3  0x000000000075f25b in compareJsonbContainers ()
> #4  0x000000000075d8f2 in jsonb_eq ()

... but this stack trace is not going anywhere near that code.  The
common factor is just strcoll_l(), raising the possibility that you're
dealing with a glibc bug, or perhaps corrupted locale data on your
machine.  Are you up-to-date on glibc patches?

                        regards, tom lane