Thread: [HACKERS] SIGSEGV in BRIN autosummarize
I upgraded one of our customers to PG10 Tuesday night, and Wednesday replaced a BTREE index with a BRIN index (WITH autosummarize).

Today I see:
< 2017-10-13 17:22:47.839 -04 >LOG: server process (PID 32127) was terminated by signal 11: Segmentation fault
< 2017-10-13 17:22:47.839 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 747263

postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]

[pryzbyj@database ~]$ rpm -qa postgresql10
postgresql10-10.0-1PGDG.rhel6.x86_64

Oct 13 17:22:45 database kernel: postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]
Oct 13 17:22:47 database abrtd: Directory 'ccpp-2017-10-13-17:22:47-32127' creation detected
Oct 13 17:22:47 database abrt[32387]: Saved core dump of pid 32127 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-13-17:22:47-32127 (15040512 bytes)

..unfortunately:
Oct 13 17:22:47 database abrtd: Package 'postgresql10-server' isn't signed with proper key
Oct 13 17:22:47 database abrtd: 'post-create' on '/var/spool/abrt/ccpp-2017-10-13-17:22:47-32127' exited with 1
Oct 13 17:22:47 database abrtd: DELETING PROBLEM DIRECTORY '/var/spool/abrt/ccpp-2017-10-13-17:22:47-32127'

postgres=# SELECT * FROM bak_postgres_log_2017_10_13_1700 WHERE pid=32127 ORDER BY log_time DESC LIMIT 9;
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------
log_time               | 2017-10-13 17:22:45.56-04
pid                    | 32127
session_id             | 59e12e67.7d7f
session_line           | 2
command_tag            |
session_start_time     | 2017-10-13 17:21:43-04
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | processing work entry for relation "gtt.public.cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx"
-[ RECORD 2 ]----------+---------------------------------------------------------------------------------------------------------
log_time               | 2017-10-13 17:22:44.557-04
pid                    | 32127
session_id             | 59e12e67.7d7f
session_line           | 1
session_start_time     | 2017-10-13 17:21:43-04
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_huawei_sgsnpdprecord_2017_10_13"

Time: 375.552 ms

It looks like this table was being inserted into simultaneously by a python
program using multiprocessing.  It looks like each subprocess was INSERTing
into several tables, each of which has one BRIN index on timestamp column.

gtt=# \dt+ cdrs_eric_egsnpdprecord_2017_10_13
 public | cdrs_eric_egsnpdprecord_2017_10_13 | table | gtt | 5841 MB |

gtt=# \di+ cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx
 public | cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx | index | gtt | cdrs_eric_egsnpdprecord_2017_10_13 | 136 kB |

I don't have any reason to believe there's a memory issue on the server, so I
suppose this is just a "heads up" to early adopters until/in case it happens
again and I can at least provide a stack trace.

Justin
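(For context, the index type being described here -- a BRIN index with on-insert autosummarization on a timestamp column -- is declared roughly as follows; the table and column names below are only illustrative, not taken from the customer's schema:)

CREATE INDEX cdrs_example_recordopeningtime_idx
    ON cdrs_example
    USING brin (recordopeningtime)
    WITH (autosummarize = true);

With autosummarize enabled, filling a page range queues a work item for the autovacuum launcher, which is why the crashing process above reports itself as "autovacuum: BRIN summarize".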
On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
> I don't have any reason to believe there's a memory issue on the server, so I
> suppose this is just a "heads up" to early adopters until/in case it happens
> again and I can at least provide a stack trace.

I'm back; find stacktrace below.

> Today I see:
> < 2017-10-13 17:22:47.839 -04 >LOG: server process (PID 32127) was terminated by signal 11: Segmentation fault
> < 2017-10-13 17:22:47.839 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 747263

Is it a coincidence the server failed within 45m of yesterday's failure ?

postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in postgres[400000+692000]
< 2017-10-14 18:05:36.432 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 41087

> It looks like this table was being inserted into simultaneously by a python
> program using multiprocessing.  It looks like each subprocess was INSERTing
> into several tables, each of which has one BRIN index on timestamp column.

I should add:
These are insert-only child tables in a hierarchy (not PG10 partitions), being
inserted into directly (not via trigger/rule).

Also notice the vacuum process was interrupted, same as yesterday (thank
goodness for full logs).  Our INSERT script is using python
multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
one file and then exit the subprocess, and pool() creates a new subproc, which
starts a new PG session and transaction.  Which explains why autovacuum starts
processing the table only to be immediately interrupted.

postgres=# SELECT * FROM postgres_log_2017_10_14_1800 WHERE pid=26500 ORDER BY log_time DESC LIMIT 9;
log_time               | 2017-10-14 18:05:34.132-04
pid                    | 26500
session_id             | 59e289b4.6784
session_line           | 2
session_start_time     | 2017-10-14 18:03:32-04
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | processing work entry for relation "gtt.public.cdrs_eric_ggsnpdprecord_2017_10_14_recordopeningtime_idx"
-----------------------+---------------------------------------------------------------------------------------------------------
log_time               | 2017-10-14 18:05:32.925-04
pid                    | 26500
session_id             | 59e289b4.6784
session_line           | 1
session_start_time     | 2017-10-14 18:03:32-04
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_eric_egsnpdprecord_2017_10_14"

gtt=# \dt+ *record_2017_10_14
 public | cdrs_eric_egsnpdprecord_2017_10_14 | table | gtt | 1642 MB |
 public | cdrs_eric_ggsnpdprecord_2017_10_14 | table | gtt |  492 MB |

gtt=# \di+ *_2017_10_14*_recordopeningtime_idx
 public | cdrs_eric_egsnpdprecord_2017_10_14_recordopeningtime_idx | index | gtt | cdrs_eric_egsnpdprecord_2017_10_14 | 72 kB |
 public | cdrs_eric_ggsnpdprecord_2017_10_14_recordopeningtime_idx | index | gtt | cdrs_eric_ggsnpdprecord_2017_10_14 | 48 kB |

Due to a .."behavioral deficiency" in the loader for those tables, the crashed
backend causes the loader to get stuck, so the tables should be untouched since
the crash, should it be desirable to inspect them.
#0  pfree (pointer=0x298c740) at mcxt.c:954
        context = 0x7474617261763a20
#1  0x00000000006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at autovacuum.c:2676
        cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull fal"...
        cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyv"...
        cur_relname = 0x298cd68 "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
        __func__ = "perform_work_item"
#2  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
        workitem = 0x7f8ad1f94824
        classRel = 0x7f89c26d0e58
        tuple = <value optimized out>
        relScan = <value optimized out>
        dbForm = <value optimized out>
        table_oids = <value optimized out>
        orphan_oids = 0x0
        ctl = {num_partitions = 0, ssize = 0, dsize = 0, max_dsize = 0, ffactor = 0, keysize = 4, entrysize = 80, hash = 0, match = 0, keycopy = 0, alloc = 0, hcxt = 0x0, hctl = 0x0}
        table_toast_map = 0x29c8188
        cell = 0x0
        shared = 0x298ce18
        dbentry = 0x298d0a0
        bstrategy = 0x2a61c18
        key = {sk_flags = 0, sk_attno = 16, sk_strategy = 3, sk_subtype = 0, sk_collation = 100, sk_func = {fn_addr = 0x750430 <chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x28f3608, fn_expr = 0x0}, sk_argument = 116}
        pg_class_desc = 0x29b66a8
        effective_multixact_freeze_max_age = 400000000
        did_vacuum = <value optimized out>
        found_concurrent_worker = <value optimized out>
        i = <value optimized out>
        __func__ = "do_autovacuum"
#3  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1694
        dbname = "gtt\000\000\000\000\000\325I\001\327;\000\000\000\204g\000\000\000\000\000\000E\352\200\327;\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\374\024\316", '\000' <repeats 12 times>
        local_sigjmp_buf = {{__jmpbuf = {0, -5392960675600994449, 42895168, 1508018581, 26499, 1508015202, 5391772750539845487, -5392961587993972881}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964103, 13184024, 1, 13182664, 0, 26499, 1508015202, 0, 257010229664, 1, 0, 140727207371280, 257014425560, 4, 0, 4}}}}
        dbid = 16400
        __func__ = "AutoVacWorkerMain"
#4  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
        worker_pid = 0
        __func__ = "StartAutoVacWorker"
#5  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
        bn = 0x28e8740
#6  sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
        save_errno = 4
        __func__ = "sigusr1_handler"
#7  <signal handler called>
No symbol table info available.
#8  0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
No locals.
#9  0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
        timeout = {tv_sec = 59, tv_usec = 686521}
        rmask = {fds_bits = {120, 0 <repeats 15 times>}}
        selres = <value optimized out>
        now = <value optimized out>
        readmask = {fds_bits = {120, 0 <repeats 15 times>}}
        nSockets = 7
        last_lockfile_recheck_time = 1508018581
        last_touch_time = 1508015202
#10 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
        opt = <value optimized out>
        status = <value optimized out>
        userDoption = <value optimized out>
        listen_addr_saved = <value optimized out>
        i = <value optimized out>
        output_config_variable = <value optimized out>
        __func__ = "PostmasterMain"
#11 0x0000000000638450 in main (argc=3, argv=0x28c03a0) at main.c:228
Hi,

On 10/15/2017 12:42 AM, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
>> I don't have any reason to believe there's a memory issue on the server, so I
>> suppose this is just a "heads up" to early adopters until/in case it happens
>> again and I can at least provide a stack trace.
>
> I'm back; find stacktrace below.
>
>> Today I see:
>> < 2017-10-13 17:22:47.839 -04 >LOG: server process (PID 32127) was terminated by signal 11: Segmentation fault
>> < 2017-10-13 17:22:47.839 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 747263
>
> Is it a coincidence the server failed within 45m of yesterday's failure ?
>

Most likely just a coincidence.

> postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in postgres[400000+692000]
> < 2017-10-14 18:05:36.432 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 41087
>
>> It looks like this table was being inserted into simultaneously by a python
>> program using multiprocessing.  It looks like each subprocess was INSERTing
>> into several tables, each of which has one BRIN index on timestamp column.
>
> I should add:
> These are insert-only child tables in a hierarchy (not PG10 partitions), being
> inserted into directly (not via trigger/rule).
>
> Also notice the vacuum process was interrupted, same as yesterday (thank
> goodness for full logs).  Our INSERT script is using python
> multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> one file and then exit the subprocess, and pool() creates a new subproc, which
> starts a new PG session and transaction.  Which explains why autovacuum starts
> processing the table only to be immediately interrupted.
>

I don't follow. Why does it explain that autovacuum gets canceled? I
mean, merely opening a new connection/session should not cancel
autovacuum. That requires a command that requires a table-level lock
conflicting with autovacuum (so e.g. explicit LOCK command, DDL, ...).

> ...
> Due to a .."behavioral deficiency" in the loader for those tables, the crashed
> backend causes the loader to get stuck, so the tables should be untouched since
> the crash, should it be desirable to inspect them.
>

It's a bit difficult to guess what went wrong from this backtrace. For
me gdb typically prints a bunch of lines immediately before the frames,
explaining what went wrong - not sure why it's missing here.

Perhaps some of those pointers are bogus, the memory was already
pfree-d or something like that. You'll have to poke around and try
dereferencing the pointers to find what works and what does not.

For example what do these gdb commands do in the #0 frame?

(gdb) p *(MemoryContext)context
(gdb) p *GetMemoryChunkContext(pointer)

> #0  pfree (pointer=0x298c740) at mcxt.c:954
>         context = 0x7474617261763a20
> #1  0x00000000006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at autovacuum.c:2676
>         cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull fal"...
>         cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyv"...
>         cur_relname = 0x298cd68 "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
>         __func__ = "perform_work_item"
> #2  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
> ...

cheers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
> > Also notice the vacuum process was interrupted, same as yesterday (thank
> > goodness for full logs).  Our INSERT script is using python
> > multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> > one file and then exit the subprocess, and pool() creates a new subproc, which
> > starts a new PG session and transaction.  Which explains why autovacuum starts
> > processing the table only to be immediately interrupted.

On Sun, Oct 15, 2017 at 01:57:14AM +0200, Tomas Vondra wrote:
> I don't follow. Why does it explain that autovacuum gets canceled? I
> mean, merely opening a new connection/session should not cancel
> autovacuum. That requires a command that requires a table-level lock
> conflicting with autovacuum (so e.g. explicit LOCK command, DDL, ...).

I was thinking that INSERT would do it, but I gather you're right about
autovacuum.  Let me get back to you about this..

> > Due to a .."behavioral deficiency" in the loader for those tables, the crashed
> > backend causes the loader to get stuck, so the tables should be untouched since
> > the crash, should it be desirable to inspect them.
>
> It's a bit difficult to guess what went wrong from this backtrace. For
> me gdb typically prints a bunch of lines immediately before the frames,
> explaining what went wrong - not sure why it's missing here.

Do you mean this ?

...
Loaded symbols for /lib64/libnss_files-2.12.so
Core was generated by `postgres: autovacuum worker process   gtt '.
Program terminated with signal 11, Segmentation fault.
#0  pfree (pointer=0x298c740) at mcxt.c:954
954             (*context->methods->free_p) (context, pointer);

> Perhaps some of those pointers are bogus, the memory was already pfree-d
> or something like that. You'll have to poke around and try dereferencing
> the pointers to find what works and what does not.
>
> For example what do these gdb commands do in the #0 frame?
>
> (gdb) p *(MemoryContext)context

(gdb) p *(MemoryContext)context
Cannot access memory at address 0x7474617261763a20

> (gdb) p *GetMemoryChunkContext(pointer)

(gdb) p *GetMemoryChunkContext(pointer)
No symbol "GetMemoryChunkContext" in current context.

I had to do this since it's apparently inlined/macro:

(gdb) p *(MemoryContext *) (((char *) pointer) - sizeof(void *))
$8 = (MemoryContext) 0x7474617261763a20

I uploaded the corefile:
http://telsasoft.com/tmp/coredump-postgres-autovacuum-brin-summarize.gz

Justin
Hi,

On 10/15/2017 03:56 AM, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
...
>> It's a bit difficult to guess what went wrong from this backtrace. For
>> me gdb typically prints a bunch of lines immediately before the frames,
>> explaining what went wrong - not sure why it's missing here.
>
> Do you mean this ?
>
> ...
> Loaded symbols for /lib64/libnss_files-2.12.so
> Core was generated by `postgres: autovacuum worker process   gtt '.
> Program terminated with signal 11, Segmentation fault.
> #0  pfree (pointer=0x298c740) at mcxt.c:954
> 954             (*context->methods->free_p) (context, pointer);
>

Yes. So either 'context' is bogus. Or 'context->methods' is bogus. Or
'context->methods->free_p' is bogus.

>> Perhaps some of those pointers are bogus, the memory was already pfree-d
>> or something like that. You'll have to poke around and try dereferencing
>> the pointers to find what works and what does not.
>>
>> For example what do these gdb commands do in the #0 frame?
>>
>> (gdb) p *(MemoryContext)context
>
> (gdb) p *(MemoryContext)context
> Cannot access memory at address 0x7474617261763a20
>

OK, this means the memory context pointer (tracked in the header of a
chunk) is bogus. There are multiple common ways how that could happen:

* Something corrupts memory (typically out-of-bounds write).

* The pointer got allocated in an incorrect memory context (which then
  was released, and the memory was reused for new stuff).

* It's a use-after-free.

* ... various other possibilities ...

>
> I uploaded the corefile:
> http://telsasoft.com/tmp/coredump-postgres-autovacuum-brin-summarize.gz
>

Thanks, but I'm not sure that'll help, at this point. We already know
what happened (corrupted memory), we don't know "how". And core files
are mostly just "snapshots" so are not very useful in answering that :-(

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 14, 2017 at 08:56:56PM -0500, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
> > > Also notice the vacuum process was interrupted, same as yesterday (thank
> > > goodness for full logs).  Our INSERT script is using python
> > > multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> > > one file and then exit the subprocess, and pool() creates a new subproc, which
> > > starts a new PG session and transaction.  Which explains why autovacuum starts
> > > processing the table only to be immediately interrupted.
>
> On Sun, Oct 15, 2017 at 01:57:14AM +0200, Tomas Vondra wrote:
> > I don't follow. Why does it explain that autovacuum gets canceled? I
> > mean, merely opening a new connection/session should not cancel
> > autovacuum. That requires a command that requires a table-level lock
> > conflicting with autovacuum (so e.g. explicit LOCK command, DDL, ...).
>
> I was thinking that INSERT would do it, but I gather you're right about
> autovacuum.  Let me get back to you about this..

I confirmed that we're taking an explicit lock before creating new child
tables (as I recall, to avoid errors in the logs shortly after midnight when
multiple subprocesses see data for the new date for the first time):

2017-10-15 12:52:50.499-04 | 59e3925e.6951 | statement: LOCK TABLE cdrs_huawei_sgsnPDPRecord IN SHARE UPDATE EXCLUSIVE MODE

Probably we can improve that with LOCK TABLE ONLY.

Justin
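(A lock that targets only the parent table rather than the whole inheritance tree would look roughly like the sketch below, using the same table and lock mode as in the log line above; whether this is actually sufficient for the loader's purpose is an assumption, not something verified in the thread:)

BEGIN;
-- lock only the parent, not its child tables, while creating the day's new child
LOCK TABLE ONLY cdrs_huawei_sgsnPDPRecord IN SHARE UPDATE EXCLUSIVE MODE;
-- e.g. CREATE TABLE ... () INHERITS (cdrs_huawei_sgsnPDPRecord); then
COMMIT;

SHARE UPDATE EXCLUSIVE conflicts with itself and with the lock taken by VACUUM/ANALYZE, so locking the whole hierarchy (without ONLY) is what cancels autovacuum on the child tables.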
On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> Thanks, but I'm not sure that'll help, at this point. We already know
> what happened (corrupted memory), we don't know "how". And core files
> are mostly just "snapshots" so are not very useful in answering that :-(

Is there anything I should be saving for these or hints how else to debug?

I just got a segfault while looking for logs from a segfault from 2hr
earlier... but it was a duplicate of the first stacktrace...

< 2017-10-16 23:21:23.172 -04 >LOG: server process (PID 31543) was terminated by signal 11: Segmentation fault
< 2017-10-16 23:21:23.172 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 162175

Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 sp:7ffe08a94890 error:0 in postgres[400000+692000]
...
Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 000000000084a177 sp 00007ffe08a94a88 error 4 in postgres[400000+692000]
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected
Oct 17 01:59:18 database abrtd: Generating core_backtrace
Oct 17 01:59:18 database abrtd: Duplicate: core backtrace
Oct 17 01:59:18 database abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500
Oct 17 01:59:18 database abrtd: Deleting problem directory ccpp-2017-10-17-01:58:38-8646 (dup of ccpp-2017-10-14-18:05:35-26500)

postgres=# SELECT * FROM postgres_log_2017_10_16_2300 WHERE pid=31543;
-[ RECORD 1 ]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 2
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54376
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_eric_egsnpdprecord_2017_10_15"
-[ RECORD 2 ]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 3
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54377
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | processing work entry for relation "gtt.public.cdrs_eric_msc_voice_2017_10_16_startofcharge_idx"
-[ RECORD 3 ]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:21.003-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 1
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54374
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_eric_ggsnpdprecord_2017_10_15"

Note, the most recent LOCK TABLE I could find before 23:21:22 was actually
~10min earlier at 2017-10-16 23:12:16.519-04.

Core was generated by `postgres: autovacuum worker process   gtt '.
Program terminated with signal 11, Segmentation fault.
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
178             LockRelId   relid = relation->rd_lockInfo.lockRelId;
Missing separate debuginfos, use: debuginfo-install audit-libs-2.4.5-6.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libgcc-4.4.7-18.el6.x86_64 libicu-4.2.1-14.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libstdc++-4.4.7-18.el6.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 pam-1.1.1-24.el6.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) p relation
$1 = (struct RelationData *) 0x324647603246466
(gdb) p relation->rd_lockInfo
Cannot access memory at address 0x3246476032464aa

(gdb) bt
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1190
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1694
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
#10 <signal handler called>
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228

#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
        relid = {relId = 21345368, dbId = 0}
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
No locals.
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1190
        cur_skey = {{sk_flags = 0, sk_attno = 1, sk_strategy = 3, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x7a8d50 <oideq>, fn_oid = 184, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x13cf3d8, fn_expr = 0x0}, sk_argument = 618293329}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}}
        hashValue = 884019
        hashIndex = 51
        iter = {cur = <value optimized out>, end = 0x14276a8}
        bucket = 0x14276a8
        ct = <value optimized out>
        relation = 0x7fd6b8788e58
        scandesc = 0x145b3c8
        ntp = <value optimized out>
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730
        tp = <value optimized out>
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609
        cur_datname = 0x0
        cur_nspname = 0x0
        cur_relname = 0x0
        __func__ = "perform_work_item"
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
        workitem = 0x7fd7c7f94860
        classRel = 0x7fd6b8788e58
        tuple = <value optimized out>
        relScan = <value optimized out>
        dbForm = <value optimized out>
        table_oids = <value optimized out>
        orphan_oids = 0x0
        ctl = {num_partitions = 0, ssize = 0, dsize = 0, max_dsize = 0, ffactor = 0, keysize = 4, entrysize = 80, hash = 0, match = 0, keycopy = 0, alloc = 0, hcxt = 0x0, hctl = 0x0}
        table_toast_map = 0x14d5228
        cell = 0x0
        shared = 0x1499eb8
        dbentry = 0x149a140
        bstrategy = 0x156ef38
        key = {sk_flags = 0, sk_attno = 16, sk_strategy = 3, sk_subtype = 0, sk_collation = 100, sk_func = {fn_addr = 0x750430 <chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x1400838, fn_expr = 0x0}, sk_argument = 116}
        pg_class_desc = 0x14c3748
        effective_multixact_freeze_max_age = 400000000
        did_vacuum = <value optimized out>
        found_concurrent_worker = <value optimized out>
        i = <value optimized out>
        __func__ = "do_autovacuum"
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1694
        dbname = "gtt\000\000\000\000\000\325I\001\327;\000\000\000\067{\000\000\000\000\000\000E\352\200\327;\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\374\024\316", '\000' <repeats 12 times>
        local_sigjmp_buf = {{__jmpbuf = {0, -6625136415178345871, 20928464, 1508210280, 31539, 1508209622, 6625596434918027889, -6625136601089035663}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964103, 13184024, 1, 13182664, 0, 31539, 1508209622, 0, 257010229664, 1, 0, 140729043734800, 257014425560, 4, 0, 4}}}}
        dbid = 16400
        __func__ = "AutoVacWorkerMain"
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
        worker_pid = 0
        __func__ = "StartAutoVacWorker"
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
        bn = 0x13f57d0
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
        save_errno = 4
        __func__ = "sigusr1_handler"
#10 <signal handler called>
No symbol table info available.
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
No locals.
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
        timeout = {tv_sec = 59, tv_usec = 981940}
        rmask = {fds_bits = {120, 0 <repeats 15 times>}}
        selres = <value optimized out>
        now = <value optimized out>
        readmask = {fds_bits = {120, 0 <repeats 15 times>}}
        nSockets = 7
        last_lockfile_recheck_time = 1508210280
        last_touch_time = 1508209622
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
        opt = <value optimized out>
        status = <value optimized out>
        userDoption = <value optimized out>
        listen_addr_saved = <value optimized out>
        i = <value optimized out>
        output_config_variable = <value optimized out>
        __func__ = "PostmasterMain"
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228
Justin Pryzby wrote:
> On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> > Thanks, but I'm not sure that'll help, at this point. We already know
> > what happened (corrupted memory), we don't know "how". And core files
> > are mostly just "snapshots" so are not very useful in answering that :-(
>
> Is there anything I should be saving for these or hints how else to debug?

Sorry, I had missed this report.  I'll look into it today.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Justin Pryzby wrote:

> #1  0x00000000006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at autovacuum.c:2676
>         cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull fal"...
>         cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyv"...
>         cur_relname = 0x298cd68 "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
>         __func__ = "perform_work_item"

cur_datname here seems corrupted -- it points halfway into cur_nspname,
which is also a corrupt value.  And I think that's because we're not
checking that the namespace OID is a valid value before calling
get_namespace_name on it.  And I'm betting that these values are all not
what we expect, because we're not checking that we're in the correct
database before trying to execute the work item.  I don't quite
understand how this results in an invalid string rather than just a
NULL, as I would have expected.

Anyway, can give this patch a try?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
> Anyway, can give this patch a try?

I've only compiled postgres once before and this is a production environment
(although nothing so important that the crashes are a serious concern either).

Is it reasonable to wget the postgres tarball, apply the patch, and run the
compiled postgres binary from the source tree, without running make install or
similar ?  Otherwise, would it be good enough to copy the postgres binary to
/usr/pgsql-10/bin (and reinstall the binary package later) ?

Thanks,
Justin
On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
> Justin Pryzby wrote:
>
> > #1  0x00000000006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at autovacuum.c:2676
> >         cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull fal"...
> >         cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyv"...
> >         cur_relname = 0x298cd68 "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
> >         __func__ = "perform_work_item"
>
> cur_datname here seems corrupted -- it points halfway into cur_nspname,
> which is also a corrupt value.  And I think that's because we're not
> checking that the namespace OID is a valid value before calling
> get_namespace_name on it.  And I'm betting that these values are all not
> what we expect, because we're not checking that we're in the correct
> database before trying to execute the work item.  I don't quite
> understand how this results in an invalid string rather than just a
> NULL, as I would have expected.

I'm happy to try the patch, but in case it makes any difference, we have few
DBs/schemas:

postgres=# \dn
  List of schemas
  Name  |  Owner
--------+----------
 public | postgres
(1 row)

postgres=# \l+
                                                               List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |  Size   | Tablespace |            Description
-----------+----------+----------+-------------+-------------+-----------------------+---------+------------+--------------------------------------------
 gtt       | pryzbyj  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 2215 GB | pg_default |
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 76 GB   | pg_default | default administrative connection database
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 7625 kB | pg_default | unmodifiable empty database
           |          |          |             |             | postgres=CTc/postgres |         |            |
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | postgres=CTc/postgres+| 7485 kB | pg_default | default template for new databases
           |          |          |             |             | =c/postgres           |         |            |
(4 rows)

Note that db:postgres is of substantial size since I set log_statement=all and
log_destination=stderr,csv, and import all CSVs into a table hierarchy in
db:postgres.  There are 3 BRIN indices per (daily child) table in db:gtt (and
none in postgres, although that's probably a good idea for the timestamp column
once all our customers are upgraded to PG10).

Justin
On 10/17/2017 02:29 PM, Justin Pryzby wrote:
> On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
>> Anyway, can give this patch a try?
>
> I've only compiled postgres once before and this is a production environment
> (although nothing so important that the crashes are a serious concern either).
>
> Is it reasonable to wget the postgres tarball, apply the patch, and run the
> compiled postgres binary from the source tree, without running make install or
> similar ?  Otherwise, would it be good enough to copy the postgres binary to
> /usr/pgsql-10/bin (and reinstall the binary package later) ?
>

You don't have to install the binaries to the same location, i.e. you can
keep both the current and modified binaries:

    ./configure --prefix=/path/to/alternative/binaries --enable-debug CFLAGS="..."

To get the same compilation options you can run pg_config, look for the
CONFIGURE line and then just add the --prefix option.

And after `make install` you can add it to $PATH and start the server
using those binaries:

    $ export PATH=/path/to/alternative/binaries/bin:$PATH
    $ which pg_ctl
    $ pg_ctl -D $DATADIR stop
    $ pg_ctl -D $DATADIR start

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
>> Anyway, can give this patch a try?

> I've only compiled postgres once before and this is a production environment
> (although nothing so important that the crashes are a serious concern either).

> Is it reasonable to wget the postgres tarball, apply the patch, and run the
> compiled postgres binary from the source tree, without running make install or
> similar ?  Otherwise, would it be good enough to copy the postgres binary to
> /usr/pgsql-10/bin (and reinstall the binary package later) ?

The trick in this sort of situation is to make sure you build binaries
that match your existing install in every way except having the added
patch, and maybe getting installed into a different directory.

So: where did you get the existing binaries?  If it's from some vendor
packaging system, what you should do is fetch the package source, add
the patch to the probably-nonempty set of patches the vendor is applying,
and rebuild your own custom package version.  If you haven't done that
before, it's a good skill to acquire ...

			regards, tom lane
On Tue, Oct 17, 2017 at 09:34:24AM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
> >> Anyway, can give this patch a try?
>
> > I've only compiled postgres once before and this is a production environment
> > (although nothing so important that the crashes are a serious concern either).
>
> > Is it reasonable to wget the postgres tarball, apply the patch, and run the
> > compiled postgres binary from the source tree, without running make install or
> > similar ?  Otherwise, would it be good enough to copy the postgres binary to
> > /usr/pgsql-10/bin (and reinstall the binary package later) ?
>
> The trick in this sort of situation is to make sure you build binaries
> that match your existing install in every way except having the added
> patch, and maybe getting installed into a different directory.
>
> So: where did you get the existing binaries?  If it's from some vendor
> packaging system, what you should do is fetch the package source, add
> the patch to the probably-nonempty set of patches the vendor is applying,
> and rebuild your own custom package version.  If you haven't done that
> before, it's a good skill to acquire ...

I'm familiar with that process; but, these are PG10 binaries from PGDG for
centos6 x64.  I had to add symlinks for the postgis library, but otherwise it
seems to be working fine (although I didn't preserve as many configure options
as your message would suggest I should have).

Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Tue, Oct 17, 2017 at 09:34:24AM -0400, Tom Lane wrote:
>> So: where did you get the existing binaries?  If it's from some vendor
>> packaging system, what you should do is fetch the package source, add
>> the patch to the probably-nonempty set of patches the vendor is applying,
>> and rebuild your own custom package version.  If you haven't done that
>> before, it's a good skill to acquire ...

> I'm familiar with that process; but, these are PG10 binaries from PGDG for
> centos6 x64.

Well, I'm pretty sure Devrim builds those using the RPM build process.
I'd have grabbed his SRPM and proceeded as above.

			regards, tom lane
Justin Pryzby wrote:
> I'm happy to try the patch, but in case it makes any difference, we have few
> DBs/schemas:

I don't expect that it does.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 17 October 2017 at 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
>> On Tue, Oct 17, 2017 at 09:34:24AM -0400, Tom Lane wrote:
>>> So: where did you get the existing binaries?  If it's from some vendor
>>> packaging system, what you should do is fetch the package source, add
>>> the patch to the probably-nonempty set of patches the vendor is applying,
>>> and rebuild your own custom package version.  If you haven't done that
>>> before, it's a good skill to acquire ...
>
>> I'm familiar with that process; but, these are PG10 binaries from PGDG for
>> centos6 x64.
>
> Well, I'm pretty sure Devrim builds those using the RPM build process.
> I'd have grabbed his SRPM and proceeded as above.

Yep, or unpack the tarball and apply the patch, then build with the same
configure options as you find in the rpm spec.  That can be easier when
iterating tests and builds.

Since the patches are separate, you can skip the tarball and clone the same
tag from git instead.  Then apply the rpm patches as separate commits.
That's typically what I'll do, makes it easier to keep track.

--
Craig Ringer                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> cur_datname here seems corrupted -- it points halfway into cur_nspname,
> which is also a corrupt value.

Yeah.

> And I think that's because we're not
> checking that the namespace OID is a valid value before calling
> get_namespace_name on it.

No, because get_namespace_name should return NULL if it doesn't find any
pg_namespace entry.  That would work just as well for InvalidOid as for
any other nonexistent schema OID.

The part of your patch that adds a check on avw_database is clearly
correct and necessary.  I'm thinking the change you propose in
perform_work_item is just overcomplicating code that's okay as it
stands.  We don't need to optimize for the schema-went-away case.

What I'm suspicious of as the actual bug cause is the comment in
perform_work_item about how we need to be sure that we're allocating these
strings in a long-lived context.  If, in fact, they were allocated in some
context that could get reset during the PG_TRY (particularly in the
error-cleanup path, which I bet we don't test), then the observed symptom
that the pointers seem to be pointing at garbage could be explained.

So what I'm thinking is that you need an error during perform_work_item,
and/or more than one work_item picked up in the calling loop, to make this
bug manifest.  You would need to enter perform_work_item in a
non-long-lived context, so the first time through the loop is probably
safe anyway.

BTW, it seems like the "Perform operations on collected tables." loop in
do_autovacuum has probably got similar latent bugs.  We take care to enter
it in AutovacMemCxt initially, but it looks to me like subsequent
iterations probably start out in some transaction context, because the
PG_TRY around autovacuum_do_vac_analyze doesn't do anything about
switching back to AutovacMemCxt.  There needs to be a bit more clarity
throughout this code about what CurrentMemoryContext ought to be at each
point.  Appropriate fixes might involve switching back to AutovacMemCxt
after each of those PG_TRY blocks.

			regards, tom lane
On Tue, Oct 17, 2017 at 09:07:40AM -0500, Justin Pryzby wrote:
> On Tue, Oct 17, 2017 at 09:34:24AM -0400, Tom Lane wrote:
> > Justin Pryzby <pryzby@telsasoft.com> writes:
> > > On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
> > >> Anyway, can give this patch a try?
> >
> > The trick in this sort of situation is to make sure you build binaries
> > that match your existing install in every way except having the added
> > patch, and maybe getting installed into a different directory.
>
> I'm familiar with that process; but, these are PG10 binaries from PGDG for
> centos6 x64.  I had to add symlinks for the postgis library, but otherwise it
> seems to be working fine (although I didn't preserve as many configure options
> as your message would suggest I should have).

On Tue, Oct 17, 2017 at 12:49:55PM -0400, Tom Lane wrote:
> So what I'm thinking is that you need an error during perform_work_item,
> and/or more than one work_item picked up in the calling loop, to make this
> bug manifest.  You would need to enter perform_work_item in a

..in our case probably due to interruption by LOCK TABLE, yes?

On Tue, Oct 17, 2017 at 12:49:55PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > And I think that's because we're not
> > checking that the namespace OID is a valid value before calling
> > get_namespace_name on it.
>
> The part of your patch that adds a check on avw_database is clearly
> correct and necessary.  I'm thinking the change you propose in
> perform_work_item is just overcomplicating code that's okay as it
> stands.  We don't need to optimize for the schema-went-away case.

No crashes in ~28hr.  It occurs to me that it's a weaker test due to not
preserving most compilation options.

If I understand, our crash isn't explained by the avw_database test anyway (?)

Should I make clean and recompile with all non-prefix options and a minimal
patch (avw_database==MyDatabaseId || continue) ?

Or recompile with existing options but no patch to first verify crash occurs
with locally compiled binary ?

Justin
Justin Pryzby wrote: > No crashes in ~28hr. It occurs to me that it's a weaker test due to not > preserving most compilation options. And the previous code crashes in 45 minutes? That's solid enough for me; I'll clean up the patch and push in the next few days. I think what you have now should be sufficient for the time being for your production system. > If I understand, our crash isn't explained by the avw_database test > anyway (?) I don't see why you would think that -- I disagree. > Should I make clean and recompile with all non-prefix options and a minimal > patch (avw_database==MyDatabaseId || continue) ? > > Or recompile with existing options but no patch to first verify crash occurs > with locally compiled binary? I don't think either of these actions is necessary. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 18, 2017 at 06:54:09PM +0200, Alvaro Herrera wrote:
> Justin Pryzby wrote:
>
> > No crashes in ~28hr.  It occurs to me that it's a weaker test due to not
> > preserving most compilation options.
>
> And the previous code crashes in 45 minutes?  That's solid enough for me;
> I'll clean up the patch and push in the next few days.  I think what you
> have now should be sufficient for the time being for your production
> system.

No - the crash happened 4 times since adding BRIN+autosummarize 6 days ago,
and in one instance occurred twice within 3 hours (while I was trying to query
logs for the preceding crash).

[pryzbyj@database ~]$ sudo grep -hE 'in postgres|Saved core' /var/log/messages*
Oct 13 17:22:45 database kernel: postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]
Oct 13 17:22:47 database abrt[32387]: Saved core dump of pid 32127 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-13-17:22:47-32127 (15040512 bytes)
Oct 14 18:05:35 database kernel: postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in postgres[400000+692000]
Oct 14 18:05:35 database abrt[27564]: Saved core dump of pid 26500 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500 (24137728 bytes)
Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 sp:7ffe08a94890 error:0 in postgres[400000+692000]
Oct 16 23:21:22 database abrt[570]: Saved core dump of pid 31543 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-16-23:21:22-31543 (25133056 bytes)
Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 000000000084a177 sp 00007ffe08a94a88 error 4 in postgres[400000+692000]
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)

> > If I understand, our crash isn't explained by the avw_database test
> > anyway (?)
>
> I don't see why you would think that -- I disagree.

No problem - apparently I read too far into Tom's thoughts regarding the
memory context.

I'll continue running with the existing patch and come back if the issue
recurs.

Thanks,
Justin
Justin Pryzby wrote:
> On Wed, Oct 18, 2017 at 06:54:09PM +0200, Alvaro Herrera wrote:
> > And the previous code crashes in 45 minutes?  That's solid enough for me;
> > I'll clean up the patch and push in the next few days.  I think what you
> > have now should be sufficient for the time being for your production
> > system.
>
> No - the crash happened 4 times since adding BRIN+autosummarize 6 days ago,
> and in one instance occurred twice within 3 hours (while I was trying to query
> logs for the preceding crash).

Oh, okay.  Then we don't know enough yet, ISTM.

> [pryzbyj@database ~]$ sudo grep -hE 'in postgres|Saved core' /var/log/messages*
> Oct 13 17:22:45 database kernel: postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]
> Oct 13 17:22:47 database abrt[32387]: Saved core dump of pid 32127 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-13-17:22:47-32127 (15040512 bytes)
> Oct 14 18:05:35 database kernel: postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in postgres[400000+692000]
> Oct 14 18:05:35 database abrt[27564]: Saved core dump of pid 26500 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500 (24137728 bytes)
> Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 sp:7ffe08a94890 error:0 in postgres[400000+692000]
> Oct 16 23:21:22 database abrt[570]: Saved core dump of pid 31543 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-16-23:21:22-31543 (25133056 bytes)
> Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 000000000084a177 sp 00007ffe08a94a88 error 4 in postgres[400000+692000]
> Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)

Do you still have those core dumps?  If so, would you please verify the
database that autovacuum was running in?  Just open each with gdb (using
the original postgres binary, not the one you just installed) and do
"print MyDatabaseId".

> I'll continue running with the existing patch and come back if the issue
> recurs.

Thanks.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Oct 18, 2017 at 07:22:27PM +0200, Alvaro Herrera wrote:
> Do you still have those core dumps?  If so, would you please verify the
> database that autovacuum was running in?  Just open each with gdb (using
> the original postgres binary, not the one you just installed) and do
> "print MyDatabaseId".

[pryzbyj@database ~]$ gdb ccpp-2017-10-16-23:21:22-31543/coredump -ex 'p MyDatabaseId' -ex q 2>/dev/null |tail -5
Core was generated by `postgres: autovacuum worker process   gtt '.
Program terminated with signal 11, Segmentation fault.
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
178             LockRelId   relid = relation->rd_lockInfo.lockRelId;
$1 = 16400

[pryzbyj@database ~]$ gdb ccpp-2017-10-14-18:05:35-26500/coredump -ex 'p MyDatabaseId' -ex q 2>/dev/null |tail -5
Core was generated by `postgres: autovacuum worker process   gtt '.
Program terminated with signal 11, Segmentation fault.
#0  pfree (pointer=0x298c740) at mcxt.c:954
954             (*context->methods->free_p) (context, pointer);
$1 = 16400

gtt=# SELECT oid,datname FROM pg_database;
 13456 | template0
 16400 | gtt
 13457 | postgres
     1 | template1

The gtt DB is where the (only) BRIN indices are (not sure what to conclude
from that?)

Justin
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> And the previous code crashes in 45 minutes?  That's solid enough for me;
> I'll clean up the patch and push in the next few days.  I think what you
> have now should be sufficient for the time being for your production
> system.

I'm still of the opinion that the presented patch isn't preventing any
crashes.  The failure to check avw_database would generally lead to tasks
silently getting dropped in the mistaken belief that the target table has
gone away.  I could believe that a crash occurs if the given schema OID
exists in some other DB, and the given table OID does too, but it's for a
table/index of the wrong kind ... but how likely is that?  Not very likely
at all, given the way we generate OIDs.  (And if it did happen, I'd say
the cause is a failure to recheck relkind somewhere downstream of the
autovac task dispatch code, anyway.)

			regards, tom lane
Tom Lane wrote:

> What I'm suspicious of as the actual bug cause is the comment in
> perform_work_item about how we need to be sure that we're allocating these
> strings in a long-lived context.  If, in fact, they were allocated in some
> context that could get reset during the PG_TRY (particularly in the
> error-cleanup path, which I bet we don't test), then the observed symptom
> that the pointers seem to be pointing at garbage could be explained.
>
> So what I'm thinking is that you need an error during perform_work_item,
> and/or more than one work_item picked up in the calling loop, to make this
> bug manifest.  You would need to enter perform_work_item in a
> non-long-lived context, so the first time through the loop is probably
> safe anyway.

I created a reproducer for this bug, by 1) adding some sleeps to make the
summarization process take longer, 2) injecting errors randomly during the
summarization run, 3) inserting lots of rows using the attached pgbench
script running with 8 clients (creation script below).  Takes less than a
second to crash.

What is happening is that we're freeing the strings (allocated in
TopTransactionContext) even in the case where the PG_CATCH block aborts the
transaction, which is obviously bogus.  I moved the frees to inside the
PG_TRY block and no crash occurs (I didn't like a 'goto' from outside to
inside the PG_TRY block, so I duplicated those lines instead).

I propose the attached patch.

Before pushing, I'll give a look to the regular autovacuum path to see if
it needs a similar fix.

Initialization:

CREATE TABLE t (a integer);
CREATE INDEX t_a_idx ON t USING brin (a) WITH (autosummarize='true', pages_per_range='16');

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
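(The pgbench script itself is only in the attachment and isn't reproduced in the archive; an insert-only script in the same spirit, run against the table created by the initialization above, could be as simple as the following -- the batch size and value range here are made up, and the pgbench invocation is just one plausible way to drive it with 8 clients:)

-- insert.sql, run e.g. with: pgbench -n -c 8 -T 60 -f insert.sql
INSERT INTO t SELECT (random() * 1000000)::int FROM generate_series(1, 1000);

Each batch fills BRIN page ranges (pages_per_range=16 above), queueing autosummarization work items for autovacuum workers to pick up while the errors injected into summarization exercise the PG_CATCH path.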
Alvaro Herrera wrote:

> Before pushing, I'll give a look to the regular autovacuum path to see
> if it needs a similar fix.

Reading that one, my conclusion is that it doesn't have the same problem,
because the strings are allocated in AutovacMemCxt, which is not reset by
error recovery.  This gave me the idea to use that context instead of
TopTransactionContext to store the strings in workitem processing also.
The patch I propose now is attached.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>
> > Before pushing, I'll give a look to the regular autovacuum path to see
> > if it needs a similar fix.
>
> Reading that one, my conclusion is that it doesn't have the same problem,
> because the strings are allocated in AutovacMemCxt, which is not reset by
> error recovery.  This gave me the idea to use that context instead of
> TopTransactionContext to store the strings in workitem processing also.
> The patch I propose now is attached.

I ended up backpatching 335f3d04e4c8.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services