Re: BUG #16331: segfault in checkpointer with full disk - Mailing list pgsql-bugs

From Julien Rouhaud
Subject Re: BUG #16331: segfault in checkpointer with full disk
Date
Msg-id 20200401090455.GB82418@nol
In response to BUG #16331: segfault in checkpointer with full disk  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #16331: segfault in checkpointer with full disk
List pgsql-bugs
Hi,

On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      16331
> Logged by:          Jozef Mlich
> Email address:      jmlich83@gmail.com
> PostgreSQL version: 12.2
> Operating system:   CentOS
> Description:        
> 
> I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7 (from
> yum.postgresql.org). I am using multiple extensions  (cstore, postgres_fdw,
> pgcrypto,dblink, etc.). It seems crash is related to disk run out of space
> (I am using separate partion for / and for /var/lib/pgsql). It occurs few
> times a day. According to backtrace it seems to be related to checkpointer.
> Replication is not configured. 
> 
> 
> [New LWP 26290]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: checkpointer                               
>  '.
> Program terminated with signal 6, Aborted.
> #0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> 55      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> 
> Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
> #0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:55
>         resultvar = 0
>         pid = 26290
>         selftid = 26290
> #1  0x00007fe4604c28f8 in __GI_abort () at abort.c:90
>         save_stage = 2
>         act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0},
> sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717,
> 39808819211026438, 20126216749056, 70394513997832, 9268713, 70403103920719,
> 17316096998686159616, 20134806683648, 140618848608704, 140618848592800}},
> sa_flags = 1615828275, sa_restorer = 0x0}
>         sigs = {__val = {32, 0 <repeats 15 times>}}
> #2  0x000000000087840a in errfinish (dummy=<optimized out>) at elog.c:552
>         edata = 0xd47040 <errordata>
>         elevel = 22
>         oldcontext = 0x171a6d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x0000000000706b24 in CheckPointReplicationOrigin () at origin.c:562
>         tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
>         tmpfd = <optimized out>
>         i = <optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"


That's not a bug (nor a segfault) but the expected behavior when the
checkpointer is not able to do its work.  As data durability can't be
guaranteed in such a case, the checkpointer raises a PANIC-level message, which
calls abort() (hence the signal 6, SIGABRT, in your backtrace) so that the whole
instance goes through an emergency restart cycle.
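
To illustrate the general pattern, here is a minimal Python sketch of "write a
state file durably, and abort the process if that is not possible".  It is not
PostgreSQL's actual code (which is C, in origin.c); the function name
write_checkpoint_file, the file names "state" and "state.tmp", and the panic
parameter are all hypothetical and only there for illustration.

```python
import errno
import os


def write_checkpoint_file(directory, payload, panic=os.abort):
    """Write payload durably to directory/state; call panic() on I/O failure.

    Hypothetical sketch of the pattern only, not PostgreSQL's implementation.
    """
    tmp_path = os.path.join(directory, "state.tmp")  # hypothetical file names
    final_path = os.path.join(directory, "state")
    try:
        fd = os.open(tmp_path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
        try:
            written = os.write(fd, payload)
            if written != len(payload):
                # Treat a short write as "no space left on device".
                raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
            os.fsync(fd)  # make sure the data reached stable storage
        finally:
            os.close(fd)
        # Atomically replace the previous file with the new one.
        os.replace(tmp_path, final_path)
    except OSError as exc:
        # Durability cannot be guaranteed: report and abort (SIGABRT by default).
        print(f"PANIC: could not write {tmp_path}: {exc}")
        panic()
        return False  # only reached if panic() returns (e.g. in tests)
    return True
```

With the default panic=os.abort, a full disk ends the process with SIGABRT,
which is the kind of core dump shown in the report, not a segmentation fault.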

Do you have monitoring for this filesystem?  Do you see spikes in disk usage or
other strange behavior?
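
If nothing is in place yet, a very small check could look like the sketch
below.  It only uses Python's standard shutil.disk_usage(); the function name
check_free_space and the 10% threshold are arbitrary assumptions, and a real
monitoring system would normally do this for you.

```python
import shutil


def check_free_space(path, min_free_fraction=0.10):
    """Return (ok, free_fraction) for the filesystem containing path.

    The 10% default threshold is an arbitrary assumption for illustration.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat that as not ok.
        return False, 0.0
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction


# Example use, e.g. from cron, for the partition mentioned in the report:
#   ok, frac = check_free_space("/var/lib/pgsql")
#   if not ok:
#       print(f"WARNING: only {frac:.1%} free on /var/lib/pgsql")
```

Running it every minute or so and logging the result should be enough to see
whether the crashes line up with spikes in disk usage.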


