Thread: BUG #16331: segfault in checkpointer with full disk
The following bug has been logged on the website:

Bug reference:      16331
Logged by:          Jozef Mlich
Email address:      jmlich83@gmail.com
PostgreSQL version: 12.2
Operating system:   CentOS
Description:

I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7 (from
yum.postgresql.org). I am using multiple extensions (cstore, postgres_fdw,
pgcrypto, dblink, etc.). The crashes seem to be related to the disk running out
of space (I am using separate partitions for / and for /var/lib/pgsql). It
occurs a few times a day. According to the backtrace it seems to be related to
the checkpointer. Replication is not configured.

[New LWP 26290]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer   '.
Program terminated with signal 6, Aborted.
#0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);

Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
#0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 26290
        selftid = 26290
#1  0x00007fe4604c28f8 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717, 39808819211026438, 20126216749056, 70394513997832, 9268713, 70403103920719, 17316096998686159616, 20134806683648, 140618848608704, 140618848592800}}, sa_flags = 1615828275, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x000000000087840a in errfinish (dummy=<optimized out>) at elog.c:552
        edata = 0xd47040 <errordata>
        elevel = 22
        oldcontext = 0x171a6d0
        econtext = 0x0
        __func__ = "errfinish"
#3  0x0000000000706b24 in CheckPointReplicationOrigin () at origin.c:562
        tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
        path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
        tmpfd = <optimized out>
        i = <optimized out>
        magic = 307747550
        crc = 4294967295
        __func__ = "CheckPointReplicationOrigin"
#4  0x0000000000509766 in CheckPointGuts (checkPointRedo=243199514360, flags=flags@entry=256) at xlog.c:9007
No locals.
#5  0x0000000000510106 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
        shutdown = false
        checkPoint = {redo = 243199514360, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextFullXid = {value = 20251117}, nextOid = 9645082, nextMulti = 3, nextMultiOffset = 5, oldestXid = 479, oldestXidDB = 14186, oldestMulti = 1, oldestMultiDB = 1, time = 1585692001, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 20251117}
        recptr = <optimized out>
        _logSegNo = 0
        Insert = <optimized out>
        freespace = <optimized out>
        PriorRedoPtr = <optimized out>
        curInsert = 243199514360
        last_important_lsn = <optimized out>
        vxids = 0x17389d8
        nvxids = 0
        __func__ = "CreateCheckPoint"
#6  0x00000000006e4982 in CheckpointerMain () at checkpointer.c:481
        ckpt_performed = false
        do_restartpoint = <optimized out>
        flags = 256
        do_checkpoint = <optimized out>
        now = 1585692001
        elapsed_secs = <optimized out>
        cur_timeout = <optimized out>
        local_sigjmp_buf = {{__jmpbuf = {140732342117584, -1758805613090239410, 2, 9211212, 0, 24183888, -1758805613174125490, 1759480033986153550}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964103, 0 <repeats 15 times>}}}}
        checkpointer_context = 0x171a6d0
        __func__ = "CheckpointerMain"
#7  0x000000000051fe75 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffecd42a480) at bootstrap.c:461
        progname = 0x8c8d4c "postgres"
        flag = <optimized out>
        userDoption = 0x0
        __func__ = "AuxiliaryProcessMain"
#8  0x00000000006eef70 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
        pid = <optimized out>
        av = {0x8c8d4c "postgres", 0x7ffecd42a4d0 "-x4", 0x0, 0x0, 0x16c92e0 "postgres-12", 0x5 <Address 0x5 out of bounds>, 0x7ffecd42a400 "L\215\214", 0x6 <Address 0x6 out of bounds>, 0x0, 0x7fe33f1daff8 "@NL`\344\177"}
        ac = 2
        typebuf = "-x4\000\376\177\000\000\000s\345\306y\"O\360\001\000\000\000]\201\240J\000\000\000\000\000\000\000"
#9  0x00000000006f02ba in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
        save_errno = 4
        pid = <optimized out>
        exitstatus = 0
        __func__ = "reaper"
#10 <signal handler called>
No locals.
#11 0x00007fe46057ff53 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
No locals.
#12 0x0000000000483484 in ServerLoop () at postmaster.c:1668
        timeout = {tv_sec = 59, tv_usec = 872935}
        rmask = {fds_bits = {120, 0 <repeats 15 times>}}
        selres = <optimized out>
        now = <optimized out>
        readmask = {fds_bits = {120, 0 <repeats 15 times>}}
        last_lockfile_recheck_time = 1585685433
        last_touch_time = 1585685433
        __func__ = "ServerLoop"
#13 0x00000000006f195f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x16c9300) at postmaster.c:1377
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = true
        i = <optimized out>
        output_config_variable = <optimized out>
        __func__ = "PostmasterMain"
#14 0x0000000000484dd3 in main (argc=3, argv=0x16c9300) at main.c:228
No locals.

Please let me know if I can provide some more information.

best regards,
Jozef Mlich
Hi,

On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      16331
> Logged by:          Jozef Mlich
> Email address:      jmlich83@gmail.com
> PostgreSQL version: 12.2
> Operating system:   CentOS
> Description:
>
> I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7 (from
> yum.postgresql.org). I am using multiple extensions (cstore, postgres_fdw,
> pgcrypto, dblink, etc.). The crashes seem to be related to the disk running
> out of space (I am using separate partitions for / and for /var/lib/pgsql).
> It occurs a few times a day. According to the backtrace it seems to be
> related to the checkpointer. Replication is not configured.
>
> [...]
>
> #2  0x000000000087840a in errfinish (dummy=<optimized out>) at elog.c:552
>         edata = 0xd47040 <errordata>
>         elevel = 22
>         oldcontext = 0x171a6d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x0000000000706b24 in CheckPointReplicationOrigin () at origin.c:562
>         tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
>         tmpfd = <optimized out>
>         i = <optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"

That's not a bug (nor a segfault) but the expected behavior when the
checkpointer is unable to do its work. Since data durability can't be
guaranteed in that case, the checkpointer raises a PANIC-level message, which
triggers an abort so that the whole instance goes through an emergency restart
cycle.

Do you have monitoring for this filesystem? Do you see spikes in disk usage or
other strange behavior?
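To make the mechanism Julien describes concrete: during a checkpoint the
replication-origin state is written through a temporary file (the
pg_logical/replorigin_checkpoint.tmp path visible in frame #3), and a failed or
short write there, for example ENOSPC on a full disk, is treated as fatal,
reported at PANIC level, and ends in abort(). The standalone C program below is
only a rough sketch of that pattern, not the actual PostgreSQL source; the file
name, flags, and messages are illustrative, with the magic value taken from the
backtrace.

/*
 * Standalone demonstration only -- NOT the PostgreSQL source.  It mimics the
 * pattern used by CheckPointReplicationOrigin(): write state through a
 * temporary file and treat any failed or short write (e.g. ENOSPC) as fatal,
 * aborting the process instead of continuing with a half-written file.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *tmppath = "replorigin_checkpoint.tmp"; /* illustrative path */
    unsigned int magic = 307747550;                    /* value seen in frame #3 */
    int         fd;

    fd = open(tmppath, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "PANIC:  could not create file \"%s\": %s\n",
                tmppath, strerror(errno));
        abort();                /* the postmaster would then restart the instance */
    }

    if (write(fd, &magic, sizeof(magic)) != (ssize_t) sizeof(magic))
    {
        /*
         * On a full filesystem this is where ENOSPC shows up.  (A short
         * write may leave errno unset; PostgreSQL maps that case to ENOSPC.)
         */
        fprintf(stderr, "PANIC:  could not write to file \"%s\": %s\n",
                tmppath, strerror(errno));
        abort();                /* same SIGABRT / signal 6 as in the core above */
    }

    close(fd);
    unlink(tmppath);
    return 0;
}

Run on a full filesystem, this should terminate with SIGABRT (signal 6), which
matches the "Program terminated with signal 6, Aborted" line in the core file
above; a genuine segfault would instead show signal 11 (SIGSEGV).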
On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> Hi,
>
> On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form wrote:
> > [...]
>
> That's not a bug (nor a segfault) but the expected behavior when the
> checkpointer is unable to do its work. Since data durability can't be
> guaranteed in that case, the checkpointer raises a PANIC-level message,
> which triggers an abort so that the whole instance goes through an
> emergency restart cycle.
>
> Do you have monitoring for this filesystem? Do you see spikes in disk
> usage or other strange behavior?

Then it is clear. Thanks for the explanation, and apologies for the false bug
report.

I had probably misunderstood how a segfault is distinguished from an abort;
I need to fix my kernel.core_pattern script.

Attached is a screenshot from our Grafana monitoring showing the free space on
the /var/lib/pgsql partition.

--
Jozef Mlich <jmlich83@gmail.com>
Attachment
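Regarding the kernel.core_pattern remark above: on Linux, the core_pattern
template can embed the number of the signal that produced the dump, which makes
it easy to tell an abort (SIGABRT, 6) apart from a segmentation fault (SIGSEGV,
11) from the core file name alone. A minimal sketch, assuming cores should land
in /var/crash (the file name and target directory are illustrative and the
directory must already exist):

# /etc/sysctl.d/50-coredump.conf  -- illustrative file name
# %e = executable name, %p = PID, %s = signal number, %t = time of dump
kernel.core_pattern = /var/crash/core.%e.%p.%s.%t

After reloading the setting (sysctl --system), the core from this incident
would have been named something like core.postgres.26290.6.<timestamp>, where
the 6 identifies SIGABRT rather than a real segfault.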
On Wed, Apr 1, 2020 at 11:51 AM Jozef Mlich <jmlich83@gmail.com> wrote:
>
> On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> > [...]
> >
> > That's not a bug (nor a segfault) but the expected behavior when the
> > checkpointer is unable to do its work. Since data durability can't be
> > guaranteed in that case, the checkpointer raises a PANIC-level message,
> > which triggers an abort so that the whole instance goes through an
> > emergency restart cycle.
> >
> > Do you have monitoring for this filesystem? Do you see spikes in disk
> > usage or other strange behavior?
>
> Then it is clear. Thanks for the explanation, and apologies for the false
> bug report.
>
> I had probably misunderstood how a segfault is distinguished from an
> abort; I need to fix my kernel.core_pattern script.
>
> Attached is a screenshot from our Grafana monitoring showing the free
> space on the /var/lib/pgsql partition.

So the main filesystem is full or almost full most of the time? That's
unfortunately a reliable way to trigger this kind of outage. Is that because
most of the data is on a different tablespace? Even in that case you need to
make sure the data directory's filesystem always keeps a reasonable amount of
free space.
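To check whether the bulk of the data really sits in a separate tablespace, and
therefore on a different filesystem, a query along these lines should work on
PostgreSQL 12 (a sketch; adjust as needed). An empty location means the
tablespace lives inside the data directory, i.e. on the /var/lib/pgsql
partition shown in the Grafana graph:

-- Filesystem location and total size of each tablespace.
SELECT spcname,
       pg_tablespace_location(oid) AS location,
       pg_size_pretty(pg_tablespace_size(oid)) AS size
FROM pg_tablespace;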