Completely broken replica after PANIC: WAL contains references to invalid pages - Mailing list pgsql-bugs
From | Sergey Konoplev |
---|---|
Subject | Completely broken replica after PANIC: WAL contains references to invalid pages |
Date | |
Msg-id | CAL_0b1t=WuM6roO8dki=w8DhH8P8whhohbPjReymmQUrOcNT2A@mail.gmail.com Whole thread Raw |
Responses |
Re: Completely broken replica after PANIC: WAL contains references to invalid pages
Re: Completely broken replica after PANIC: WAL contains references to invalid pages |
List | pgsql-bugs |
Hi all, A couple of days ago I found the replica stopped after the PANIC message: PANIC: WAL contains references to invalid pages When I tried to restart it I got this FATAL: FATAL: could not access status of transaction 280557568 Below is the description of the server and information from PostgreSQL and system logs. After googling the problem I have found nothing like this. Any thoughts of what it could be and how to prevent it in the future? Hardware: IBM System x3650 M4, 148GB RAM, NAS Software: PostgreSQL 9.2.3, yum.postgresql.org CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64 Configuration: listen_addresses = '*' max_connections = 550 shared_buffers = 35GB work_mem = 256MB maintenance_work_mem = 1GB bgwriter_delay = 10ms bgwriter_lru_multiplier = 10.0 effective_io_concurrency = 32 wal_level = hot_standby synchronous_commit = off checkpoint_segments = 1024 checkpoint_timeout = 1h checkpoint_completion_target = 0.9 checkpoint_warning = 5min max_wal_senders = 3 wal_keep_segments = 2048 hot_standby = on max_standby_streaming_delay = 5min hot_standby_feedback = on effective_cache_size = 133GB log_directory = '/var/log/pgsql' log_filename = 'postgresql-%Y-%m-%d.log' log_checkpoints = on log_line_prefix = '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] ' log_lock_waits = on log_statement = 'ddl' log_timezone = 'W-SU' track_activity_query_size = 4096 autovacuum_max_workers = 5 autovacuum_naptime = 5s autovacuum_vacuum_scale_factor = 0.05 autovacuum_analyze_scale_factor = 0.05 autovacuum_vacuum_cost_delay = 5ms datestyle = 'iso, dmy' timezone = 'W-SU' lc_messages = 'en_US.UTF-8' lc_monetary = 'ru_RU.UTF-8' lc_numeric = 'ru_RU.UTF-8' lc_time = 'ru_RU.UTF-8' default_text_search_config = 'pg_catalog.russian' System: # Controls the maximum shared segment size, in bytes kernel.shmmax = 53287555072 # Controls the maximum number of shared memory segments, in pages kernel.shmall = 13009657 # Maximum number of file-handles fs.file-max = 65535 # pdflush tuning to prevent lag spikes vm.dirty_ratio = 10 vm.dirty_background_ratio = 1 vm.dirty_expire_centisecs = 499 # Prevent the scheduler breakdown kernel.sched_migration_cost = 5000000 # Turned off to provide more CPU to PostgreSQL kernel.sched_autogroup_enabled = 0 # Setup hugepages vm.hugetlb_shm_group = 26 vm.hugepages_treat_as_movable = 0 vm.nr_overcommit_hugepages = 512 # The Huge Page Size is 2048kB, so for 35GB shared buffers the number is 17920 vm.nr_hugepages = 17920 # Turn off the NUMA local pages reclaim as it leads to wrong caching strategy for databases vm.zone_reclaim_mode = 0 Environment: HUGETLB_SHM=yes LD_PRELOAD='/usr/lib64/libhugetlbfs.so' export HUGETLB_SHM LD_PRELOAD When it is stopped: 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction log file(s) added, 0 removed, 555 recycled; write=3237.402 s, sync=0.071 s, total=3237.507 s; sync files=2673, longest=0.008 s, average=0.000 s 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: recovery restart point at 2538/6E154AC0 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] DETAIL: last completed transaction was at log time 2013-03-26 11:50:31.613948+04 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: restartpoint starting: xlog 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] WARNING: page 451 of relation base/16436/2686702648 is uninitialized 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, lastBlockVacuumed 0 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] PANIC: WAL contains references to invalid pages 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, lastBlockVacuumed 0 2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: startup process (PID 3773) was terminated by signal 6: Aborted 2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: terminating any other active server processes From /var/log/messages: Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure. order:8, mode:0xd0 Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted 2.6.32-279.22.1.el6.x86_64 #1 Mar 26 10:50:52 tms2 kernel: : Call Trace: Mar 26 10:50:52 tms2 kernel: : [<ffffffff8112343f>] ? __alloc_pages_nodemask+0x77f/0x940 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115d3e2>] ? kmem_getpages+0x62/0x170 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dffa>] ? fallback_alloc+0x1ba/0x270 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115da4f>] ? cache_grow+0x2cf/0x320 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dd79>] ? ____cache_alloc_node+0x99/0x160 Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? dma_pin_iovec_pages+0xb5/0x230 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115eb49>] ? __kmalloc+0x189/0x220 Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? dma_pin_iovec_pages+0xb5/0x230 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8141a47c>] ? lock_sock_nested+0xac/0xc0 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8146edaa>] ? tcp_recvmsg+0x4ca/0xe80 Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029a67e>] ? xfs_vm_write_end+0x2e/0x60 [xfs] Mar 26 10:50:52 tms2 kernel: : [<ffffffffa0292f39>] ? xfs_trans_unlocked_item+0x39/0x60 [xfs] Mar 26 10:50:52 tms2 kernel: : [<ffffffff8148f90a>] ? inet_recvmsg+0x5a/0x90 Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418b93>] ? sock_recvmsg+0x133/0x160 Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029f542>] ? xfs_rw_iunlock+0x32/0x40 [xfs] Mar 26 10:50:52 tms2 kernel: : [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40 Mar 26 10:50:52 tms2 kernel: : [<ffffffff81173fff>] ? __dentry_open+0x23f/0x360 Mar 26 10:50:52 tms2 kernel: : [<ffffffff811764ca>] ? do_sync_write+0xfa/0x140 Mar 26 10:50:52 tms2 kernel: : [<ffffffff811875d0>] ? do_filp_open+0x780/0xd60 Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418d0e>] ? sys_recvfrom+0xee/0x180 Mar 26 10:50:52 tms2 kernel: : [<ffffffff81183105>] ? putname+0x35/0x50 Mar 26 10:50:52 tms2 kernel: : [<ffffffff81176842>] ? vfs_write+0x132/0x1a0 Mar 26 10:50:52 tms2 kernel: : [<ffffffff810d3f47>] ? audit_syscall_entry+0x1d7/0x200 Mar 26 10:50:52 tms2 kernel: : [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b After restart: 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: database system was interrupted while in recovery at log time 2013-03-26 10:08:37 MSK 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: entering standby mode 2013-03-27 10:43:01 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: redo starts at 2538/6E154AC0 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file "pg_subtrans/10B7" doesn't exist, reading as zeroes 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo xid assignment xtop 280482186: subxacts: 280483328 280483333 280483343 280483354 280483356 280483368 280483377 280483382 280483392 280483404 280483416 280483429 280483440 280483451 280483460 280483472 280483487 280483500 280483516 280483530 280483541 280483555 280483565 280483574 280483585 280483595 280483604 280483607 280483617 280483626 280483636 280483646 280483656 280483665 280483677 280483688 280483699 280483709 280483719 280483730 280483739 280483749 280483759 280483761 280483771 280483782 280483799 280483800 280483811 280483821 280483824 280483836 280483847 280483859 280483871 280483874 280483883 280483897 280483906 280483915 280483925 280483937 280483948 280483958 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file "pg_subtrans/10B7" doesn't exist, reading as zeroes 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo xid assignment xtop 280482270: subxacts: 280485056 280485070 280485083 280485086 280485098 280485113 280485132 280485144 280485156 280485167 280485178 280485188 280485201 280485217 280485234 280485249 280485267 280485293 280485309 280485327 280485333 280485345 280485353 280485373 280485388 280485405 280485420 280485434 280485457 280485476 280485482 280485507 280485516 280485531 280485537 280485550 280485565 280485568 280485585 280485587 280485601 280485613 280485634 280485639 280485656 280485669 280485684 280485690 280485693 280485712 280485730 280485754 280485757 280485779 280485801 280485808 280485811 280485830 280485856 280485880 280485900 280485920 280485941 280485946 [ skipped several more messages of this kind] 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file "pg_subtrans/10B8" doesn't exist, reading as zeroes 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo xid assignment xtop 280549936: subxacts: 280555514 280555515 280555516 280555517 280555518 280555519 280555520 280555521 280555522 280555523 280555524 280555525 280555526 280555527 280555528 280555529 280555530 280555531 280555532 280555533 280555534 280555535 280555536 280555537 280555538 280555539 280555540 280555541 280555542 280555543 280555544 280555545 280555546 280555547 280555548 280555549 280555550 280555551 280555552 280555553 280555554 280555555 280555556 280555557 280555558 280555559 280555560 280555561 280555562 280555563 280555564 280555565 280555566 280555567 280555568 280555569 280555570 280555571 280555572 280555573 280555574 280555575 280555576 280555577 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] FATAL: could not access status of transaction 280557568 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] DETAIL: Could not read from file "pg_subtrans/10B8" at offset 253952: Success. 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: xlog redo xid assignment xtop 280555981: subxacts: 280557520 280557521 280557522 280557523 280557524 280557525 280557526 280557527 280557528 280557529 280557530 280557531 280557532 280557533 280557534 280557535 280557536 280557537 280557538 280557539 280557540 280557541 280557542 280557543 280557544 280557545 280557546 280557547 280557548 280557549 280557550 280557551 280557552 280557553 280557554 280557555 280557556 280557557 280557558 280557559 280557560 280557561 280557562 280557563 280557564 280557565 280557566 280557567 280557568 280557569 280557570 280557571 280557572 280557573 280557574 280557575 280557576 280557577 280557578 280557579 280557580 280557581 280557582 280557583 2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: startup process (PID 1535) exited with exit code 1 2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: terminating any other active server processes Thank you in advance. -- Kind regards, Sergey Konoplev Database and Software Consultant Profile: http://www.linkedin.com/in/grayhemp Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 Skype: gray-hemp Jabber: gray.ru@gmail.com
pgsql-bugs by date: