Re: Completely broken replica after PANIC: WAL contains references to invalid pages - Mailing list pgsql-bugs
From | anarazel@anarazel.de |
---|---|
Subject | Re: Completely broken replica after PANIC: WAL contains references to invalid pages |
Date | |
Msg-id | 675b7cee-b7f0-4e32-8e34-1efaf3ca5fe9@email.android.com Whole thread Raw |
In response to | Completely broken replica after PANIC: WAL contains references to invalid pages (Sergey Konoplev <gray.ru@gmail.com>) |
Responses |
Re: Completely broken replica after PANIC: WAL contains
references to invalid pages
|
List | pgsql-bugs |
Hi Sergey Konoplev <gray.ru@gmail.com> schrieb: >Hi all, > >A couple of days ago I found the replica stopped after the PANIC >message: > >PANIC: WAL contains references to invalid pages > >When I tried to restart it I got this FATAL: > >FATAL: could not access status of transaction 280557568 > >Below is the description of the server and information from PostgreSQL >and system logs. After googling the problem I have found nothing like >this. > >Any thoughts of what it could be and how to prevent it in the future? I think I See whats going on. Do you still have the datadir available? If= so, could you send the pg_controldata output? Andres >Hardware: > >IBM System x3650 M4, 148GB RAM, NAS > >Software: > >PostgreSQL 9.2.3, yum.postgresql.org >CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64 > >Configuration: > >listen_addresses =3D '*' >max_connections =3D 550 >shared_buffers =3D 35GB >work_mem =3D 256MB >maintenance_work_mem =3D 1GB >bgwriter_delay =3D 10ms >bgwriter_lru_multiplier =3D 10.0 >effective_io_concurrency =3D 32 >wal_level =3D hot_standby >synchronous_commit =3D off >checkpoint_segments =3D 1024 >checkpoint_timeout =3D 1h >checkpoint_completion_target =3D 0.9 >checkpoint_warning =3D 5min >max_wal_senders =3D 3 >wal_keep_segments =3D 2048 >hot_standby =3D on >max_standby_streaming_delay =3D 5min >hot_standby_feedback =3D on >effective_cache_size =3D 133GB >log_directory =3D '/var/log/pgsql' >log_filename =3D 'postgresql-%Y-%m-%d.log' >log_checkpoints =3D on >log_line_prefix =3D '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] ' >log_lock_waits =3D on >log_statement =3D 'ddl' >log_timezone =3D 'W-SU' >track_activity_query_size =3D 4096 >autovacuum_max_workers =3D 5 >autovacuum_naptime =3D 5s >autovacuum_vacuum_scale_factor =3D 0.05 >autovacuum_analyze_scale_factor =3D 0.05 >autovacuum_vacuum_cost_delay =3D 5ms >datestyle =3D 'iso, dmy' >timezone =3D 'W-SU' >lc_messages =3D 'en_US.UTF-8' >lc_monetary =3D 'ru_RU.UTF-8' >lc_numeric =3D 'ru_RU.UTF-8' >lc_time =3D 'ru_RU.UTF-8' >default_text_search_config =3D 'pg_catalog.russian' > >System: > ># Controls the maximum shared segment size, in bytes >kernel.shmmax =3D 53287555072 > ># Controls the maximum number of shared memory segments, in pages >kernel.shmall =3D 13009657 > ># Maximum number of file-handles >fs.file-max =3D 65535 > ># pdflush tuning to prevent lag spikes >vm.dirty_ratio =3D 10 >vm.dirty_background_ratio =3D 1 >vm.dirty_expire_centisecs =3D 499 > ># Prevent the scheduler breakdown >kernel.sched_migration_cost =3D 5000000 > ># Turned off to provide more CPU to PostgreSQL >kernel.sched_autogroup_enabled =3D 0 > ># Setup hugepages >vm.hugetlb_shm_group =3D 26 >vm.hugepages_treat_as_movable =3D 0 >vm.nr_overcommit_hugepages =3D 512 > ># The Huge Page Size is 2048kB, so for 35GB shared buffers the number >is 17920 >vm.nr_hugepages =3D 17920 > ># Turn off the NUMA local pages reclaim as it leads to wrong caching >strategy for databases >vm.zone_reclaim_mode =3D 0 > >Environment: > >HUGETLB_SHM=3Dyes >LD_PRELOAD=3D'/usr/lib64/libhugetlbfs.so' >export HUGETLB_SHM LD_PRELOAD > >When it is stopped: > >2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: >restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction >log file(s) added, 0 removed, 555 recycled; write=3D3237.402 s, >sync=3D0.071 s, total=3D3237.507 s; sync files=3D2673, longest=3D0.008 s= , >average=3D0.000 s >2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: recovery >restart point at 2538/6E154AC0 >2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] DETAIL: last >completed transaction was at log time 2013-03-26 11:50:31.613948+04 >2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: >restartpoint starting: xlog >2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] WARNING: >page 451 of relation base/16436/2686702648 is uninitialized >2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, >lastBlockVacuumed 0 >2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] PANIC: WAL >contains references to invalid pages >2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, >lastBlockVacuumed 0 >2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: startup >process (PID 3773) was terminated by signal 6: Aborted >2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: >terminating any other active server processes > >From /var/log/messages: > >Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure. >order:8, mode:0xd0 >Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted >2.6.32-279.22.1.el6.x86_64 #1 >Mar 26 10:50:52 tms2 kernel: : Call Trace: >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8112343f>] ? >__alloc_pages_nodemask+0x77f/0x940 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115d3e2>] ? >kmem_getpages+0x62/0x170 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dffa>] ? >fallback_alloc+0x1ba/0x270 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115da4f>] ? >cache_grow+0x2cf/0x320 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dd79>] ? >____cache_alloc_node+0x99/0x160 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? >dma_pin_iovec_pages+0xb5/0x230 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115eb49>] ? >__kmalloc+0x189/0x220 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? >dma_pin_iovec_pages+0xb5/0x230 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8141a47c>] ? >lock_sock_nested+0xac/0xc0 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8146edaa>] ? >tcp_recvmsg+0x4ca/0xe80 >Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029a67e>] ? >xfs_vm_write_end+0x2e/0x60 [xfs] >Mar 26 10:50:52 tms2 kernel: : [<ffffffffa0292f39>] ? >xfs_trans_unlocked_item+0x39/0x60 [xfs] >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8148f90a>] ? >inet_recvmsg+0x5a/0x90 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418b93>] ? >sock_recvmsg+0x133/0x160 >Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029f542>] ? >xfs_rw_iunlock+0x32/0x40 [xfs] >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81090be0>] ? >autoremove_wake_function+0x0/0x40 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81173fff>] ? >__dentry_open+0x23f/0x360 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff811764ca>] ? >do_sync_write+0xfa/0x140 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff811875d0>] ? >do_filp_open+0x780/0xd60 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418d0e>] ? >sys_recvfrom+0xee/0x180 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81183105>] ? putname+0x35/0x50 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff81176842>] ? >vfs_write+0x132/0x1a0 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff810d3f47>] ? >audit_syscall_entry+0x1d7/0x200 >Mar 26 10:50:52 tms2 kernel: : [<ffffffff8100b072>] ? >system_call_fastpath+0x16/0x1b > >After restart: > >2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: database >system was interrupted while in recovery at log time 2013-03-26 >10:08:37 MSK >2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] HINT: If this >has occurred more than once some data might be corrupted and you might >need to choose an earlier recovery target. >2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: entering >standby mode >2013-03-27 10:43:01 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: redo >starts at 2538/6E154AC0 >2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file >"pg_subtrans/10B7" doesn't exist, reading as zeroes >2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo xid assignment xtop 280482186: subxacts: 280483328 280483333 >280483343 280483354 280483356 280483368 280483377 280483382 280483392 >280483404 280483416 280483429 280483440 280483451 280483460 280483472 >280483487 280483500 280483516 280483530 280483541 280483555 280483565 >280483574 280483585 280483595 280483604 280483607 280483617 280483626 >280483636 280483646 280483656 280483665 280483677 280483688 280483699 >280483709 280483719 280483730 280483739 280483749 280483759 280483761 >280483771 280483782 280483799 280483800 280483811 280483821 280483824 >280483836 280483847 280483859 280483871 280483874 280483883 280483897 >280483906 280483915 280483925 280483937 280483948 280483958 >2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file >"pg_subtrans/10B7" doesn't exist, reading as zeroes >2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo xid assignment xtop 280482270: subxacts: 280485056 280485070 >280485083 280485086 280485098 280485113 280485132 280485144 280485156 >280485167 280485178 280485188 280485201 280485217 280485234 280485249 >280485267 280485293 280485309 280485327 280485333 280485345 280485353 >280485373 280485388 280485405 280485420 280485434 280485457 280485476 >280485482 280485507 280485516 280485531 280485537 280485550 280485565 >280485568 280485585 280485587 280485601 280485613 280485634 280485639 >280485656 280485669 280485684 280485690 280485693 280485712 280485730 >280485754 280485757 280485779 280485801 280485808 280485811 280485830 >280485856 280485880 280485900 280485920 280485941 280485946 > >[ skipped several more messages of this kind] > >2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file >"pg_subtrans/10B8" doesn't exist, reading as zeroes >2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo xid assignment xtop 280549936: subxacts: 280555514 280555515 >280555516 280555517 280555518 280555519 280555520 280555521 280555522 >280555523 280555524 280555525 280555526 280555527 280555528 280555529 >280555530 280555531 280555532 280555533 280555534 280555535 280555536 >280555537 280555538 280555539 280555540 280555541 280555542 280555543 >280555544 280555545 280555546 280555547 280555548 280555549 280555550 >280555551 280555552 280555553 280555554 280555555 280555556 280555557 >280555558 280555559 280555560 280555561 280555562 280555563 280555564 >280555565 280555566 280555567 280555568 280555569 280555570 280555571 >280555572 280555573 280555574 280555575 280555576 280555577 >2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] FATAL: >could not access status of transaction 280557568 >2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] DETAIL: >Could not read from file "pg_subtrans/10B8" at offset 253952: Success. >2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: >xlog redo xid assignment xtop 280555981: subxacts: 280557520 280557521 >280557522 280557523 280557524 280557525 280557526 280557527 280557528 >280557529 280557530 280557531 280557532 280557533 280557534 280557535 >280557536 280557537 280557538 280557539 280557540 280557541 280557542 >280557543 280557544 280557545 280557546 280557547 280557548 280557549 >280557550 280557551 280557552 280557553 280557554 280557555 280557556 >280557557 280557558 280557559 280557560 280557561 280557562 280557563 >280557564 280557565 280557566 280557567 280557568 280557569 280557570 >280557571 280557572 280557573 280557574 280557575 280557576 280557577 >280557578 280557579 280557580 280557581 280557582 280557583 >2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: startup >process (PID 1535) exited with exit code 1 >2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: >terminating any other active server processes > >Thank you in advance. > >-- >Kind regards, >Sergey Konoplev >Database and Software Consultant > >Profile: http://www.linkedin.com/in/grayhemp >Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) >888-1979 >Skype: gray-hemp >Jabber: gray.ru@gmail.com ---=20 Please excuse brevity and formatting - I am writing this on my mobile pho= ne. ---=20 Please excuse brevity and formatting - I am writing this on my mobile pho= ne.
pgsql-bugs by date: