Thread: BUG #14171: Wrong FSM file after switching hot standby to master
VGhlIGZvbGxvd2luZyBidWcgaGFzIGJlZW4gbG9nZ2VkIG9uIHRoZSB3ZWJz aXRlOgoKQnVnIHJlZmVyZW5jZTogICAgICAxNDE3MQpMb2dnZWQgYnk6ICAg ICAgICAgIFRpbW9mZWkgRHluaWtvdgpFbWFpbCBhZGRyZXNzOiAgICAgIHRp bW9mZWlkQG91dGxvb2suY29tClBvc3RncmVTUUwgdmVyc2lvbjogOS40LjQK T3BlcmF0aW5nIHN5c3RlbTogICBSSEVMIDYuNgpEZXNjcmlwdGlvbjogICAg ICAgIAoKSGkNCg0KV2UgaGF2ZSBhbiBpbnN0YWxsYXRpb24gb2YgUG9zdGdy ZXMgOS40LjQoUG9zdGdyZVNRTCA5LjQuNCBvbgp4ODZfNjQtdW5rbm93bi1s aW51eC1nbnUsIGNvbXBpbGVkIGJ5IGdjYyAoR0NDKSA0LjQuNyAyMDEyMDMx MyAoUmVkIEhhdAo0LjQuNy0xMSksIDY0LWJpdCkgb24gUkhFTCA2LjYuIERC IGluc3RhbGxlZCBvbiAyIG5vZGVzLCAxIG5vZGUgaXMgbWFzdGVyLAphbm90 aGVyIG5vZGUgaXMgaG90IHN0YW5kYnkoc3RyZWFtaW5nIHJlcGxpY2F0aW9u KS4gREIgaXMgbW9uaXRvcmVkIGJ5CnBhY9C1bWFrZXIgcGdzcWwgYWdlbnQu DQpTb21ldGltZXMgd2UgaGF2ZSB0cm91YmxlcyB3aXRoIGZzbS1maWxlcy4g SW4gY2FzZTogDQog4oCiIG1hc3RlciBpbnN0YW5jZSBpcyBzd2l0Y2hpbmcg dG8gYW5vdGhlciBub2RlKGZhaWxvdmVyIG9yIHN3aXRjaG92ZXIpIG9uCmhp Z2hsb2FkDQog4oCiIEhvdCBzdGFuZGJ5IG5vZGUgcmVzdGFydCBhbmQgcnVu IGFzIG1hc3RlciBzdWNjZXNmdWxseS4NCiDigKIgQWZ0ZXIgdGhhdCB3ZSBz b21ldGltZXMgZ2V0IEZTTSBmaWxlcyBwb2ludGluZyB0byBub24tZXhpc3Rl bnQgYmxvY2tzIGluCnRoZSB0YWJsZSwgc28gc3Vic2VxdWVudCBpbnNlcnQg b3BlcmF0aW9ucyBvbiBzdWNoIHRhYmxlcyBmYWlscyB3aXRoIGVycm9yCm1l c3NhZ2UgbGlrZSBmb2xsb3dpbmc6ICdjb3VsZCBub3QgcmVhZCBibG9jayBY WCBpbiBmaWxlICJiYXNlL1lZWVkvWlpaWloiJy4KVGhlIGlzc3VlIGNhbiBi ZSByZXNvbHZlZCBieSBlaXRoZXIgZGVsZXRpbmcgb2Ygd3JvbmcgRlNNIGZp bGUgKHdoaWxlCmRhdGFiYXNlIGlzIHN0b3BwZWQpIG9yIHBlcmZvcm1pbmcg VkFDVVVNIEZVTEwgb24gZXJyb25lb3VzIHRhYmxlLiBUaGUKcHJvYmxlbSBp cyB1c3VhbGx5IG9ic2VydmVkIG9uIHJlbGF0aXZlbHkgc21hbGwgdGFibGVz IChlLmcuIHVwIHRvIDMwCmJsb2Nrcykgd2hpY2ggYXJlIG9mdGVuIGNsZWFu ZWQgb3V0IChoYXZpbmcgbW9zdCByb3dzIGRlbGV0ZWQpLg0KRG9lcyBhbnli b2R5IGFscmVhZHkgZmFjZWQgc3VjaCBiZWhhdmlvcj8gV2hhdCBjYW4gYmUg dGhlIHJvb3QgY2F1c2Ugb2Ygc3VjaApwcm9ibGVtcz8gQXJlIHRoZXJlIGFu eSByZWNvbW1lbmRhdGlvbnMgb24gaG93IHRvIGF2b2lkIHRoZW0/Cgo=
On Wed, Jun 1, 2016 at 10:48 PM, <timofeid@outlook.com> wrote: > We have an installation of Postgres 9.4.4(PostgreSQL 9.4.4 on > x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat > 4.4.7-11), 64-bit) on RHEL 6.6. DB installed on 2 nodes, 1 node is master= , > another node is hot standby(streaming replication). DB is monitored by > pac=D0=B5maker pgsql agent. You surely want to update to 9.4.8 first. You are missing many bug fixes. > Sometimes we have troubles with fsm-files. In case: > =E2=80=A2 master instance is switching to another node(failover or switc= hover) on > highload > =E2=80=A2 Hot standby node restart and run as master succesfully. > =E2=80=A2 After that we sometimes get FSM files pointing to non-existent= blocks in > the table, so subsequent insert operations on such tables fails with erro= r > message like following: 'could not read block XX in file "base/YYYY/ZZZZZ= "'. > The issue can be resolved by either deleting of wrong FSM file (while > database is stopped) or performing VACUUM FULL on erroneous table. The > problem is usually observed on relatively small tables (e.g. up to 30 > blocks) which are often cleaned out (having most rows deleted). > Does anybody already faced such behavior? What can be the root cause of s= uch > problems? Are there any recommendations on how to avoid them? Andres, do you think that c6ff84b0 can help here? Those symptoms look rather similar to some missing invalidation messages on the standby. --=20 Michael
On 2016-06-02 16:42:35 +0900, Michael Paquier wrote: > > Sometimes we have troubles with fsm-files. In case: > > ⢠master instance is switching to another node(failover or switchover) on > > highload > > ⢠Hot standby node restart and run as master succesfully. > > ⢠After that we sometimes get FSM files pointing to non-existent blocks in > > the table, so subsequent insert operations on such tables fails with error > > message like following: 'could not read block XX in file "base/YYYY/ZZZZZ"'. > > The issue can be resolved by either deleting of wrong FSM file (while > > database is stopped) or performing VACUUM FULL on erroneous table. The > > problem is usually observed on relatively small tables (e.g. up to 30 > > blocks) which are often cleaned out (having most rows deleted). > > Does anybody already faced such behavior? What can be the root cause of such > > problems? Are there any recommendations on how to avoid them? > > Andres, do you think that c6ff84b0 can help here? Those symptoms look > rather similar to some missing invalidation messages on the standby. If there was a restart involved, it seems unlikely that that'll be relevant. Timofei, do I understand correctly that the problem persists across restarts? Regards, Andres
On Fri, Jun 3, 2016 at 7:09 PM, Timofei Dynikov <timofeid@outlook.com> wrote: >> Date: Thu, 2 Jun 2016 07:42:32 -0700 andres@anarazel.de wrote: >> If there was a restart involved, it seems unlikely that that'll be >> relevant. Timofei, do I understand correctly that the problem persists >> across restarts? > > Yes, problem persists across restarts. We can resolve problem only by > performing VACUUM FULL or delete inconsistent FSM file. pacemaker removes recovery.conf and then restarts the node at failover, so the node moves on with a crash recovery on the same timeline in this case. I recall seeing cases where a relation file was truncated when crash recovery began in 9.4.4, that got fixed in 9.4.5. The environment where this happened made it hard to compile to reproduce it but I somewhat diagnosed this as being a side effect of be25a08, that e118555 fixed afterwards, at least I did not see anything else that could have been the origin of the problem between 9.4.4 and 9.4.5. The problem was in the same way happening on a small table, one that had no more than 5 tuples, and those were removed quite frequently to the table was most of the time empty, however when crash recovery began it had some records. Could you update to at least 9.4.5 and see if the problem goes away? We may as well have another problem hidden here.. -- Michael
> Date: Thu=2C 2 Jun 2016 07:42:32 -0700 > From: andres@anarazel.de > To: michael.paquier@gmail.com > CC: timofeid@outlook.com=3B pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #14171: Wrong FSM file after switching hot standb= y to master >=20 > On 2016-06-02 16:42:35 +0900=2C Michael Paquier wrote: > > > Sometimes we have troubles with fsm-files. In case: > > > =95 master instance is switching to another node(failover or switcho= ver) on > > > highload > > > =95 Hot standby node restart and run as master succesfully. > > > =95 After that we sometimes get FSM files pointing to non-existent b= locks in > > > the table=2C so subsequent insert operations on such tables fails wit= h error > > > message like following: 'could not read block XX in file "base/YYYY/Z= ZZZZ"'. > > > The issue can be resolved by either deleting of wrong FSM file (while > > > database is stopped) or performing VACUUM FULL on erroneous table. Th= e > > > problem is usually observed on relatively small tables (e.g. up to 30 > > > blocks) which are often cleaned out (having most rows deleted). > > > Does anybody already faced such behavior? What can be the root cause = of such > > > problems? Are there any recommendations on how to avoid them? > >=20 > > Andres=2C do you think that c6ff84b0 can help here? Those symptoms look > > rather similar to some missing invalidation messages on the standby. >=20 > If there was a restart involved=2C it seems unlikely that that'll be > relevant. Timofei=2C do I understand correctly that the problem persists > across restarts? >=20 > Regards=2C >=20 > Andres Yes=2C problem persists across restarts. We can resolve problem only by per= forming VACUUM FULL or delete inconsistent FSM file. Regards=2CTimofei Dynikov=20 =