could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes - Mailing list pgsql-bugs
From | Антон Степаненко |
---|---|
Subject | could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes |
Date | |
Msg-id | 1208551308255001@web137.yandex.ru Whole thread Raw |
Responses |
Re: could not read block XXXXX in file
"base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
|
List | pgsql-bugs |
Greetings. First and foremost - sorry for my bad english. I have PostgreSQL 9.0.4 installation using streaming replication, 1 master and 10 replicas. I've migrated to it about a monthago, from 8.3 and slony. Built-in replication is wonderful, slony sucks =), but I have some performance issues. In orderto manage with them I've tried lots of different things. One of them is rising shared_burffers from 8Gb to 12Gb (I have24Gb at all). After postgres restart he crahed in about 3 hours with messages: [2301-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes [2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement] [2390-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 2242 in file "base/17931/18984": read only 160 of 8192 bytes [2394-2] 2011-06-16 17:40:26 UTC STATEMENT:[some statement] [2302-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes [2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement] [2391-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address [2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement] [2392-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address [2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement] [2393-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address [2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement] [2394-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 25578 in file "base/17931/686571": Bad address [2394-2] 2011-06-16 17:40:27 UTC STATEMENT: [some statement] [4-1] 2011-06-16 17:40:27 UTC LOG: startup process (PID 15292) was terminated by signal 7: Bus error [5-1] 2011-06-16 17:40:27 UTC LOG: terminating any other active server processes [858-1] 2011-06-16 17:40:27 UTC WARNING: terminating connection because of crash of another server process [858-2] 2011-06-16 17:40:27 UTC DETAIL: The postmaster has commanded this server process to roll back the current transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. [2429-1] 2011-06-16 17:40:27 UTC WARNING: terminating connection because of crash of another server process [858-3] 2011-06-16 17:40:27 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command. Despite "In a moment you should be able to reconnect to the database and repeat your command" postgres did nor restart byitself. Signal 7 means hardware problems. But all 10 replicas crashed within 10 minutes, say from 13:35 to 13:45. And when sharedbuffers were set to 8Gb I hadn't experienced such troubles. I checked /proc/sys/kernel/shmmax - 16Gb, /proc/sys/kernel/shmall - 4194304. I tried to set vm.overcommit_memory=2 and vm.overcommit_ratio=90 - no sense. Then I tried vm.overcommit_memory=1 - no sense. Here are vmstat logs (vmstat 10 > vmstat.log): procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----r b swpd free buff cache si so bi bo in cs us sy id wa0 0 0 5308644 0 0 0 0 508 133 0 743 2 1 97 00 0 05279436 0 0 0 0 620 307 0 1016 3 0 97 01 0 0 5266384 0 0 0 0 242 223 0 1104 8 1 92 01 0 0 5260544 0 0 0 0 359 287 0 959 2 0 98 00 0 0 5245764 0 0 0 0 877 311 0 891 3 0 97 00 0 0 24378380 0 0 0 0 454 314 0 1011 2 7 91 00 0 0 24378380 0 0 0 0 0 69 0 118 0 0 100 00 0 0 24378380 0 0 0 0 0 2 0 99 0 0 100 00 0 0 24378380 0 0 0 0 0 1 0 180 0 0 100 00 0 0 24378024 0 0 0 0 0 1 0 116 0 0 100 00 0 0 24378024 0 0 0 0 0 342 0 126 0 0 100 0 So you can see that postgres dies when there are about 5Gb of free memory. In one of my experiments postgres worked for about30 minutes with 0kb free memory. So I think problem is not memory lack. I thought about bad indexes and reindex. In messages like "could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only160 of 8192 bytes" YYYYY is my database and ZZZZZ somehow relates to a table or to an index. It may be relfilenode ofan index or oid of a table. And indexes and tables can be of any type, there are no regularity. I haven't tried to reindex,because after all 10 replicas went down. I made all experiments on only one of them, while reindexing will affectall installation, which is in production and under high load. One important thing - all replicas and master are running on openvz ubuntu lucid 2.6.32, and there is no way to reject virtualization(it is a long story =)) Please, I do not want to discuss my decision to set buffers to 12Gb and postgresql optimization at all. I just want to undestandwhy I'm getting such errors. I will appreciate any help, thank you in advance.
pgsql-bugs by date: